Tag Archives: amazon

9 Things I learnt while moving data from RedShift into AWS Elastic Search with AWS Lambda

The amazon infrastructure is amazing and allows for interesting and cool scaling without the use of servers. It’s exciting to see what can be done. The trick with much of this is that many of the elements are asynchronous and so it can be easy to flood services, particularly when pulling data out of your RedShift data warehouse and putting it into Elastic Search. I’ve learnt a bunch of things while doing this, the salient points are below.

  1. Don’t gzip the data unloaded.
  2. Use the bulk load on elastic
  3. Use a large number of records in the bulk load (>5000) – fewer large bulk loads are better than more smaller ones. When working with AWS elastic search there is a risk of hitting the limits of the bulk queue size.
  4. Process a single file in the lambda and then recursively call the lambda function with an event
  5. Before recursing wait for a couple of seconds –> setTimeout.
  6. When waiting make sure that you aren’t idle for 30 seconds because your lambda will stop.
  7. Don’t use s3 object creation to trigger your lambda — you’ll end up with multiple lambda functions being called at the same time.
  8. Don’t bother trying to put kinesis in the middle – unloading your data into kinesis is almost certain to hit load limits in kinesis.
  9. Monitor your elastic search bulk queue size with something like this:
    curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool |jq ‘.nodes |to_entries[].value.thread_pool.bulk’

1 Unloading from RedShift

The process of doing the gunzip in the lambda takes time + resources in the lambda function. Avoid this by just storing the CSV in s3 and then streaming it out with S3.getObject(params).createReadStream().

Here is an unload function that works well for me.

credentials 'aws_access_key_id=%AWS_KEY_ID%;aws_secret_access_key=%AWS_ACCESS_KEY%'

2 Use the bulk load in elastic

The elastic bulk load operation is your friend. Don’t index each record one at a time and consume lots of resources, instead send up batches at the same time using the bulk operation.

3 Use a large number of records in the bulk load

More than 5000 records at a time in the bulk load is important to do. Fewer big loads is better than more small ones. See https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html#_how_big_is_too_big for setting the number and size of this.

4 Process a single file in each lambda function

To ensure that you don’t consume too many resources, process a single file in each lambda function then recurse using

    FunctionName: context.invokedFunctionArn,
    InvocationType: 'Event',
    Payload: JSON.stringify(payload)

Use either the promise version or callback version as preferred. Keep track of where you are in the payload.

5 Wait before recursing

Before doing the callback above wait a couple of seconds to give elastic a chance to catch up.
setTimeout(function(){recurseFunction(event, context, callback)}, 2000);

6 Keep the wait short

If you don’t do anything for 30 seconds Lambda will timeout. Keep the wait short. 2 seconds (as chosen above) wasn’t completely arbitrary.

7 Don’t use s3 object creation to trigger your lambda

One of the things we are seeing consistently is trying to control the rate of data flowing into elastic. Using the s3 object creation triggers for the lambda will result in multiple concurrent calls to your lambda function. This will result in too much at the same time. Trigger the lambda some other way.

8 Kinesis isn’t the answer to this problem

Putting the records to index into kinesis will not act as a good way to control the massive flow of data from redshift to elastic. While kinesis is great for controlling streams of data over time, it’s not really the right component for this scenario of loading lots of records at once. The approach outlined throughout this document is suitable.

9 Monitor your elastic resources with curl and jq

Unix commandline tools rock.

curl and jq are great tools for working with http data. curl for getting data, jq for processing json data.(https://stedolan.github.io/jq/)

elastic provides json apis for seeing the data. The below command is how to look up the information on the bulk queue size.

curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool |jq '.nodes |to_entries[].value.thread_pool.bulk'


Serverless + the AWS stack is nice — you need to think about how to use it and knowing the tools + capabilities of the platform is important — with care you can do amazing things. Go build some great stuff.