
9 Things I learnt while moving data from Redshift into AWS Elasticsearch with AWS Lambda

The Amazon infrastructure is amazing and allows for interesting and cool scaling without the use of servers. It’s exciting to see what can be done. The catch is that many of the pieces are asynchronous, so it’s easy to flood services, particularly when pulling data out of your Redshift data warehouse and putting it into Elasticsearch. I learnt a bunch of things while doing this; the salient points are below.

  1. Don’t gzip the data you unload.
  2. Use the bulk load on Elasticsearch.
  3. Use a large number of records in each bulk load (>5000); fewer large bulk loads are better than many smaller ones. When working with AWS Elasticsearch there is a risk of hitting the limits of the bulk queue size.
  4. Process a single file per Lambda invocation, then recursively call the Lambda function with an event.
  5. Before recursing, wait a couple of seconds with setTimeout.
  6. When waiting, make sure you aren’t idle for 30 seconds, or your Lambda will time out.
  7. Don’t use S3 object creation to trigger your Lambda; you’ll end up with multiple Lambda functions being called at the same time.
  8. Don’t bother trying to put Kinesis in the middle; unloading your data into Kinesis is almost certain to hit Kinesis load limits.
  9. Monitor your Elasticsearch bulk queue size with something like this:
    curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool | jq '.nodes | to_entries[].value.thread_pool.bulk'

1 Unloading from Redshift

Gunzipping in the Lambda takes time and resources in the function. Avoid this by storing the plain CSV in S3 and then streaming it out with S3.getObject(params).createReadStream().
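
For illustration, here is a minimal sketch of the streaming approach (the bucket, key and line handler are hypothetical; in practice they come in on the Lambda event):

var AWS = require('aws-sdk');
var readline = require('readline');

var s3 = new AWS.S3();

// Hypothetical unload location.
var params = { Bucket: 'my-unload-bucket', Key: 'folder/0000_part_00' };

// Stream the CSV straight out of S3, line by line, without buffering the file.
var rl = readline.createInterface({
    input: s3.getObject(params).createReadStream()
});

rl.on('line', function (line) {
    // Parse the CSV line and add it to the current bulk batch here.
});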

Here is an unload statement that works well for me.

UNLOAD ('%SOME_AMAZING_QUERY_FROM_A_BIG_TABLE%')
TO 's3://%BUCKET%/%FOLDER%'
credentials 'aws_access_key_id=%AWS_KEY_ID%;aws_secret_access_key=%AWS_ACCESS_KEY%'
DELIMITER AS ',' NULL AS '' ESCAPE ADDQUOTES;

2 Use the bulk load in Elasticsearch

The Elasticsearch bulk operation is your friend. Don’t index each record one at a time and consume resources on every request; instead, send up batches of records in a single bulk call, as sketched below.
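
As a rough sketch, the bulk body is newline-delimited JSON, alternating an action line and a document line, with a trailing newline at the end (the index and type names here are made up):

// records is assumed to hold the documents parsed from the CSV.
var records = [{ id: 1, name: 'example' }];

// Build the bulk body: one action line, then one document line, per record.
var body = records.map(function (doc) {
    return JSON.stringify({ index: { _index: 'my-index', _type: 'my-type' } }) +
        '\n' + JSON.stringify(doc);
}).join('\n') + '\n';

// POST body to https://%ES-SERVER:PORT%/_bulk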

3 Use a large number of records in the bulk load

Use more than 5000 records per bulk load; fewer big loads are better than many small ones. See https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html#_how_big_is_too_big for guidance on tuning the number and size of these batches.

4 Process a single file in each lambda function

To ensure that you don’t consume too many resources, process a single file in each Lambda invocation and then recurse using something like this:

var AWS = require('aws-sdk');
var Lambda = new AWS.Lambda();

// Asynchronously invoke this same function with the next piece of work.
Lambda.invoke({
    FunctionName: context.invokedFunctionArn,
    InvocationType: 'Event',
    Payload: JSON.stringify(payload)
}).promise();

Use either the promise version or the callback version as preferred, and keep track of where you are in the payload.
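
For example, the payload could look something like this (the field names are just an illustration):

// A hypothetical shape for the recursion payload: the file list plus a cursor.
var payload = {
    files: ['folder/0000_part_00', 'folder/0001_part_00'],
    nextIndex: 1
};

// Stop recursing once nextIndex reaches files.length.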

5 Wait before recursing

Before recursing as above, wait a couple of seconds to give Elasticsearch a chance to catch up.
setTimeout(function(){recurseFunction(event, context, callback)}, 2000);

6 Keep the wait short

If you sit idle for 30 seconds, your Lambda will hit its timeout. Keep the wait short; the 2 seconds chosen above wasn’t completely arbitrary.
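
If you want an extra guard, the Lambda context object exposes the remaining execution time, so you can check it before waiting (a sketch; recurseFunction is the recursion from above):

// Only wait if there is comfortably more time left than the wait itself.
if (context.getRemainingTimeInMillis() > 5000) {
    setTimeout(function () {
        recurseFunction(event, context, callback);
    }, 2000);
}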

7 Don’t use S3 object creation to trigger your Lambda

The consistent theme here is controlling the rate at which data flows into Elasticsearch. Using S3 object-creation triggers results in multiple concurrent invocations of your Lambda function, which means too much data arrives at once. Trigger the Lambda some other way.

8 Kinesis isn’t the answer to this problem

Putting the records to index into Kinesis will not give you a good way to control the massive flow of data from Redshift to Elasticsearch. While Kinesis is great for managing streams of data over time, it’s not the right component for loading lots of records at once; unloading straight into it is likely to hit its write limits. The approach outlined throughout this post is more suitable.

9 Monitor your Elasticsearch resources with curl and jq

Unix command-line tools rock.

curl and jq are great tools for working with HTTP data: curl for fetching it, and jq (https://stedolan.github.io/jq/) for processing the JSON.

Elasticsearch provides JSON APIs for inspecting its state. The command below looks up the bulk thread-pool statistics, including the queue size.

curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool |jq '.nodes |to_entries[].value.thread_pool.bulk'

Conclusion

Serverless + the AWS stack is nice. You need to think about how to use it, and knowing the tools + capabilities of the platform is important; with care you can do amazing things. Go build some great stuff.

Notes From Yehuda Katz’s visit to Brisbane

Yehuda Katz (http://yehudakatz.com/) paid a brief visit to Brisbane, Australia, giving a public presentation and attending a more private breakfast meeting. In this blog post I’ll go over some of the things that struck me as particularly interesting or worth thinking about.

For those of you who don’t know Yehuda, he is an opinionated and very active developer. He is a member of the jQuery and Ruby on Rails core teams, and one of the founders of Ember.js (a framework for building rich MVC JavaScript applications – http://emberjs.com).

In the public talk Yehuda went through Ember.js, explaining the paradigm behind the framework and walking through a demo of some of its key features. It looks like an interesting option for JavaScript applications. It’s on the brink of going 1.0, but already has some high-profile applications built on it. Apart from seeing the interesting elements of Ember and how it’s used, it was very interesting to see the way people are using Ember.js with D3: http://corner.squareup.com/2012/04/building-analytics.html

I’m definitely going to keep my eye on the framework as it moves forward.

In his spare time Yehuda is working to push the future of the web in ways that help facilitate rich applications. He is doing this in a couple of ways:

  1. he is a member of the W3C TAG
  2. he is working to influence members of the Chrome team to build things well.

W3C TAG

The Technical Architecture Group works to specify and guide the architecture of the web moving forward. It has the feel of a good internal architecture group in an organisation, filled with smart people trying to make the web better (its membership includes Tim Berners-Lee and representatives of the community, large organisations and browser vendors).

Chrome Team

Through some of the work he has done, Yehuda has had the opportunity to spend time with some of the people building new versions of Chrome, helping to guide their thinking towards APIs and decisions that work well for web developers.

By now I should have convinced you that Yehuda has some stuff worth listening to. Over the period he was here I had some good opportunities to listen to both his public talks and some of the more informal conversations. Here are some of the things I found particularly interesting about where he sees the web heading.

WebRTC looks like a cool technology for real-time communications (http://www.webrtc.org/). The support for peer connections looks particularly interesting.

The new world being demonstrated by Google Polymer (http://www.polymer-project.org/) looks very exciting, and is well worth a look for web developers who want an idea of the way they will be writing applications in the future. Model Driven Views (http://www.polymer-project.org/platform/mdv.html) and custom elements (http://www.polymer-project.org/platform/custom-elements.html) are extremely exciting, and the shadow DOM (http://www.polymer-project.org/platform/shadow-dom.html) looks like a good tool for supporting and customising the new features being brought in. HTML + CSS is currently the language of the web, with many people speaking it, and with these tools I think the language is moving in good directions.

The mechanisms for doing asynchronous JavaScript have been moving on from the straight callback approach that has been familiar to people, particularly through the use of node. There has been much discussion of promises and futures around the web, with things heading towards promises. Martin Fowler has an article describing JavaScript promises (http://martinfowler.com/bliki/JavascriptPromise.html), which is where the W3C TAG is currently headed (http://infrequently.org/2013/06/sfuturepromiseg/). I look forward to having this come into play, and having a standard option that avoids the deep nesting that callbacks can produce.
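
As a small illustration of the difference (getUser, getOrders, render and handle are invented):

// Callback style: each step nests inside the previous one.
getUser(id, function (err, user) {
    if (err) { return handle(err); }
    getOrders(user, function (err, orders) {
        if (err) { return handle(err); }
        render(orders);
    });
});

// Promise style: a flat chain with a single error handler.
getUser(id)
    .then(getOrders)
    .then(render)
    .catch(handle);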

It was interesting hearing Yehuda’s perspective on computer science and functional topics like Monads and functional reactive programming. The binding approach used in Ember.js takes inspiration from FRP, and Promises allow a transformation to a monadic approach.

One of the interesting new things coming to JavaScript in browsers is Object.observe, a feature which will make it possible to observe any object for modifications to it or its attributes.
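
A minimal sketch of the proposed API (the model object here is invented, and the exact shape of the change records may shift as the proposal evolves):

var model = { count: 0 };

// The observer receives an array of change records when the object mutates.
Object.observe(model, function (changes) {
    changes.forEach(function (change) {
        console.log(change.type, change.name, change.oldValue);
    });
});

model.count = 1; // asynchronously logs something like: updated count 0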

All in all there is a bunch of interesting stuff in the web’s future. It’s a great time to be doing web development, and I look forward to what the future holds.