Converting Elasticsearch and/or queries to bool must/should queries

One of the significant changes in moving from Elasticsearch 2 to 5 is the removal of the deprecated ‘and’ and ‘or’ queries from the DSL, replacing them with nested bool must and should queries. Performing this migration is a developer task that probably could be automated. In the absence of automation, here are the steps I’ve followed:

  1. replace and: with a nested bool: must:
  2. replace or: with a nested bool: should:

In practice this will take a query that might be something like:

{
  "index": "people",
  "body": {
    "query": {
      "and": [
        {
          "or": [
            {"match": {"first_name": "Fred"}},
            {"match": {"last_name": "Fred"}}
          ]
        },
        {"match": {"last_name": "Fred"}}
      ]
    }
  }
}

and change it to:

{
  "index": "people",
  "body": {
    "query": {
      "bool": {
        "must": [
          {
            "bool": {
              "should": [
                {"match": {"first_name": "Fred"}},
                {"match": {"last_name": "Fred"}}
              ]
            }
          },
          {"match": {"last_name": "Fred"}}
        ]
      }
    }
  }
}
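The rewrite is mechanical, so the automation mentioned above is quite plausible. Here is a minimal sketch of what it might look like in JavaScript; the function name is mine, and it assumes the query is a plain nested object like the examples here.

const convertQuery = (node) => {
  // Convert each clause of an array in place.
  if (Array.isArray(node)) return node.map(convertQuery);
  // Leave primitives untouched.
  if (node === null || typeof node !== 'object') return node;

  const result = {};
  Object.keys(node).forEach((key) => {
    if (key === 'and') {
      // "and" becomes a nested bool must.
      result.bool = Object.assign(result.bool || {}, { must: convertQuery(node[key]) });
    } else if (key === 'or') {
      // "or" becomes a nested bool should.
      result.bool = Object.assign(result.bool || {}, { should: convertQuery(node[key]) });
    } else {
      result[key] = convertQuery(node[key]);
    }
  });
  return result;
};

Running this over the first query above should produce the bool version that follows it.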

Personally I like the old version 2 syntax, but there is little choice in the matter if you want to upgrade to the latest and greatest Elasticsearch, which does include many good new features.

Accessing RDS from AWS Lambda

Update: Prefer const over let where possible.

Serverless architectures are justifiably focused on avoiding server-focused, non-scalable resources as far as possible down the stack. In the Amazon world this typically translates to using stores like DynamoDB for data. In some instances it still makes sense to have a relational database in a serverless stack, particularly for smaller, more internal applications that won’t need to scale up to millions of concurrent users, where serverless is being used as much to keep a consistent architecture and to keep prices down as to handle huge scaling events.

There are a number of considerations with this kind of system, and the details of how to do it are scattered across Stack Overflow and the AWS documentation. In this short article we pull everything together into one place and document how to do this in a way that is secure and performant.

While many of the principles are generalisable, there will be a specific focus on using Postgres and Node on AWS Lambda, with the Lambda function set up to be callable via API Gateway.

The two considerations that we will be focusing on are:

1) The general setup and configuration to allow Lambda to talk to RDS.
2) How to configure connection pooling so that the Lambda accesses the database performantly.

General Access

  1. Ensure that the Lambda is deployed to a VPC with a security group.
  2. Configure your RDS instance to allow access from that security group (specify an inbound rule in RDS to allow access from the source security group the Lambda function has been deployed to), as sketched below.
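The inbound rule from step 2 can be added in the console, or scripted. Here is a minimal sketch using the AWS SDK for JavaScript; the security group IDs and region are placeholders, not real values.

const AWS = require('aws-sdk');
const ec2 = new AWS.EC2({ region: 'REGION' });

// Allow the Lambda's security group to reach Postgres on the RDS instance's security group.
ec2.authorizeSecurityGroupIngress({
    GroupId: 'sg-RDS-INSTANCE',                                // security group attached to the RDS instance
    IpPermissions: [{
        IpProtocol: 'tcp',
        FromPort: 5432,
        ToPort: 5432,
        UserIdGroupPairs: [{ GroupId: 'sg-LAMBDA-FUNCTION' }]  // security group the Lambda is deployed in
    }]
}).promise().then(() => console.log('inbound rule created'));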

Connection Pooling

Connection pooling is different from normal here. Remember that each Lambda instance is independent, and we are in Node.js, so each running Lambda container will serve a single process. This means that connection pools should probably only have a single connection. The purpose of the connection pool is to ensure that a warm Lambda instance will reuse the connection from the pool.

To make the connection pooling work, we need to allow the Lambda to keep the connection alive after the response has been sent (see http://docs.aws.amazon.com/lambda/latest/dg/nodejs-prog-model-context.html#nodejs-prog-model-context-properties). We use the pg-pool library to do this: https://github.com/brianc/node-pg-pool

So we set up a connection pool outside of the Lambda handler function, with a max pool size of 1, a min pool size of 0, a big idleTimeoutMillis (5 minutes) and a small connectionTimeoutMillis (1 second). Then in the handler function set context.callbackWaitsForEmptyEventLoop = false and use the connection pool as normal.

const Pool = require('pg-pool')
const pool = new Pool({
    host: 'something.REGION.rds.amazonaws.com',
    database: 'your-database',
    user: 'a_user',
    password: 'password',
    port: 5432,
    max: 1,                        // a single connection per Lambda container
    min: 0,                        // allow the pool to drain completely when idle
    idleTimeoutMillis: 300000,     // keep the idle connection around for 5 minutes
    connectionTimeoutMillis: 1000  // fail fast if a connection can't be established
});

Given that we will want to hook this into API Gateway, it makes sense to make the JSON passed into the callback look like what API Gateway will want for a seamless Lambda integration. The full example below shows this.

const Pool = require('pg-pool')
const pool = new Pool({
    host: 'something.REGION.rds.amazonaws.com',
    database: 'your-database',
    user: 'a_user',
    password: 'password',
    port: 5432,
    max: 1,
    min: 0,
    idleTimeoutMillis: 300000,
    connectionTimeoutMillis: 1000
});

module.exports.handler = (event, context, callback) => {
    // Return the response without waiting for the idle pooled connection to close.
    context.callbackWaitsForEmptyEventLoop = false;

    let client;
    pool.connect().then(c => {
        client = c;
        return client.query("select 'stuff'");
    }).then(res => {
        // Put the connection back in the pool so a warm container can reuse it.
        client.release();
        const response = {
            "isBase64Encoded": false,
            "statusCode": 200,
            "body": JSON.stringify(res.rows)
        };
        callback(null, response);
    }).catch(error => {
        console.log("ERROR", error);
        // Release the connection if the query failed after connecting.
        if (client) {
            client.release();
        }
        const response = {
            "isBase64Encoded": false,
            "statusCode": 500,
            "body": JSON.stringify(error)
        };

        callback(null, response);
    });
};

The code above provides a simple Lambda function that queries a database and provides the response as JSON for consumption by API Gateway. The use of a single-connection pool ensures that a warm Lambda function will reuse its connection and perform well. API Gateway then provides the tools needed to enforce rate limiting and to help ensure that the underlying resources don’t get starved. With the above setup, a performant API using API Gateway and Lambda backed by Amazon RDS can be created. Bigger picture, a truly scalable architecture would use NoSQL and DynamoDB without the constraints of a relational database.

JS sleep (setTimeout) using promises in one line of code.

Update: Prefer const over let

As mentioned before, I like JavaScript promises.

One nice trick with promises is creating a simple sleep function using the standard setTimeout.

Here’s the line of code:

const sleep = (time) => new Promise((resolve) => setTimeout(resolve, time));

Then use it as:

sleep(500).then(() => {
    // Do something after the sleep!
})

A nice simple sleep function in JavaScript.

Magic sprinkles for Capybara and PDF

One of the frequent things I end up doing these days is generating reports. When you create a start up called Report Hero, it comes with the territory. The two most common outputs that I’m generating are PDF and Word (yeah, Word, I’m sorry).

I’ve recently been going through some pain getting testing working with PDFs and Capybara. It took me way too long to get to the answers, so I’m going to document what I ended up with.

First, it’s worth noting that at the time of writing, poltergeist doesn’t support downloading files:
https://github.com/teampoltergeist/poltergeist/issues/485
http://stackoverflow.com/questions/35585994/downloading-a-csv-with-capybara-poltergeist-phantomjs

This means that you’ll need to use a different driver; capybara-webkit works well for me (https://github.com/thoughtbot/capybara-webkit). As mentioned on that GitHub project page, by adding webkit you’ll probably need to make sure that you run your tests on CI with an xvfb server.

If you are on Rails 3 (did I mention I was doing this work on a legacy project?), the next thing to consider is that, depending on your setup, you’ll need to make it possible to have two threads running at the same time so that the wkhtmltopdf wrapper gem (wicked_pdf or PDFKit) can work. So in your test.rb environment you’ll need to set config.threadsafe!

Once you’ve got PDFs being downloaded, there are a number of options for testing them. They essentially boil down to reading the content using something like the pdf-reader gem, and then performing assertions on it.

Pivotal have a decent blog post talking about this: https://content.pivotal.io/blog/how-to-test-pdfs-with-capybara, and Prawn have extracted a gem to help: https://github.com/prawnpdf/pdf-inspector. For what it’s worth, in this project I went with the Prawn pdf-inspector gem.

So, the final steps I ended up with are:

  • use the webkit driver so you can download the PDF (with appropriate CI settings)
  • (in Rails 3) set config.threadsafe! so the PDF generation can happen
  • use pdf-inspector to extract content from the PDF and perform some assertions.

With this setup you can download PDFs in your Capybara tests and make assertions about their content.

Creating a Deliverable HTML Email on AWS Lambda with SES

Update: Prefer const over let

Creating deliverable, rich HTML emails is a great goal for many web applications: communicating with your customers, and helping to send messages that are beautiful and make your marketing/design people happy.

As discussed in this SendGrid article, there are quite a number of approaches for including images in emails. The leading options today in 2017 are referencing attachments using cid, and using data URLs.

According to the comments on this Campaign Monitor blog post, the cid method is supported by all the major clients today (ironically, the post is about the cooler approach of using data URLs for images).

With this background, the question is how to do this with AWS Lambda and SES.

Thankfully it’s really straightforward.

The simple steps are:

  • create a simple HTML email that references images using cid: as the protocol.
  • create a raw RFC 822 email string that can be sent with the SES API.
  • use the ses.sendRawEmail method to send the email.

1 Create a simple HTML email

For example:

<html><body><p>Hello world</p><img src="cid:world"></body></html>

Note that the source of the image is of the format cid:world; this cid is what you specify when attaching the image to the email.

2 Create a raw RFC 822 email string

The mailcomposer package, part of Nodemailer (https://nodemailer.com/extras/mailcomposer/), provides a simple, easy to use API for creating RFC 822 emails with attachments. When creating attachments you can specify cids to refer to them by, and you can specify the contents of an attachment with a local filename, a buffer, or even an HTTP resource. It’s a great API. Take a look at the npm page to see more. One example of using this package is:

const MailComposer = require('mailcomposer');

const from = 'from@example.com';
const to = 'to@example.com';
const subject = 'Subject';
const htmlMessage = '<html><body><p>Hello world</p><img src="cid:world"></body></html>';
const mail = new MailComposer({
  from: from, to: to, subject: subject, html: htmlMessage,
  attachments: [{
    filename: 'hello-world.jpg',
    path: 'https://cdn.pixabay.com/photo/2015/10/23/10/55/business-man-1002781_960_720.jpg',
    cid: 'world'
  }]
});
mail.build(function(err, res) {console.log(res.toString())});

3 Send the email with SES

Take the buffer that you create and send it with SES.

const sesParams = {
  RawMessage: {
    Data: message
  },
};
ses.sendRawEmail(sesParams, function(err, res){console.log(err, res)});

Full example using promises

Let’s put it all together, and pull in some of the promise code that I talked about in an earlier blog post (http://www.rojotek.com/blog/2017/04/11/create-a-promise-wrapper-for-a-standand-node-callback-method/).

const AWS = require('aws-sdk');
const MailComposer = require('mailcomposer');
const ses = new AWS.SES();

function createEmail(){
  const from = 'from@example.com';
  const to = 'to@example.com';
  const subject = 'Subject';
  const htmlMessage = '<html><body><p>Hello world</p><img src="cid:world"></body></html>';
  const mail = new MailComposer({
    from: from, to: to, subject: subject, html: htmlMessage,
    attachments: [{
      filename: 'hello-world.jpg',
      path: 'https://cdn.pixabay.com/photo/2015/10/23/10/55/business-man-1002781_960_720.jpg',
      cid: 'world'
    }]
  });

  return new Promise((resolve, reject) => {
    mail.build(function(err, res) {
      err ? reject(err) : resolve(res);
    });
  });
}
createEmail().then(message =>{
  const sesParams = {
    RawMessage: {
      Data: message
    },
  };
  return ses.sendRawEmail(sesParams).promise();
});

Creating emails that include attachments is really quite easy with Node, Lambda and SES. Doing this is a great step towards delivering rich emails that look like what your designers want.

 

Create a Promise Wrapper For a Standard Node Callback Method

Update: Prefer const over let

JavaScript Promises are the future, and a great pattern for writing asynchronous JavaScript code (allegedly async/await is an awesome way to do async JavaScript as well, but I’m not there yet). There are great APIs for working with promises, and many standard libraries work with them.

Unfortunately not all libraries support promises. Fortunately it isn’t hard to wrap a standard javascript callback pattern api in a promisified version.

The Node.js way is to have error-first callbacks. These are APIs which are passed a callback function with the signature function(error, success). For a good description see this decent blog post: The Node.js Way – Understanding Error-First Callbacks.

The classic example they provide is read file:

fs.readFile('/foo.txt', function(error, data) {
  // TODO: Error Handling Still Needed!
  console.log(data);
});

To convert this to a promise, create a new promise object, which calls reject with the error, and resolve with the data. If this is wrapped in a function, you’ll end up with a nice promisified readFile as per the following:

const fs=require('fs');

function readFilePromise(fileName) {
  return new Promise(function(resolve, reject){
    fs.readFile(fileName, function(err, data){
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
}
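It can then be used like this (the path is just an illustration):

readFilePromise('/foo.txt')
  .then(data => console.log(data.toString()))
  .catch(error => console.error(error));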

For extra cool kid points, use arrow functions:

const fs=require('fs');

const readFilePromise = fileName => {
  return new Promise((resolve, reject) => {
    fs.readFile(fileName, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
};

or to shrink it a little bit more:

const fs=require('fs');

const readFilePromise = fileName => {
  return new Promise((resolve, reject) => {fs.readFile(fileName, (err, data)=> {err ? reject(err) : resolve(data)})})
}

or to go a bit crazy with the inlining, and make your JavaScript look almost like Haskell 🙂

const fs=require('fs');

const readFilePromise = fileName => new Promise((res, rej) => fs.readFile(fileName, (e, d) => e ? rej(e) : res(d)));

So it’s easy to see that any asynchronous Node callback-style API can be wrapped in a promise API with 10 lines of readable code, or 1 line of terse JavaScript.

Adding rubocop to a legacy project

To add rubocop to a legacy project, first grab a .rubocop.yml that specifies your project’s code style and then run:

rubocop -c .rubocop.yml --auto-gen-config --exclude-limit 500

Then you’ll want to include the automatically generated todo file in your .rubocop.yml:

inherit_from: .rubocop_todo.yml

Run rubocop. Any violations that you now see will be caused by your config overriding the todo exclusions. Find the cops causing problems using rubocop -D.

Then fix them by doing things like:

increasing the Metrics/LineLength Max in the .rubocop.yml,

or perhaps setting Enabled to false for a particular cop, as sketched below.
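For example, the relevant .rubocop.yml entries might look something like this (the second cop is just an illustration; use whichever cops rubocop -D points at):

Metrics/LineLength:
  Max: 120

Style/Documentation:
  Enabled: false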

By doing this you can relatively quickly add rubocop to a legacy project with settings matching an organisation’s coding style, ready for you to really start making the codebase better.

Report Writing for Occupational Therapists

Over the past 18 months I’ve been working on Report Hero, report writing software to help Paediatric Occupational Therapists write reports. I’ve built it in partnership with one of the best Paediatric OTs I know, and I’m happy to see it being used by a number of OTs. If that’s your thing, take a look at the Report Hero website and sign up for a trial.

9 Things I learnt while moving data from RedShift into AWS Elasticsearch with AWS Lambda

The Amazon infrastructure is amazing and allows for interesting and cool scaling without the use of servers. It’s exciting to see what can be done. The trick with much of this is that many of the elements are asynchronous, so it can be easy to flood services, particularly when pulling data out of your RedShift data warehouse and putting it into Elasticsearch. I’ve learnt a bunch of things while doing this; the salient points are below.

  1. Don’t gzip the data you unload.
  2. Use the bulk load on elastic.
  3. Use a large number of records in the bulk load (>5000) – fewer large bulk loads are better than more smaller ones. When working with AWS Elasticsearch there is a risk of hitting the limits of the bulk queue size.
  4. Process a single file in the lambda and then recursively call the lambda function with an event.
  5. Before recursing, wait for a couple of seconds -> setTimeout.
  6. When waiting, make sure that you aren’t idle for 30 seconds, because your lambda will stop.
  7. Don’t use s3 object creation to trigger your lambda — you’ll end up with multiple lambda functions being called at the same time.
  8. Don’t bother trying to put Kinesis in the middle – unloading your data into Kinesis is almost certain to hit load limits in Kinesis.
  9. Monitor your Elasticsearch bulk queue size with something like this:
    curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool |jq '.nodes |to_entries[].value.thread_pool.bulk'

1 Unloading from RedShift

Doing the gunzip in the lambda takes time and resources in the lambda function. Avoid this by just storing the CSV in s3 and then streaming it out with S3.getObject(params).createReadStream().

Here is an UNLOAD statement that works well for me.

UNLOAD ('%SOME_AMAZING_QUERY_FROM_A_BIG_TABLE%')
TO 's3://%BUCKET%/%FOLDER%'
credentials 'aws_access_key_id=%AWS_KEY_ID%;aws_secret_access_key=%AWS_ACCESS_KEY%'
DELIMITER AS ',' NULL AS '' ESCAPE ADDQUOTES;
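On the lambda side, a minimal sketch of streaming one of the unloaded CSV files out of s3 might look like the following; the bucket, key and handleRow callback are placeholders for your own.

const AWS = require('aws-sdk');
const readline = require('readline');
const s3 = new AWS.S3();

// Stream the unloaded CSV straight out of s3 one line at a time,
// rather than pulling (and gunzipping) the whole file into memory.
const processFile = (bucket, key, handleRow) => {
    const stream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
    const lines = readline.createInterface({ input: stream });
    return new Promise((resolve, reject) => {
        lines.on('line', handleRow);
        lines.on('close', resolve);
        stream.on('error', reject);
    });
};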

2 Use the bulk load in elastic

The elastic bulk load operation is your friend. Don’t index each record one at a time and consume lots of resources; instead send up batches at the same time using the bulk operation.

3 Use a large number of records in the bulk load

Using more than 5000 records at a time in the bulk load is important. Fewer big loads are better than more small ones. See https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html#_how_big_is_too_big for setting the number and size of these.
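As a rough sketch of the batching, using the elasticsearch npm client (the client choice is an assumption, the index and type names are made up, and request signing for AWS Elasticsearch is left out):

const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'https://%ES-SERVER:PORT%' });

// Index rows in large bulk batches rather than one record at a time.
const bulkIndex = (rows, batchSize) => {
    const batches = [];
    for (let i = 0; i < rows.length; i += batchSize) {
        const body = [];
        rows.slice(i, i + batchSize).forEach(row => {
            body.push({ index: { _index: 'people', _type: 'person' } });
            body.push(row);
        });
        batches.push(body);
    }
    // Run the batches one after another so the bulk queue isn't flooded.
    return batches.reduce(
        (previous, body) => previous.then(() => client.bulk({ body: body })),
        Promise.resolve()
    );
};

Called with something like bulkIndex(rows, 5000), in line with the point above.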

4 Process a single file in each lambda function

To ensure that you don’t consume too many resources, process a single file in each lambda function, then recurse using:

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.invoke({
    FunctionName: context.invokedFunctionArn,
    InvocationType: 'Event',
    Payload: JSON.stringify(payload)
});

Use either the promise version or callback version as preferred. Keep track of where you are in the payload.

5 Wait before recursing

Before recursing as above, wait a couple of seconds to give elastic a chance to catch up:
setTimeout(function(){recurseFunction(event, context, callback)}, 2000);

6 Keep the wait short

If you don’t do anything for 30 seconds, Lambda will time out. Keep the wait short. Two seconds (as chosen above) wasn’t completely arbitrary.
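Putting points 4 to 6 together, the handler ends up with a shape roughly like the one below; the payload format (a list of files plus an index) is just one way to keep track of where you are, and processSingleFile is a hypothetical helper standing in for however you load one file into elastic.

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

module.exports.handler = (event, context, callback) => {
    const files = event.files;
    const index = event.index || 0;

    // processSingleFile is a placeholder for streaming one file from s3 and bulk loading it.
    processSingleFile(files[index]).then(() => {
        if (index + 1 >= files.length) {
            return callback(null, 'done');
        }
        // Wait a couple of seconds (point 5), then recurse with the next file (point 4).
        setTimeout(() => {
            lambda.invoke({
                FunctionName: context.invokedFunctionArn,
                InvocationType: 'Event',
                Payload: JSON.stringify({ files: files, index: index + 1 })
            }, err => callback(err, 'recursed'));
        }, 2000);
    }).catch(callback);
};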

7 Don’t use s3 object creation to trigger your lambda

One of the things to manage consistently is the rate of data flowing into elastic. Using the s3 object creation triggers for the lambda will result in multiple concurrent calls to your lambda function, which means too much data arriving at the same time. Trigger the lambda some other way.

8 Kinesis isn’t the answer to this problem

Putting the records to index into Kinesis will not act as a good way to control the massive flow of data from RedShift to elastic. While Kinesis is great for controlling streams of data over time, it’s not really the right component for this scenario of loading lots of records at once. The approach outlined throughout this document is more suitable.

9 Monitor your elastic resources with curl and jq

Unix commandline tools rock.

curl and jq are great tools for working with HTTP data: curl for getting data, jq for processing JSON data (https://stedolan.github.io/jq/).

Elasticsearch provides JSON APIs for seeing this data. The command below looks up the information on the bulk queue size.

curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool |jq '.nodes |to_entries[].value.thread_pool.bulk'

Conclusion

Serverless + the AWS stack is nice — you need to think about how to use it and knowing the tools + capabilities of the platform is important — with care you can do amazing things. Go build some great stuff.

Open Badges

I’ve recently been looking into the Open Badges Framework, with a goal of being able to understand what it is from a high-level technical standpoint.

The Open Badge Specification provides a standard for issuing badges.

The key participants in the open badge system are:

  • the badge issuer
  • badge displayers
  • badge earners
  • a badge backpack

A badge issuer will create a badge and issue it to a badge earner. The badge will consist of a number of cryptographically verifiable assertions. With the earner’s consent, an issuer may publish the badge to a badge backpack.

There is a reference implementation of a badge backpack implemented by Mozilla. This reference implementation is hosted out of the United States, and is probably the default way to publish badges. The source code for the reference implementation has also been made available for download and deployment (https://github.com/mozilla/openbadges-backpack).

In a healthy open badge ecosystem, there would be a small number of badge backpacks, a larger number of issuers, and an even larger number of earners.

Every organisation that wants to issue badges would need to be an issuer, but most organisations would (and should) be able to use a standard backpack. That said, when dealing with children, legal rules may lead to the creation of regional badge backpacks.