Recently, to support a client I needed to roll a custom Amazon Machine Image (AMI) development environment with a specific configuration of RStudio Server and assorted packages. While I needed to keep this environment updated with the latest security fixes and package changes, the total project time was short enough that it wasn’t worth spinning up additional infrastructure and investing time in a Jenkins-driven automatic build pipeline. In this post, I’ll go over some of the alternative design choices I made in building and using this environment. Even though this project was exclusively a data science and R development engagement, being able to solve infrastructure-related issues in the cloud kept me productive and cost-efficient.
Whenever possible, I use instance storage for my EC2 instances. This ephemeral storage is physically local to the host machine and sits on SSDs, giving performance that is hard to match on EBS without going for EBS-optimized instances, provisioned IOPS, and other (often expensive) tricks. It’s also completely free, while with EBS you’re going to pay some sort of charge (even if small). The downside is that instance storage truly is ephemeral: kill the instance and whatever was on the storage is gone. Stopping an instance backed with instance storage is also impossible (though reboots are fine). I was working with some fairly hefty instance sizes and wanted to minimize my infrastructure costs, leveraging spot instances to get the best possible rates for my compute time. To ensure that I wouldn’t lose data while working, I set up an AWS EFS mount for any persistent data. Should I lose an instance (which happened a few times over the length of the project), I could always spin up a new machine, mount my home drives, and continue with minimal impact.
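The EFS mount itself is just an NFSv4.1 export; a representative /etc/fstab entry (the filesystem ID, region, and mount point here are placeholders, not my actual configuration) looks something like:

```
# /etc/fstab — mount a shared EFS volume at /home (IDs are placeholders)
fs-12345678.efs.us-east-1.amazonaws.com:/  /home  nfs4  nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev  0  0
```

With an entry like this baked into the image, a freshly launched replacement instance picks up the persistent home directories automatically at boot.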
I was heavily using R Notebooks in RStudio to do interactive exploration and development on my data sets. This feature wasn’t yet GA and required that be running a preview version of RStudio Server. The preview release was fast moving in the final run up to the 1.0 RStudio release, with new updates coming out on a daily basis. Many of these changes didn’t affect my workflow, but there were occasional enhancements or bug fixes that I wanted to use. To capture these changes, I made a simple Packer workflow. This manually triggered process would grab a current Ubuntu image, apply the most current security fixes, then trigger a Chef run to install and configure all the R packages, RStudio Server, and development tools I needed. Finally, as spot instances don’t inherit the tags of the parent spot request (one of a fairly long list of personal gripes with AWS tagging), I deployed a tiny on-boot script to inspect the spot-request for tags and apply them to the launched instance, ensuring my project and cost accounting associated these correctly.
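An on-boot tag-copying script of this kind can be sketched as below. This is a minimal sketch, not the exact script from the post; it assumes boto3 is installed and that the instance profile grants `ec2:DescribeSpotInstanceRequests` and `ec2:CreateTags`.

```python
"""Copy user tags from the launching spot request onto the instance."""
import urllib.request


def tags_to_copy(spot_request_tags):
    """AWS-reserved keys (aws:*) cannot be set by users, so skip them."""
    return [t for t in spot_request_tags if not t["Key"].startswith("aws:")]


def copy_spot_tags(region="us-east-1"):
    import boto3  # available on the instance; imported lazily here

    # The instance metadata service tells us our own instance ID.
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()

    ec2 = boto3.client("ec2", region_name=region)

    # Find the spot request that launched this instance...
    requests = ec2.describe_spot_instance_requests(
        Filters=[{"Name": "instance-id", "Values": [instance_id]}]
    )["SpotInstanceRequests"]
    if not requests:
        return  # not a spot instance; nothing to do

    # ...and mirror its user tags onto the instance itself.
    tags = tags_to_copy(requests[0].get("Tags", []))
    if tags:
        ec2.create_tags(Resources=[instance_id], Tags=tags)


if __name__ == "__main__":
    copy_spot_tags()
```

Run from cloud-init or a systemd unit at boot, this makes the instance carry the same cost-allocation tags as the spot request that created it.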
The AWS toolchain for creating instance store-backed AMIs is showing a bit of neglect these days. While the AWS CLI has built-in support for making EBS-backed AMIs, instance store-backed AMIs require a different toolchain. Getting this toolchain to work with Packer was also a little harder than I first expected, with custom bundle_vol_command and bundle_upload_command configurations required to successfully build the image. Not overly difficult, and it all worked smoothly once I figured out the syntax, but it was a challenge I didn’t expect at the outset.
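For reference, a Packer `amazon-instance` builder stanza of this general shape is what’s involved; the two bundle commands below are adapted from the defaults in Packer’s documentation, and every account, bucket, and path value is a placeholder rather than my actual configuration:

```json
{
  "type": "amazon-instance",
  "region": "us-east-1",
  "source_ami": "ami-xxxxxxxx",
  "instance_type": "r3.4xlarge",
  "ssh_username": "ubuntu",
  "account_id": "0123-4567-8901",
  "s3_bucket": "my-ami-bundles",
  "x509_cert_path": "cert.pem",
  "x509_key_path": "key.pem",
  "x509_upload_path": "/tmp",
  "bundle_vol_command": "sudo -i -n ec2-bundle-vol -k {{.KeyPath}} -u {{.AccountId}} -c {{.CertPath}} -r {{.Architecture}} -e {{.PrivatePath}}/* -d {{.Destination}} -p {{.Prefix}} --batch --no-filter",
  "bundle_upload_command": "sudo -i -n ec2-upload-bundle -b {{.BucketName}} -m {{.ManifestPath}} -a {{.AccessKey}} -s {{.SecretKey}} -d {{.BundleDirectory}} --batch --region {{.Region}} --retry"
}
```

The syntax wrinkles mostly live in those two command templates, which shell out to the legacy `ec2-bundle-vol` and `ec2-upload-bundle` tools on the build instance.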
With the rapid changes in RStudio and some of the packages I was using, I was making new images on roughly a weekly schedule. Another complication of instance store-backed AMIs is that they are stored in user-managed S3 buckets rather than as EBS snapshots, and de-registering an image does not clean up the backing image files. Left untended, this can result in gigabytes of unused data files hanging out in S3. To manage this, I made a scheduled Lambda function (code provided) that runs weekly, loops through all the AMIs I own, and both de-registers and deletes the backing S3 files for all but the most current of each given image type (identified via a tagging system).
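The core of that cleanup Lambda can be sketched as follows. The selection logic is pure Python so it can be tested offline; the boto3 calls are outlined in the handler, the `ami-type` tag key is illustrative, and the S3 bundle deletion is elided since it depends on the bucket layout.

```python
"""Sketch of a weekly Lambda that prunes all but the newest AMI per type."""
from collections import defaultdict


def images_to_remove(images, type_tag="ami-type"):
    """Given AMI descriptions, keep the newest per type tag; return the rest.

    Each image is a dict with "ImageId", "CreationDate" (ISO 8601, so
    lexically sortable), and "Tags" (a list of Key/Value dicts).
    """
    by_type = defaultdict(list)
    for image in images:
        tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
        if type_tag in tags:  # untagged images are left alone
            by_type[tags[type_tag]].append(image)

    stale = []
    for group in by_type.values():
        group.sort(key=lambda i: i["CreationDate"], reverse=True)
        stale.extend(group[1:])  # everything but the most recent
    return [i["ImageId"] for i in stale]


def handler(event, context):
    import boto3  # provided by the Lambda runtime

    ec2 = boto3.client("ec2")
    images = ec2.describe_images(Owners=["self"])["Images"]
    for image_id in images_to_remove(images):
        ec2.deregister_image(ImageId=image_id)
        # The S3 bundle files behind an instance store-backed AMI must
        # be deleted separately (e.g. s3.delete_objects on the bundle
        # prefix named in the image manifest); omitted here for brevity.
```

Wired to a weekly CloudWatch Events schedule, this keeps the bucket from accumulating orphaned bundles.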
Actual deployment of the environment was handled via a Terraform configuration file which looks for the most recent image matching a defined set of tags and launches it, setting up all the VPCs, security groups, EFS endpoints, and other infrastructure. I use independent VPCs for every project I perform (and will be looking at independent accounts now that AWS is rolling out the long-hoped-for organizational management APIs). My Terraform processes leverage a common shared infrastructure remote state for things such as Route53 zones, EFS file systems, monitoring & logging, etc., but all the details on Lambdas, EC2, VPCs, etc. are completely dedicated to the project. This way I maintain positive control over the infrastructure and costs associated with each project.
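The image-lookup portion of such a configuration can be sketched like this (resource names, the tag key, and the instance type are hypothetical stand-ins, not my actual files):

```hcl
# Select the most recent AMI we own that carries the right type tag.
data "aws_ami" "dev_env" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:ami-type"
    values = ["rstudio"]
  }
}

resource "aws_instance" "dev" {
  ami           = "${data.aws_ami.dev_env.id}"
  instance_type = "r3.4xlarge"
  # VPC, security groups, EFS mount targets, etc. omitted
}
```

Because the data source always resolves to the newest matching image, a fresh `terraform apply` after a Packer build picks up the new AMI without any edits to the configuration.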
All told, getting the build process up and running was perhaps a day and a half of total work. Now that I have some established patterns for how to do builds and management, other environments can be set up in a fraction of that time.