There’s a rite of passage for every #rstats-using data scientist with Type B tendencies. In this story, our hero(-ine) gets R up and running on a cloud instance and posts a blog entry on their experiences. Far be it from me to eschew the traditions of my predecessors. As part of some internal mentoring I’m doing for coworkers who are going through the Coursera Data Science program, I recently built a pipeline for creating custom images running on spot instances, with TLS-terminating load balancers, dynamically updated DNS and network security groups, and a fully configured RStudio Server. So sit right back and I’ll tell the tale of how the R was won, along with sharing the code for replicating this in your environment, should you be so inspired.
Installing RStudio Server
Installing RStudio Server itself on a Linux box is as easy as downloading a package from the nifty folks at RStudio for the Linux distribution of your choice. Never one to take the easy path, I went a bit further and leveraged Chef to get an idempotent, repeatable set of code that I could deploy on a variety of Linux flavors, customized to my needs and configured with exactly the libraries, users, and settings that I want.
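For anyone who does want the easy path, the manual install is roughly the following sketch. The version number and URL are placeholders, not the current release — check RStudio’s download page for the real ones.

```shell
# Manual install on an Ubuntu/Debian box — a sketch only; the VERSION
# placeholder below must be replaced with a real release number.
sudo apt-get update
sudo apt-get install -y r-base gdebi-core
wget https://download2.rstudio.org/rstudio-server-VERSION-amd64.deb
sudo gdebi -n rstudio-server-VERSION-amd64.deb
# The server listens on port 8787 by default.
```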
As a first step, I fired up an environment with one of my favorite tools, HashiCorp’s Vagrant, and started to build a set of Chef recipes that would take me from a bare install up to a fully configured RStudio Server environment. I leveraged two community cookbooks for this task: the R cookbook for installing the core R environment itself and the RStudio-Server cookbook for the tasty IDE bits. Since I knew this was going to be used in a time-bound and network-limited manner, I also created a custom cookbook (sch-trainingusers) to auto-provision a suite of generic training users so that each learner in my sessions could have their own login.
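The day-to-day development loop looks something like this. It assumes (my assumption, not spelled out above) a Berksfile next to the Vagrantfile listing the r, rstudio-server, and sch-trainingusers cookbooks:

```shell
# Hypothetical workflow; assumes a Berksfile listing the cookbooks
# named in the post sits alongside the Vagrantfile.
berks install      # resolve and fetch the community cookbooks
vagrant up         # boot the VM and run the Chef provisioner
vagrant provision  # re-apply recipes after edits; idempotent, so safe to repeat
vagrant ssh        # poke around inside the result
```

Because the recipes are idempotent, re-running vagrant provision after a tweak only applies the delta rather than rebuilding from scratch.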
Along the way I needed to file a couple of pull requests for both the R and RStudio cookbooks, as they were mostly set up for Ubuntu images. While I run a lot of Ubuntu myself, I primarily run Amazon Linux for workloads that are semi-production, so I forked both cookbooks and incorporated support for Amazon Linux, Red Hat, and CentOS. There are still a few merges to take place and PRs to file, so for now I’m referring to my personal (though public) repositories, but I expect to go back to using the official branches in the very near future.
One of the strengths of the #rstats community is its wonderful ecosystem of libraries. To ensure that common tools are available, I used the R cookbook’s functionality to pre-load a variety of packages on my server. While everyone has their own preferred list of packages, for this project I’m deploying (in no particular order): dplyr, data.table, ggplot2, scales, tidyr, caret, knitr, binom, rmarkdown, httr, rvest, readr, devtools, and readxl. Many of these require build tools on the local install, which the dependency on the build-essential cookbook takes care of for me.
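The cookbook drives this through its own package resources, but the net effect is roughly this one-liner (package names as listed above — I’m reading the post’s “harvest” as the web-scraping package rvest):

```shell
# Roughly what the cookbook pre-loads; requires R plus a compiler
# toolchain (the build-essential dependency) already on the box.
sudo Rscript -e 'install.packages(
  c("dplyr", "data.table", "ggplot2", "scales", "tidyr", "caret", "knitr",
    "binom", "rmarkdown", "httr", "rvest", "readr", "devtools", "readxl"),
  repos = "https://cloud.r-project.org")'
```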
Outcome: with one command (“vagrant up”), a one-off VMware/AWS instance of RStudio Server, fronted by nginx.
Building an Image
At this point I could easily stand up my RStudio Server with a single vagrant up command. But as that starts from a clean install, all the package downloading and compiling could take 10-15 minutes depending on the size of the server I was running, how fast the internet tubes were behaving at that point, etc. Given that the server binary and package versions aren’t changing very often, this presented a great opportunity to bake that image, creating my own custom AMI with everything already installed and configured just the way I want it.
One mechanism to do this would be simply to go into the AWS web console and create an AMI from an instance already set up and running. That’s distressingly GUI-driven and not something I could put into a full continuous-delivery pipeline. Instead, I turned to another HashiCorp tool, Packer, to take my Vagrant process and, instead of creating a running machine, output a custom AMI. You can access the template I created on GitHub.
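Baking the image then becomes two pipeline-friendly commands. The template filename here is a placeholder for whatever the template in the repo is actually called:

```shell
# Validate, then build the AMI from the Packer template.
# "rstudio.json" is an illustrative filename, not the repo's real one.
packer validate rstudio.json
packer build rstudio.json   # on success, Packer reports the new AMI ID
```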
Outcome: a pre-installed image with the full RStudio Server configuration and supporting packages, launchable via standard Amazon APIs in a fraction of the time it takes to build from scratch.
One Click Deploy on AWS
Now I had the ability to spin up an EC2 instance with my RStudio Server instance pre-loaded and configured just right. Done, right? Well, almost. In addition to the basic service, there is more infrastructure required to get things working for end users. DNS resolution has to be up and running, TLS offloading needs to be functional, security groups defined, and because my dollars are precious (hello, home ownership!) I also wanted to leverage AWS spot instances to pay the absolute minimum needed to get things running for my users. How to accomplish this setup on a regular basis and then tear everything down when I’m done? Enter CloudFormation!
I’ll confess to a love-hate relationship with CloudFormation. As a means of automatically configuring chunks of Amazon infrastructure, it’s incredibly powerful. At the same time, the language is big blocks of JSON, which is a cursedly picky syntax to write in (darn you, missing commas!). I desperately want to dive into Terraform or even SparkleFormation to avoid working in it in the future, but for now, this works fine. While the production version of this template contains some non-public information (domain names, account numbers, etc.), a lightly scrubbed version is available as a public Gist.
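Stamping out (and later destroying) a stack from the S3-hosted template is then a pair of CLI calls. The bucket, stack, and parameter names below are invented for illustration and would need to match your own template:

```shell
# Launch the stack from the S3-stored template. The stack name, bucket
# path, and SpotPrice parameter key are all illustrative placeholders.
aws cloudformation create-stack \
  --stack-name rstudio-training \
  --template-url https://s3.amazonaws.com/my-bucket/rstudio-server.template \
  --parameters ParameterKey=SpotPrice,ParameterValue=0.10

# Tear everything down when class is over:
aws cloudformation delete-stack --stack-name rstudio-training
```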
Outcome: able to one-click deploy a stack, based on an S3-stored CloudFormation template, that launches an RStudio Server instance at a user-defined spot price, along with TLS offloading, DNS entries pointing to the load balancer, and security groups limiting access to my users.
What a Long Strange Trip It’s Been
I’m very happy with the solution and have stood up instances based on this template several times. I’ve also used the Packer template to update my server image a couple of times as I refined the list of packages I wanted available by default. There are a few things I could do better – I’m not satisfied with the way training users are set up, and there really should be a transparent redirect from HTTP to HTTPS to give users a smoother experience. Those are very minor warts, though, and for something that only took a few days of noodling after work hours, it’s a very acceptable outcome.
I hope this was useful. If you’d like further details on any of the steps I’ve done or the design decisions made, please comment or drop me a line!