These last few days of 2020 (cue the chorus: “Get thee Behind Me!”) are a company-wide holiday over at the day job. While I have been trying to break some destructive workaholic patterns, I’ve allowed myself to fiddle with some cloud infrastructure efforts as a sort of methadone for what the normal work week entails. Most recently, that’s involved standing up a proof of concept API for a new data product we’ll be releasing later in the new year.
While the details of this new data product are exciting, it’s not quite ready for public unveiling and this is a personal blog in any case. But you should stay tuned to cyentia.com for more. The initial release of the product involves a mess of interesting architecture in its own right, culminating in a static CSV file that is generated on a periodic basis. That meets our initial needs, but one feature cut from the initial design specs was an API for retrieving specific segments of the data. I wanted to have a go at mocking up a quick API with a minimum amount of new code, infrastructure, or really any ongoing work at all. I was happy to realize that this was very doable on AWS, and I was able to get that minimum viable, proof-of-concept JSON API up and going with only a few bumps along the way.
My solution currently has the following components:
- AWS API Gateway (v1 - REST)
- DynamoDB table
- S3 Notifications
- Python Lambda function
- Terraform IaC
Populating the database
My first step was to take the static S3 file and get it into a database of some form. The data set is neither particularly large nor very wide, so setting up a relational database would have been overkill. A small DynamoDB table was trivial to set up and, given our super light usage, has a good chance of staying within AWS’s free tier, making it a no-cost solution. DynamoDB is a service I’ve only lightly used (mostly for Terraform locks) and it was nice to have a use case for this powerful and lightweight key-value store.
The key challenge in getting this into Dynamo was a data structure problem. In the native S3 format, we have a composite key structure. Think of rows of data with a department and employee name being the key for a bunch of supplemental facts. The combination of department and employee name is guaranteed to be unique, while subscribers are expected to only be looking up departments. In R this is a simple nested data structure. Transforming this into JSON for Dynamo isn’t very straightforward, as the DynamoDB API has its own JSON schema where a data type has to be attached to each field. So a department with nested employees (and their supplemental information) becomes a structure in which every field is itself a nested map of data type to value. Yuck!
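To make the typed format concrete, here is a small sketch of what that wrapping looks like. The to_dynamo helper and the department/employee values are hypothetical, written only for illustration. It covers just a handful of DynamoDB’s type descriptors (S, N, BOOL, NULL, L, M) and is not the AWS serializer.

```python
import json

def to_dynamo(value):
    """Wrap a plain Python value in DynamoDB's typed JSON (illustrative subset)."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers travel over the wire as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_dynamo(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamo(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")

# Hypothetical record: one department with its nested employees.
plain = {
    "department": "Finance",
    "employees": {
        "Alice": {"title": "Analyst", "tenure_years": 3},
    },
}

# Top-level attributes are each wrapped; nested maps get wrapped recursively.
typed = {k: to_dynamo(v) for k, v in plain.items()}
print(json.dumps(typed, indent=2))
```

Even this tiny record picks up a type descriptor at every level of nesting, which is why writing the format by hand gets tedious quickly.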
The Python boto3 library has some nifty magic built in that will auto-detect the data types of passed dict objects and do the Right Thing(tm) for the API. This is a super handy bit of functionality that the corresponding paws package in rstats lacks. With reluctance, I dusted off my rusty Python and managed to do some probably very non-pythonic things to create nested dict objects from the unnested original CSV. Loading it into Dynamo was then a trivial API call.
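A minimal sketch of that nesting step might look like the following. The column names, sample rows, and the nest_rows helper are all assumptions made for illustration, not the real schema or script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the flat source file; column names are assumptions.
RAW_CSV = """department,employee,title,tenure_years
Finance,Alice,Analyst,3
Finance,Bob,Controller,7
Legal,Carol,Counsel,5
"""

def nest_rows(csv_text):
    """Group flat rows into one item per department, with employees nested inside."""
    departments = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        dept = row.pop("department")
        name = row.pop("employee")
        row["tenure_years"] = int(row["tenure_years"])  # CSV values arrive as strings
        departments[dept][name] = row
    return [
        {"department": dept, "employees": employees}
        for dept, employees in departments.items()
    ]

items = nest_rows(RAW_CSV)

# With boto3 installed and credentials configured, each item could then be
# written through the resource interface, which accepts plain Python types:
#   table = boto3.resource("dynamodb").Table("example-table")  # hypothetical name
#   table.put_item(Item=item)
```

Keying each item on the department alone matches the expected lookup pattern: one read returns a department along with everything nested under it.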
Exposing the database as an API
With the data now in a queryable database, my next challenge was to expose this to the world in an access-controlled and performant fashion. Enter Amazon’s API Gateway service. Using the REST gateway option (under the older V1 offering), there’s the capability of wiring up API methods and resources directly to arbitrary AWS services. This takes input from the API user, optionally applies transformations, sends the transformed input to an AWS service, optionally transforms the response from the service, and sends it back to the user. Wonderful!
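In API Gateway itself, these transformations are written as mapping templates rather than Python, but the shape of the round trip can be sketched in a few lines. Everything named here (the table name, the department key attribute, the helper functions, and the response envelope) is an assumption for illustration only.

```python
def build_query_request(department, table_name="example-table"):
    """Request side: turn the caller's input into a DynamoDB Query payload."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#d = :d",
        "ExpressionAttributeNames": {"#d": "department"},
        "ExpressionAttributeValues": {":d": {"S": department}},
    }

def from_dynamo(typed):
    """Unwrap one typed DynamoDB value into a plain Python value."""
    (tag, value), = typed.items()  # each typed value carries exactly one descriptor
    if tag == "S":
        return value
    if tag == "N":
        return int(value) if value.lstrip("-").isdigit() else float(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [from_dynamo(v) for v in value]
    if tag == "M":
        return {k: from_dynamo(v) for k, v in value.items()}
    raise ValueError(f"unhandled type descriptor: {tag}")

def build_response(dynamo_response):
    """Response side: strip the type descriptors before replying to the caller."""
    return {
        "items": [
            {k: from_dynamo(v) for k, v in item.items()}
            for item in dynamo_response.get("Items", [])
        ]
    }

# Hypothetical Query result for a single department.
sample_response = {
    "Items": [
        {
            "department": {"S": "Finance"},
            "employees": {"M": {"Alice": {"M": {"tenure_years": {"N": "3"}}}}},
        }
    ],
    "Count": 1,
}
request_payload = build_query_request("Finance")
clean = build_response(sample_response)
```

The point of the response-side step is that subscribers see ordinary JSON and never have to know the data lives in DynamoDB.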
The details of this rather complex dance of transformations aren’t terribly interesting. The curious should look at this blog from AWS – it’s the template I used for my own configuration. With the basic API methods created and the REST service ready, I had access to API Gateway’s access keys and usage plans for access control. Usage plans plus keys generate static tokens for each subscriber, controlling access and setting limits on the volume of requests as subscription terms and application needs dictate. Additionally, internal keys can be created for unlimited internal use.
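From a subscriber’s point of view, using one of those keys comes down to sending it in the x-api-key request header. A rough sketch of a client request follows; the invoke URL, the /departments resource path, and the key value are placeholders I made up, not the real endpoint.

```python
import urllib.parse
import urllib.request

# Hypothetical values: the invoke URL, resource path, and key are placeholders.
BASE_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod"
API_KEY = "replace-with-subscriber-key"

def build_request(department):
    """Build (but do not send) a GET request carrying the subscriber's API key."""
    url = f"{BASE_URL}/departments/{urllib.parse.quote(department)}"
    return urllib.request.Request(url, headers={"x-api-key": API_KEY}, method="GET")

req = build_request("Human Resources")

# urllib.request.urlopen(req) would actually send the request.
print(req.get_method(), req.full_url)
```

When a method requires a key, a request without a valid one should be rejected with a 403, and one that exceeds the usage plan’s throttle or quota with a 429, so a client is wise to handle both.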
Keeping the data current
Now I had a working API service, but the source data changes on a continual basis, with batched drops happening daily. I needed to get the current data into the DynamoDB table. Here’s where the work on that Python script paid off. AWS S3 has a handy offering, S3 Notifications, that fires off events when various actions occur in an S3 bucket. By configuring events to trigger when PutObject operations took place with a specific object key and file extension (representing an upload of a new data file), that Python script, packaged as a Lambda function, could retrieve the file, do the nesting operation into the JSON our API needs, and post it all to DynamoDB.
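The event-handling half of such a Lambda can be sketched as below. The prefix and suffix values are assumptions (the real ones aren’t given here), and the load hook is a hypothetical stand-in for the fetch, nest, and write steps.

```python
from urllib.parse import unquote_plus

# Assumed filter values; the real key prefix and extension may differ.
KEY_PREFIX = "exports/"
KEY_SUFFIX = ".csv"

def new_data_files(event):
    """Return (bucket, key) pairs for uploaded objects that look like data files."""
    found = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if key.startswith(KEY_PREFIX) and key.endswith(KEY_SUFFIX):
            found.append((bucket, key))
    return found

def handler(event, context, load=None):
    """Lambda entry point; `load` is a hypothetical hook that fetches, nests, and writes."""
    targets = new_data_files(event)
    for bucket, key in targets:
        if load is not None:
            load(bucket, key)
    return {"processed": len(targets)}

# A trimmed-down example of the notification payload S3 delivers.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "exports/daily+drop.csv"},
            },
        }
    ]
}
seen = []
result = handler(sample_event, None, load=lambda b, k: seen.append((b, k)))
```

Re-checking the key inside the function is a belt-and-braces choice: the notification configuration can already filter on prefix and suffix, but the extra check keeps the function safe if that configuration drifts.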
Wiring this all up as infrastructure as code (IaC)
Initial development took place in a dedicated sandbox account (thanks to AWS Organizations, you magnificent blankity-blank). As this was using services I was less familiar with, there was a fair amount of console clicking about for the initial configuration. To make this truly production ready (even if proof of concept), I created an internal Terraform module that takes as inputs the S3 location and the list of customers, then creates the API gateway (with all the transformations), creates the DynamoDB back end, sets up all the Lambda-powered data updates, manages the unique IAM roles and permissions needed, and also creates both internal and subscriber (configurable, natch) API keys. With the power of Terraform, this is now version controlled, repeatable, and auditable infrastructure. Some of the various CI checks we have on our Terraform control also revealed some good guidance on specific resources I hadn’t considered, such as setting up point-in-time backups on the DynamoDB table.
Looking back to look forward
I’m quite pleased with how this all came together. That being said, it’s a rare project that I come out of without a laundry list of things that I’d like to enhance/rework/update. Some of those items from this project include:
- Tighter IAM permissions - I use dedicated service roles for each component of this solution, but the permissions could be stricter. There are several AWS managed policies employed, which I don’t trust implicitly. There’s good blast radius control on the account hosting this though, and the permissions are by no means egregious.
- Better logging - The Lambda pipes logs to CloudWatch, which is great for reviews, but I don’t have a pattern for surfacing job successes and failures to the team in a way that doesn’t cause alert overload.
- Expiring data entries - Some of the department keys may drop out of the data set. As the Lambda does a blind load of all keys from the current source file, without checking for missing data, it’s possible for data to be orphaned and stale data to continue to be served. Item expirations may be a solution here.
- More efficient data uploads - The Lambda does a single PutItem call per key. This takes several minutes to run a full load. A batched put would probably be faster.
- DynamoDB performance - I’m using a statically provisioned number of read and write capacity units. Switching to a more elastic model may be justified from both a performance and cost basis if this goes forward as something more customer-facing.
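Two of those items, expiring entries and batched uploads, are easy to sketch together. This is only a rough outline under assumptions: the expires_at attribute name and the two-day lifetime are made up, and DynamoDB’s TTL feature would need to be enabled on that attribute for expiry to happen. The batch size of 25 reflects the per-call limit on BatchWriteItem.

```python
import time

BATCH_LIMIT = 25  # BatchWriteItem accepts at most 25 put/delete requests per call
TTL_ATTRIBUTE = "expires_at"  # hypothetical attribute name; TTL must be enabled on it

def with_expiry(item, days, now=None):
    """Return a copy of the item stamped with an epoch-seconds expiry time."""
    now = int(time.time()) if now is None else now
    return {**item, TTL_ATTRIBUTE: now + days * 86400}

def batches(items, size=BATCH_LIMIT):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Sixty placeholder items, each stamped to expire two days after a fixed "now".
items = [{"department": f"dept-{i}"} for i in range(60)]
stamped = [with_expiry(item, days=2, now=1_600_000_000) for item in items]
sizes = [len(chunk) for chunk in batches(stamped)]
```

Because every daily load would re-stamp the keys still present in the source file, a key that drops out simply stops being refreshed and ages out on its own. On the upload side, boto3’s Table.batch_writer() context manager takes care of the chunking and of resending unprocessed items, so the manual batching above is mostly there to show the idea.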