Without regular maintenance, software stacks, like people, tend to spread out and get lumpy as they age. When running processes on AWS, staying still can be especially painful because the Amazon platform's features evolve rapidly. While this rarely breaks stable features, I recommend keeping an eye on the direction the services you use are evolving in, and avoiding getting too comfortable with any one particular way of doing things as the One True Way. Otherwise, by the time you try to take advantage of the latest super-scalei-fragilistic-excloudi-ociousness, you may find your nice pipelines have grown inter-dependencies like kudzu vines. It was exactly this sort of scenario I recently found myself fighting with two of my ELK and EMR pipelines.
One of our first workloads on AWS was a simple EMR job for some log parsing. As our needs grew, we stood up an ELK cluster for processing much of the same data. To keep the number of copies of data files to a minimum, we put some hooks into the EMR process to copy files over to our ELK buckets. This worked great until I needed to change the pipelines, and timing issues between the two workflows became a real problem. Further complicating matters, we wanted to copy some of these files to other locations to feed even more pipelines.
Our solution was to rethink the file copy process. We now place all of these on-premises files into a central S3 bucket for archival and storage, then leverage S3 event notifications. When objects are dropped into this bucket, S3 publishes a message to an SNS topic, which in turn notifies a dedicated Lambda function for each consuming pipeline. Each Lambda function has a local configuration that tells it which pipeline it is servicing; it copies the files its consumer needs to the appropriate location and uncompresses tarballs into their constituent parts.
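To make the flow concrete, here is a minimal sketch of what one of those per-pipeline Lambda handlers looks like. The bucket names, prefixes, and environment variables below are hypothetical stand-ins for each pipeline's local configuration; the real code linked below is more complete.

```python
import io
import json
import os
import tarfile
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

# Hypothetical per-pipeline configuration, kept local to this function.
DEST_BUCKET = os.environ.get("DEST_BUCKET", "example-pipeline-bucket")
DEST_PREFIX = os.environ.get("DEST_PREFIX", "incoming/")


def handler(event, context):
    """Triggered by SNS; each record wraps an S3 ObjectCreated event."""
    for sns_record in event["Records"]:
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for record in s3_event.get("Records", []):
            src_bucket = record["s3"]["bucket"]["name"]
            # Keys in S3 event notifications are URL-encoded.
            src_key = unquote_plus(record["s3"]["object"]["key"])

            if src_key.endswith((".tar", ".tar.gz", ".tgz")):
                # Pull the tarball into memory and upload its constituent
                # files individually (fine for a sketch; large archives
                # would need streaming or more Lambda memory).
                obj = s3.get_object(Bucket=src_bucket, Key=src_key)
                with tarfile.open(fileobj=io.BytesIO(obj["Body"].read())) as tar:
                    for member in tar.getmembers():
                        if not member.isfile():
                            continue
                        s3.put_object(
                            Bucket=DEST_BUCKET,
                            Key=DEST_PREFIX + member.name,
                            Body=tar.extractfile(member).read(),
                        )
            else:
                # Plain files are copied straight across to the
                # pipeline's location.
                s3.copy_object(
                    Bucket=DEST_BUCKET,
                    Key=DEST_PREFIX + os.path.basename(src_key),
                    CopySource={"Bucket": src_bucket, "Key": src_key},
                )
```

Because each consuming pipeline gets its own SNS subscription and function, adding another consumer is just another subscription plus a small config change, with no coupling back to the EMR or ELK workflows.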
The code for this function, along with more comprehensive directions for using and deploying it, is now available on GitHub.