The concept of performing full stack data analysis came up in a recent conversation. While I’ve seen the term mentioned half a dozen or so times in recent weeks, I haven’t found a satisfactory definition. Formal definitions are tricky things and are arguably a product of more mature disciplines than the gangly adolescence of information security risk management can easily produce. Instead, I’ll briefly enumerate some of the layers that make up the analysis stack at my workplace.
But first, why bother with this fairly lengthy list of tools, infrastructure, skill sets and other odds and ends? Someone eager to put data to work to inform their risk management program is likely to find themselves at least a little (if not a lot) ahead of their organization’s IT department. This isn’t a knock against corporate IT. They’re understandably occupied with supporting business functions with the typical big corporate applications, web sites, and other tooling that those functions require. Start talking about infrastructure as code, NoSQL databases, ephemeral cloud infrastructure, the pros and cons of YAML/JSON/XML/LMNOP, etc. and your typical industry veteran is going to give you some awfully blank looks. While I hope that DevOps and other IT process improvement movements will, given time, permeate all sorts of organizations, for now the aspiring evidence-based risk manager is far more likely to be working with cobbled-together castoff hardware, or using cloud technologies on a limited scale, to get the tools needed to work with their data.
This means that you get to be your own architect, network engineer, programmer, statistician, data analyst, visualization designer, marketer… well, you get the idea. To call us Jacks of All Trades doesn’t cover the half of it! Without further ado, here are the major layers in which my team has found needs and the tools we’ve applied to each domain.
You’re not going to get very far without some CPU cycles, memory to hold your working data set, and storage to persist your data. In our case, we’re using laptops with VMware as our primary development platform, with AWS letting us scale out to larger hardware and run persistent processes.
That bare metal has to run some sort of operating system, even if a very minimal variety. We’ve been relying heavily on Ubuntu 12 and 14, which seem to have the best degree of support across the collection of open source tools we need. We have some workloads on both AWS Linux (a Red Hat variant) and CentOS. We also have some Windows platforms that we use for data acquisition within our corporate environment.
While tools such as Alteryx and KNIME are very interesting graphical environments, there’s just no avoiding having to write code. At the very least, writing code allows you to take your iteratively designed work products and automate them, freeing you to move on to the next interesting problem! This is an area I struggle with on my team, as the explosion of languages here is frankly more than we can comfortably staff for. As of today, we’re heavily developing with R, Python, and PowerShell, plus a good chunk of Ruby.
Source Code Management
You’ve just developed a script that solves a problem – congratulations! Now you’ve just created a new problem – how are you going to manage that code, including bug fixes, feature changes, sharing it with other individuals in your team/organization/peer group, etc.? Enter source version control tools. When it comes to source control, the de facto solution today is Git. We use Git and GitHub for all of our source code. I’ve heard good things about GitLab for local hosting as well, but have not yet had a need to take on that level of complexity.
CSVs and other flat files are not to be knocked, but there’s just no substitute for a database to store data and be able to pull it back out in a variety of forms for consumption. We use a number of platforms, ranging from desktop Microsoft products such as Access, to RDBMS solutions like SQL Server, open source products including MySQL and PostgreSQL, and NoSQL technologies such as MongoDB. We’ve talked about graph technologies, but haven’t gotten to a point where we’ve needed them… yet.
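To make the flat file versus database point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the findings data is made up, the table and function names are hypothetical, and SQLite stands in for whichever database you actually run, simply because it ships with Python. It loads CSV rows into a table and then pulls the same data back out in two different shapes.

```python
import csv
import io
import sqlite3

# Hypothetical findings data; in practice this would be a CSV file on disk.
CSV_TEXT = """host,severity,finding
web01,high,outdated TLS
web01,low,verbose banner
db01,high,default credentials
db01,medium,missing patch
"""


def load_findings(conn, csv_text):
    """Create a findings table and load the rows from CSV text into it."""
    conn.execute("CREATE TABLE findings (host TEXT, severity TEXT, finding TEXT)")
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["host"], r["severity"], r["finding"]) for r in reader]
    conn.executemany("INSERT INTO findings VALUES (?, ?, ?)", rows)
    conn.commit()


def count_by(conn, column):
    """Count findings grouped by one of a fixed set of allowed columns."""
    if column not in ("host", "severity"):
        raise ValueError(f"unsupported column: {column}")
    # The column name is checked against the allow-list above before use.
    query = f"SELECT {column}, COUNT(*) FROM findings GROUP BY {column} ORDER BY {column}"
    return conn.execute(query).fetchall()


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
load_findings(conn, CSV_TEXT)
print(count_by(conn, "severity"))
print(count_by(conn, "host"))
```

The same table answers both questions without re-parsing the file, which is most of the appeal once the number of files and questions starts to grow.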
Specific Applications Knowledge
Beyond the databases are a number of domain-specific data stores and processing engines. In this category are Elasticsearch (along with Logstash and Kibana) and Hadoop (specifically Pig, with some work coming soon).
Data Visualization and Report Creation
All that analysis does no good if you can’t communicate your message. Enter the wonderful world of data visualization. Here my team has been able to take advantage of our organization’s large-scale commitment to Tableau for rapid data exploration and dashboard creation. Dirty secret here, though: I’ve been doing a lot more with R graphics than Tableau recently. I’m eager to try out some of the recent Shiny app features in RStudio as possible replacements for some use cases currently handled by Tableau.
General Technical Knowledge
Beyond all the tooling is a general layer of domain-specific knowledge that I would be remiss to leave out. Technical security knowledge in areas such as networking, firewalls, encryption, and the other topics you find on the various certification exams is still very important to us. Without domain knowledge, we’d be incapable of framing our questions, knowing what data to gather, or having a frame of reference to guide our always overtaxed work and research schedules.