While Kibana offers a nice interface for quickly zooming in on activity in logs processed by Logstash and stored in Elasticsearch, there are some situations where having access to the full parsed event stream is a must. For pulling out bulk data, I’ve recently become a fan of Elasticsearch scroll queries. This post will quickly cover the process and sample code for extracting events of interest for analysis.

Scroll queries are an API in Elasticsearch for doing things that are traditionally not very Elasticsearch-y. While most search queries are focused on getting the most relevant results out of a data store, they aren’t well suited for pulling back every possible result. An example of this paradigm is the Google search page. You can put in your search for, say, fuzzy poodles (hey, I don’t judge) and quickly find the link for Little Fuzzy Tea Poodles.com, images of poodles (in various states of fuzziness), and so forth, but you don’t get back all 596,000 hits that Google could return to you. That’s a perfectly reasonable trade-off, since most users don’t go beyond the first page of their search results, but if you’re trying to create a longitudinal study of access patterns in your log entries, you usually need the entire event stream matching your criteria. Scroll queries allow you to ask Elasticsearch for every last entry matching a query and then get the results back in chunks that together represent the entire set of matching records. This may take some time to execute and retrieve, but hey…if you wanted real-time search you would have made a regular search query now, wouldn’t ya?
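
To make that concrete, here’s a minimal sketch of the raw scroll flow using the official Python client (more on that in a moment). The host, index pattern, page size, and query are all placeholder assumptions, but the call pattern is the point: one initial search that opens a scroll context, then repeated scroll calls until a page comes back empty.

```python
# Minimal sketch of the raw scroll flow, written against the 7.x-style
# elasticsearch-py client. Host, index pattern, and query are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# The initial search opens a scroll context that the cluster keeps alive
# for two minutes between requests.
resp = es.search(
    index="logstash-*",
    body={"query": {"match_all": {}}},
    scroll="2m",
    size=1000,
)
scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

# Keep asking for the next chunk until a page comes back empty.
while hits:
    for hit in hits:
        print(hit["_source"])  # stand-in for whatever you do with each event
    resp = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

# Politely release the scroll context when finished.
es.clear_scroll(scroll_id=scroll_id)
```

In practice you rarely write this loop by hand, because the helper covered next does it for you.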

While most of my work tends to be in PowerShell or R, when it comes to generic API interfacing and data munging my tool of choice is definitely Python. Elasticsearch provides the elasticsearch-py module, available via pip, as the officially supported Python client. The module is relatively easy to use, even for a Python newbie. You can follow along with this gist and see the following steps used to extract data from a sample Elasticsearch cluster.

The steps, sketched in code after the list, are:

  1. Establish a connection to the Elasticsearch cluster.
  2. Construct a query with the same syntax as a regular search (in this case, querying for all traffic coming into our environment from the two Google DNS source IPs).
  3. Send the query using the scroll helper function, which takes care of making the calls for each chunk of the data and making it available to Python as a single unified stream.
  4. Unwrap the results and save to a comma-separated file (actually a tab-separated file in this case, but close enough).
  5. Feed the output into further analysis tools, such as the Marx video creator script from Jay Jacobs.
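
Here’s how those steps might look stitched together with the scan helper. This is a sketch under assumptions: the cluster URL, index pattern, output filename, and field names (src_ip, dst_ip, @timestamp) are placeholders for whatever your Logstash mappings actually contain; 8.8.8.8 and 8.8.4.4 are the two Google DNS addresses being filtered on.

```python
# Sketch of steps 1-4 using the scan/scroll helper in elasticsearch-py.
# Index pattern and field names are assumptions about the Logstash mappings.
import csv
from elasticsearch import Elasticsearch, helpers

# 1. Establish a connection to the cluster.
es = Elasticsearch(["http://localhost:9200"])

# 2. Construct a query with regular search syntax: all traffic whose source
#    address is one of the two Google public DNS resolvers.
query = {"query": {"terms": {"src_ip": ["8.8.8.8", "8.8.4.4"]}}}

# 3. helpers.scan() wraps the scroll API and yields every matching document
#    as one unified stream, fetching the chunks behind the scenes.
results = helpers.scan(es, query=query, index="logstash-*", scroll="5m")

# 4. Unwrap each hit and write the fields of interest to a tab-separated file.
with open("google_dns_traffic.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["timestamp", "src_ip", "dst_ip"])
    for hit in results:
        src = hit["_source"]
        writer.writerow([src.get("@timestamp"), src.get("src_ip"), src.get("dst_ip")])
```

One nice design choice in helpers.scan() is that it defaults to preserve_order=False, trading result ordering for a much cheaper scroll, which is exactly what you want when the goal is “give me everything” rather than “give me the best ten.”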

This has been a generally successful pattern, though I have had some weird issues with queries timing out or not returning all of the expected results. I suspect this is because I’m pulling some really large datasets (often a gig or more of traffic) and I’m often working from a location remote from my cluster. I need to get a tool server set up so I can rerun these queries from a more stable box that sits logically closer to the cluster and do some performance testing. Glitches notwithstanding, working with scroll queries has been a good experience, as this is a common programming model. I’ve had to work through similar cursor-style iteration with MongoDB and, most recently, with an intelligence feed vendor, where being able to issue iterative queries that return partial results has been invaluable. But that’s a post for a different day. Until then, happy scrolling!