I would like to run a Splunk query over a long period of time (e.g., months or years), but the volume of data involved means I am only able to search over hours or days at a time.
However, for the question I want to answer in Splunk, I would be satisfied with a uniform or statistically unbiased sample of the data. In other words, I would prefer the query return N events spread out over the past month rather than any N consecutive events.
One way I considered was to only search events with date_minute=0, which quickly narrows the search to roughly 1/60th of the events; this helps, but it is not very flexible.
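For example, something along these lines (the index name here is just illustrative):
index=web date_minute=0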
Is there a better way to sample events efficiently in Splunk?
If you are trying to run a search and you are not satisfied with the performance of Splunk, then I would suggest you either report-accelerate it or data-model-accelerate it. Alternatively, you can create your own tsidx files (the same kind created automatically by report and data model acceleration) with tscollect, then run tstats over them.
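As a rough sketch (the namespace name, the date_minute filter, and the host field are only placeholders), you could first collect a sampled subset of events into a namespace:
* date_minute=0 | tscollect namespace=sampled_events
and then run fast summary queries against the resulting tsidx files:
| tstats count FROM sampled_events GROUPBY host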
I found a related discussion on sampling on the Splunk Answers page below.
http://answers.splunk.com/answers/3743/is-it-possible-to-get-a-sample-set-of-search-results-rather-than-the-full-search-results
An alternative to filtering by date_minute or date_second is to filter events in a where clause using the _serial property or the random() function. For example,
* | where (_serial % 60) = 0 | ...
or
* | where (random() % 60) = 0 | ...
However, in both cases the search appears to do a full scan of the data. This may still be desirable if you need the flexibility and the result is feeding into a more expensive query. Otherwise, the date_second approach is significantly faster because events are apparently indexed by that field. For example, the two queries above ran in 3m 20s on a subset of data, whereas the query below ran in 11s on the same data.
* date_second=0 | ...