 

How can I efficiently sample a long time span in Splunk?

I would like to run a Splunk query over a long period of time (e.g., months or years), but the data volume is large enough that I can only practically search over hours or days at a time.

However, for the question I want to answer in Splunk, I would be satisfied with a uniform or statistically unbiased sample of the data. In other words, I would rather the query return N events spread out over the past month than N consecutive events.

One way I considered was to search only events with date_minute=0, which quickly filters the data down to roughly 1/60th of the events, but this is not very flexible.
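As a concrete sketch of that idea, with placeholder index and sourcetype names, the search would look like:

index=main sourcetype=access_combined date_minute=0 | ...

This keeps about one minute of events per hour, but the sampling rate is locked to whatever fractions the date_minute field can express.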

Is there a better way to sample events efficiently in Splunk?

asked Oct 02 '22 by Arel

2 Answers

If you are running a search and are not satisfied with its performance, I would suggest you either report-accelerate it or data-model-accelerate it. Alternatively, you can create your own tsidx files (the same files report and data model acceleration create automatically) with tscollect, then run tstats over them.
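As a rough sketch of the tscollect/tstats workflow, assuming placeholder index, sourcetype, and namespace names: first build the tsidx files from a base search,

index=main sourcetype=access_combined | tscollect namespace=mysummary

then run fast statistical queries against them, for example a daily event count:

| tstats count from mysummary groupby _time span=1d

Subsequent tstats searches read the pre-built tsidx files rather than scanning the raw events, which is why they run much faster over long time ranges.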

answered Oct 13 '22 by hobbes3

I found a related discussion on sampling on the Splunk Answers page below.

http://answers.splunk.com/answers/3743/is-it-possible-to-get-a-sample-set-of-search-results-rather-than-the-full-search-results

An alternative to filtering by date_minute or date_second is to filter events in a where clause using the _serial field or the random() function. For example:

* | where (_serial % 60) = 0 | ...

or

* | where (random() % 60) = 0 | ...

However, in both cases the search appears to do a full scan of the data. This may still be acceptable if you need the flexibility and the result feeds into a more expensive query. Otherwise, the date_second approach is significantly faster, because events are apparently indexed by that field. For example, the two queries above each ran in 3m 20s on a subset of data, while the query below ran in 11s on the same data.

* date_second=0 | ...

answered Oct 13 '22 by Arel