
Easier way to search through large files in Ruby?

Tags: ruby

I'm writing a simple log sniffer that will search logs for specific errors that are indicative of issues with the software I support. It allows the user to specify the path to the log and specify how many days back they'd like to search.

If users have log rollover turned off, the log files can sometimes get quite large. Currently I'm doing the following (though I'm not done with it yet):

File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
    if line.match(/\d+-\d+-\d+/)
      # etc...
    end
  end
end

The line.match obviously looks for the date format we use in the logs, and the rest of the logic will be below. However, is there a better way to search through the file without .each_line? If not, I'm totally fine with that. I just wanted to make sure I'm using the best resources available to me.

Thanks

ckbrumb asked May 06 '13 13:05


3 Answers

  • fgrep, either standalone or called via system('fgrep ...'), may be a faster solution.
  • file.readlines might be faster, but it's a time-space tradeoff: it loads the whole file into memory at once.
  • Take a look at this little research: the last approaches in it seem to be rather fast.
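
To make the time-space tradeoff in the second point concrete, here is a minimal sketch of both approaches. The method names and the DATE_AT_START pattern are made up for illustration, and I haven't benchmarked it, so treat any speed difference as something to measure on your own logs.

```ruby
# Hypothetical pattern: a date such as 2013-05-06 at the start of a line.
DATE_AT_START = /\A\d+-\d+-\d+/

# Streams the file one line at a time, so memory use stays roughly constant
# no matter how large the log is.
def count_dated_lines_streaming(path)
  count = 0
  File.foreach(path) { |line| count += 1 if line.match?(DATE_AT_START) }
  count
end

# Loads every line into an array first. Convenient for post-processing,
# but memory use grows with the size of the file.
def count_dated_lines_slurped(path)
  File.readlines(path).count { |line| line.match?(DATE_AT_START) }
end
```

Both return the same count; the streaming version is usually the safer default when rollover is off and the file may be huge.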

tkroman answered Nov 10 '22 01:11


Here are some coding hints...

Instead of:

File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|

use:

File.foreach(@log_file) do |line|
  next unless line[/\A\d+-\d+-\d+/]
  # ... handle the matching line here
end

foreach simplifies opening and looping over the file, and it closes the file for you when the loop finishes.

next unless... makes a tight loop, skipping every line that does NOT start with your target string. The less you do before figuring out whether you have a good line, the faster your code will run.

Using an anchor at the start of your pattern, like \A, gives the regex engine a major hint about where to look in the line, and allows it to bail out very quickly if the line doesn't match. Also, using line[/\A\d+-\d+-\d+/] is a bit more concise than calling line.match.
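
Putting those hints together with the "how many days back" requirement from the question, a rough sketch might look like the following. It assumes every entry starts with a YYYY-MM-DD date, which the question doesn't actually confirm; recent_lines, DATE_PREFIX and the today: parameter are names I made up for the example.

```ruby
require 'date'

# Assumed entry prefix: an ISO-style date (YYYY-MM-DD) at the start of the line.
DATE_PREFIX = /\A(\d{4})-(\d{2})-(\d{2})/

# Hypothetical helper: returns the lines dated within the last `days_back` days.
# `today` is a parameter only so the cutoff can be pinned when testing.
def recent_lines(path, days_back, today: Date.today)
  cutoff = today - days_back
  matches = []
  File.foreach(path) do |line|
    m = DATE_PREFIX.match(line)
    next unless m # cheap rejection of lines without a leading date

    year, month, day = m.captures.map(&:to_i)
    next unless Date.valid_date?(year, month, day) # skip impossible dates

    matches << line if Date.new(year, month, day) >= cutoff
  end
  matches
end
```

You'd call it as recent_lines(@log_file, 7) and then run the error checks only on what comes back. For very large logs you may prefer to yield each line instead of collecting them in an array.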

the Tin Man answered Nov 09 '22 23:11


If your log file is sorted by date, then you can avoid having to search through the entire file by doing a binary search. In this case you'd:

  1. Open the file like you are doing now.
  2. Use seek to jump to the middle of the file.
  3. Check whether the date at the beginning of the line is earlier or later than the date you are looking for.
  4. Continue splitting the file in halves until you find what you need.

I do, however, think your file needs to be very large for the above to make sense.

Edit

Here is some code which shows the basic idea. It finds a line containing the search date, not necessarily the first one. This can be fixed either with more binary searches or by doing a linear search from the last midpoint that did not contain the date. There also isn't a termination condition in case the date is not in the file. These small additions are left as an exercise for the reader :-)

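require 'date'

# Returns the byte offset of a line dated search_date (not necessarily the first).
# Does not terminate cleanly if the date is absent from the file.
def bin_fsearch(search_date, file)
  # the block form closes the file when we return
  File.open(file) do |f|
    search = {min: 0, max: f.size}

    loop do
      # go to file midpoint
      f.seek((search[:max] + search[:min]) / 2)

      # read in until EOL, discarding the partial line we landed in
      f.gets

      # record the actual mid-point we are using
      pos = f.pos

      # read in next line
      line = f.gets

      # get date from line
      line_date = Date.parse(line)

      if line_date < search_date
        search[:min] = f.pos
      elsif line_date > search_date
        search[:max] = pos
      else
        # found a matching line; return where it starts
        return pos
      end
    end
  end
end

offset = bin_fsearch(Date.new(2013, 5, 4), '/var/log/system.log')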
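
For anyone who wants the exercise filled in, here is one possible sketch that terminates and returns the first matching line. It is a variation on the code above, not a drop-in fix: it assumes every line starts with a YYYY-MM-DD date (so it uses Date.strptime on the first 10 bytes rather than Date.parse), and the names seek_line_start and first_offset_on_or_after are invented for the example.

```ruby
require 'date'

# Hypothetical helper: position `f` at the start of the first line that begins
# at or after byte `offset`, and return that position.
def seek_line_start(f, offset)
  if offset.zero?
    f.seek(0)
  else
    f.seek(offset - 1)
    f.gets # consume the rest of the line containing byte offset - 1
  end
  f.pos
end

# Returns the byte offset of the first line dated on or after `target`,
# or nil when every line is older (or the file is empty).
# Assumes lines are sorted by date and each starts with YYYY-MM-DD.
def first_offset_on_or_after(path, target)
  File.open(path, 'rb') do |f|
    lo = 0
    hi = f.size
    while lo < hi
      mid = (lo + hi) / 2
      pos = seek_line_start(f, mid)
      line = f.gets
      if line.nil? || Date.strptime(line[0, 10], '%Y-%m-%d') >= target
        hi = mid      # the first qualifying line is reachable from mid or earlier
      else
        lo = pos + 1  # the line starting at pos is too old; look past its start
      end
    end
    start = seek_line_start(f, lo)
    start < f.size ? start : nil
  end
end
```

Once you have the offset, you can open the log, seek to it and read forward with each_line, so only the recent part of the file is ever scanned. Multi-line entries such as stack traces would break the leading-date assumption and need extra handling.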

jbr answered Nov 10 '22 01:11