
Regexp search through a very large file

Tags: regex, ruby

I need to scan a 300MB text file with a regex.

  • Reading the whole file and assigning it to a variable consumes over 700 MB of RAM and then fails with a "cannot allocate memory" error.
  • A match can span two or three lines, so I cannot simply iterate line by line in a loop.

Is there any lazy method to do a full file scan with a regex without reading it into a separate variable?

UPD

Done. You can now use this function to read the file in chunks. Adapt it to your needs.

def prepare_session_hash(fname, regex_string, start = 0)
  @session_login_hash = {}
  regex = Regexp.new(regex_string)
  File.open(fname, 'rb') do |f|
    fsize = f.size
    bsize = [fsize / 8, 1].max   # read in up to eight chunks; at least 1 byte
    f.seek(start) if start > 0

    # Step back by a small overlap before each read so a match that
    # straddles a chunk boundary is not missed.
    overlap = 200

    loop do
      f.seek(f.tell - overlap) if f.tell >= overlap && f.tell < fsize
      buffer = f.read(bsize)
      return @session_login_hash unless buffer

      buffer.scan(regex) do |match|
        @session_login_hash[match[0]] = match[1]
      end
    end
  end
end
asked Dec 17 '12 by Alexander.Iljushkin


1 Answer

  1. Traverse the file in chunks instead of line by line, where chunks are delimited by occurrences of a frequently occurring character or pattern, say "X".
  2. Choose "X" so that it never appears inside your regex, i.e. a position where "X" occurs is a position where your regex can never match.
  3. Match your regex against the current chunk, extract the matches, and proceed to the next chunk.

Example:

This is string with multline numbers -2000
2223434
34356666
444564646
. These numbers can occur at 34345
567567 places, and on 67
87878 pages . The problem is to find a good
way to extract these more than 100
0 regexes without memory hogging.

In this text, suppose the desired pattern is numeric strings, e.g. /\d+/ (digit runs that may continue across line breaks). Then, instead of loading and processing the whole file, you can choose a chunk-delimiting pattern, say the FULL STOP "." in this case, and only read and process up to that pattern before moving on to the next chunk.

CHUNK#1:

This is string with multline numbers -2000
2223434
34356666
444564646
.

CHUNK#2:

These numbers can occur at 34345
567567 places, and on 67
87878 pages

and so on.
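A minimal sketch of this chunking idea in Ruby (the sample text and pattern are illustrative, and a StringIO stands in for the real file): IO#gets accepts a custom separator, so the file can be consumed one "."-terminated chunk at a time without ever holding the whole file in memory.

```ruby
require 'stringio'

# StringIO mimics File.open("big.txt", "rb") for this small demo.
text = "numbers -2000\n2223434\n34356666\n. more at 34345\n567567 places."
io   = StringIO.new(text)

matches = []
while (chunk = io.gets("."))        # read up to and including the next "."
  matches.concat(chunk.scan(/\d+/)) # chunk is small, so scanning it is cheap
end
matches  # => ["2000", "2223434", "34356666", "34345", "567567"]
```

Because the separator "." never appears inside a digit run, no match can straddle a chunk boundary, which is exactly the property step 2 above asks for.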

EDIT: Adding @Ranty's suggestion from the comments as well:

Or simply read some number of lines at a time, say 20. When you find a match within the window, discard everything up to the end of the match and append another 20 lines. There is no need to figure out a frequently occurring "X".
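A sketch of that sliding-window variant (the input, pattern, and 20-line cap are illustrative; as a simplification, the whole window is cleared after a match rather than just the text up to the match end):

```ruby
require 'stringio'

io      = StringIO.new("start 12\n34 end\nnothing here\nstart 56\n78 end\n")
pattern = /start (\d+)\n(\d+) end/   # a match that spans two lines
window  = []
found   = []

io.each_line do |line|
  window << line
  window.shift if window.size > 20   # keep at most 20 lines in memory
  if (m = window.join.match(pattern))
    found << [m[1], m[2]]
    window.clear                     # simplified: drop the whole window
  end
end
found  # => [["12", "34"], ["56", "78"]]
```

Memory use is bounded by the window size rather than the file size, which addresses both of the asker's constraints: multi-line matches work, and the full file is never loaded.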

answered Nov 15 '22 by DhruvPathak