
Multi-line regex search in whole file

I've found loads of examples on how to replace text in files using regex. However, it all boils down to two versions:
1. Iterate over all lines in the file and apply the regex to each single line.
2. Load the whole file into memory and apply the regex to it.

No. 2 is not feasible with "my" files: they are about 2 GiB.
As to No. 1: this is my current approach, but I was wondering: what if I need to apply a regex spanning more than one line?

Nils asked Oct 02 '09 13:10


2 Answers

Here's the answer: there is no easy way.

I found a StreamRegex class that might be able to do what I am looking for.
From what I could grasp of the algorithm:

  • Start at the beginning of the file with an empty buffer
  • Repeat while there is still something of the file left:
    • add a chunk of the file to the buffer
    • if there is a match in the buffer
      • mark the match
      • drop all data which appeared before the end of the match from the buffer

That way it is not necessary to load the full file -- or at least the chances of loading the full file into memory are reduced...
However, the worst case is that there is no match in the whole file: in that case the full file will end up in memory.
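
To make the buffering idea above concrete, here is a minimal sketch in Python. It is not the StreamRegex class itself, only my reading of the steps listed above. The names `stream_finditer`, `chunk_size` and `max_keep` are made up for this example, and `max_keep` is an extra assumption that is not part of the described algorithm: it caps the buffer so the no-match worst case stays bounded, at the price of missing any match longer than `max_keep` characters. Python's `re` module has no partial matching, so a match that touches the end of the buffer is simply held back until more data arrives.

```python
import io
import re


def stream_finditer(regex, stream, chunk_size=64 * 1024, max_keep=None):
    """Yield (absolute_offset, text) for each non-empty match in a text stream.

    regex    -- a compiled pattern (use re.DOTALL if '.' should cross lines)
    max_keep -- hypothetical safety valve, not part of the described
                algorithm: if set, only the last max_keep unmatched
                characters are kept in the buffer.
    """
    buffer = ""
    base = 0  # absolute offset of buffer[0] within the stream
    while True:
        chunk = stream.read(chunk_size)
        at_eof = chunk == ""
        buffer += chunk

        consumed = 0
        for m in regex.finditer(buffer):
            if m.end() == len(buffer) and not at_eof:
                # The match touches the end of the buffer, so more data
                # could still extend it; wait for the next chunk.
                break
            if m.end() > m.start():  # ignore empty matches
                yield base + m.start(), m.group()
                consumed = m.end()

        # Drop everything before the end of the last reported match.
        buffer = buffer[consumed:]
        base += consumed

        if at_eof:
            return
        if max_keep is not None and len(buffer) > max_keep:
            drop = len(buffer) - max_keep
            buffer = buffer[drop:]
            base += drop


if __name__ == "__main__":
    sample = io.StringIO("head /* one\ntwo */ mid /* three */ tail")
    block_comment = re.compile(r"/\*.*?\*/", re.DOTALL)
    for offset, text in stream_finditer(block_comment, sample, chunk_size=5):
        print(offset, repr(text))
```

For a real file you would pass an open text-mode file object instead of the `io.StringIO`. Note that this only reports matches; doing a replacement would additionally mean writing the dropped data and the substituted text to a second file as you go.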

Nils answered Sep 28 '22 01:09


Regex is not the way to go, especially not with these large amounts of text. Create a little parser of your own:

  • read the file line by line;
  • for each line:
    • loop through the line char by char keeping track of any opening/closing string literals
    • when you encounter '/*' (and you're not 'inside' a string), store that offset number and loop until you encounter the first '*/' and store that number as well

That will give you the start and end offsets of all the comment blocks. You should now be able to replace them by creating a temp file and copying the text from the original file into it (writing something else whenever you're inside a comment block, of course).
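
A rough sketch of such a parser in Python, under a few assumptions of mine: string literals are delimited by `"` or `'` with backslash escapes, comments do not nest, and line comments (`//`) are ignored. Instead of collecting the offsets first and copying afterwards, this version folds both steps into a single pass and writes to the destination while scanning. The function name `replace_block_comments` and the `replacement` parameter are invented for the example.

```python
import io


def replace_block_comments(src, dst, replacement=""):
    """Copy src to dst line by line, replacing each /* ... */ block.

    src and dst are text-mode file objects. An unterminated comment at the
    end of the input is dropped without writing the replacement.
    """
    in_comment = False
    quote = None  # quote character of the string literal we are inside, if any
    for line in src:
        out = []
        i = 0
        n = len(line)
        while i < n:
            ch = line[i]
            if in_comment:
                if line.startswith("*/", i):
                    in_comment = False
                    out.append(replacement)
                    i += 2
                else:
                    i += 1  # skip comment text, including newlines
            elif quote:
                out.append(ch)
                if ch == "\\" and i + 1 < n:
                    out.append(line[i + 1])  # keep the escaped character as is
                    i += 1
                elif ch == quote:
                    quote = None
                i += 1
            elif line.startswith("/*", i):
                in_comment = True
                i += 2
            else:
                if ch in "\"'":
                    quote = ch
                out.append(ch)
                i += 1
        dst.write("".join(out))


if __name__ == "__main__":
    source = io.StringIO('a = "/* not a comment */"; /* real\ncomment */ b = 1;\n')
    result = io.StringIO()
    replace_block_comments(source, result, replacement="/* removed */")
    print(result.getvalue(), end="")
```

Only one line plus a few state variables are held in memory at a time, so the file size does not matter much (a single enormous line would still be read whole). With real files, `dst` would be the temp file, which you rename over the original once the copy has finished.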

Edit: source files of 2GiB??

Bart Kiers answered Sep 28 '22 00:09