
Finding lines in a text file matching a regular expression

Can anyone explain how I could use regular expressions in Ruby to return only the matches of a string?

For example, if the code reads in a .txt file with a series of names in it:

John Smith
James Jones
David Brown
Tom Davidson
etc etc

...and the text to match is typed in as 'ohn', it should return just 'John Smith' and none of the other names.

Jbod asked May 14 '11 15:05


3 Answers

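Note: File.each_line does not exist as a class method (each_line is an instance method on an open file). To iterate over a file's lines by name, use IO.foreach, which File inherits as File.foreach. For instance: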

[1] pry(main)> IO.foreach('./.bashrc') do |l|
[1] pry(main)*   puts l
[1] pry(main)* end
export PATH=~/bin:$PATH
export EDITOR='vi'
export VISUAL=$EDITOR

Progress happens and things change.


Here are some different ways to get where you're going.

Notice first I'm using a more idiomatic way of writing the code for reading lines from a file. Ruby's IO and File libraries make it very easy to open, read and close the file in a nice neat package.

# Plain substring lookup: String#[] returns nil when 'ohn' is absent.
File.foreach('file.txt') do |li|
  puts li if li['ohn']
end

That looks for 'ohn' anywhere in the line, but doesn't bother with a regular expression.

# Same lookup, but with a regular expression.
File.foreach('file.txt') do |li|
  puts li if li[/ohn/]
end

That looks for the same string, only it uses a regex to get there. Functionally it's the same as the first example.

# Match 'ohn' only where it ends a word.
File.foreach('file.txt') do |li|
  puts li if li[/ohn\b/]
end

This is a slightly smarter way to look for names that end with 'ohn'. It uses a regex that also requires the pattern to occur at the end of a word; \b means "word boundary".
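
To make the effect of \b concrete, here is a minimal sketch that runs both patterns over an in-memory list. The name "Alice Johnson" is a hypothetical addition, not part of the question's file; it is there only because it contains 'ohn' in the middle of a word.

```ruby
# Sample data: the first two names come from the question,
# "Alice Johnson" is a made-up entry for illustration.
names = ["John Smith", "Tom Davidson", "Alice Johnson"]

# /ohn/ matches 'ohn' anywhere in the string.
anywhere = names.select { |name| name =~ /ohn/ }

# /ohn\b/ matches only when 'ohn' is followed by a word boundary.
at_word_end = names.select { |name| name =~ /ohn\b/ }

puts anywhere.inspect     # ["John Smith", "Alice Johnson"]
puts at_word_end.inspect  # ["John Smith"]
```

"Johnson" drops out of the second result because the 'ohn' in it is followed by another word character, so there is no boundary at that position.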

Also, when reading files, it's important to always think ahead about whether the file being read could ever exceed the RAM available to your app. It's easy to read an entire file into memory in one pass, then process it from RAM, but you can cripple or kill your app or machine if you exceed the physical RAM available to you.


Do you know if the code shown by the other answers is in fact loading the entire file into RAM or is somehow optimized by streaming from the readlines function to the select function?

From the IO.readlines documentation:

Reads the entire file specified by name as individual lines, and returns those lines in an array. Lines are separated by sep.

An additional consideration is memory allocation during a large, bulk read. Even if you have sufficient RAM, you can run into situations where a language chokes as it reads in the data, finds it hasn't allocated enough memory to the variable, and has to pause as it grabs more. That cycle repeats until the entire file is loaded.

I became sensitive to this many years ago when I was loading a very big data file into a Perl app on HP's biggest mini, that I managed. The app would pause for a couple seconds periodically and I couldn't figure out why. I dropped into the debugger and couldn't find the problem. Finally, by tracing the run using old-school print statements I isolated the pauses to a file "slurp". I had plenty of RAM, and plenty of processing power, but Perl wasn't allocating enough memory. I switched to reading line by line and the app flew through its processing. Ruby, like Perl, has good I/O, and can read a big file very quickly when it's reading line-by-line. I have never found a good reason for slurping a text file, except when it's possible to have content I want spread across several lines, but that is not a common occurrence.
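
If you still want an array of matches without slurping the file, one possible approach is to call File.foreach without a block, which returns an Enumerator, and filter that. This is only a sketch: the temporary file and its contents are stand-ins for your real data, built here so the example runs on its own.

```ruby
require 'tempfile'

# Build a small throwaway file so the sketch is self-contained.
file = Tempfile.new('names')
file.write("John Smith\nJames Jones\nDavid Brown\nTom Davidson\n")
file.close

# Without a block, File.foreach returns an Enumerator that yields one
# line at a time, so only the matching lines are collected in memory.
matches = File.foreach(file.path).grep(/ohn/).map(&:chomp)

puts matches.inspect  # ["John Smith"]

file.unlink
```

The resulting array still grows with the number of matches, so for very large result sets it is safer to process each line inside the block rather than collect them.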

the Tin Man answered Sep 28 '22 05:09


Maybe I'm not understanding the problem fully, but you could do something like:

File.readlines("path/to/file.txt").select { |line| line =~ /ohn/ }

to get an array of all the lines that match your criteria.

jxpx777 answered Sep 28 '22 04:09


query = 'ohn'
# chomp: true (Ruby 2.4 and later) strips the trailing newline from each line
names = File.readlines('names.txt', chomp: true)
matches = names.select { |name| name[/#{query}/i] }
#=> ["John Smith"]

Remove the i at the end of the regex if you wish the query to be case sensitive.
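
One caveat with interpolating typed-in text into a regex: characters such as . or * are treated as pattern syntax. If the query should be matched literally, Regexp.escape may help. The sketch below uses made-up names and a made-up query purely to show the difference.

```ruby
# Hypothetical data and query, chosen so the '.' in the query matters.
names = ["J. R. Jones", "Jo Ryan"]
query = 'j. r.'

# Interpolated as-is, each '.' matches any character.
unescaped = names.select { |name| name[/#{query}/i] }

# Regexp.escape turns the query into a literal pattern.
escaped = names.select { |name| name[/#{Regexp.escape(query)}/i] }

puts unescaped.inspect  # ["J. R. Jones", "Jo Ryan"]
puts escaped.inspect    # ["J. R. Jones"]
```

Leave the query unescaped if you do want the user to be able to type a regular expression.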

Douglas F Shearer answered Sep 28 '22 05:09