
How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern?

So essentially I'm looking for specifically a 4 digit code within two angle brackets within a text file. I know that I need to open the text file and then parse line by line, but I am not sure the best way to go about structuring my code after checking "for line in file".

I think I can either split it, strip it, or partition it, but I also wrote a regex and compiled it, and since that returns a match object I don't think I can combine it with those string-based operations. I'm also not sure whether my regex is greedy enough...

I'd like to store all instances of those found hits as strings within either a tuple or a list.

Here is my regex:

regex = re.compile("(<(\d{4,5})>)?")

I don't think I need to include all that much code considering it's fairly basic so far.

Carl Carlson asked May 07 '12 05:05



2 Answers

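import re

# Raw string so the backslash in \d reaches the regex engine untouched.
pattern = re.compile(r"<(\d{4,5})>")

# The with block closes the file once the loop finishes.
with open('test.txt') as f:
    for i, line in enumerate(f, start=1):
        for match in pattern.finditer(line):
            # group() is the full match, e.g. <1234>; group(1) is just the digits.
            print('Found on line %s: %s' % (i, match.group()))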

A couple of notes about the regex:

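  • You don't need the trailing ? or the outer (...): the ? makes the whole group optional, so the pattern can also match an empty string, and a single inner group is enough if you only want the number itself without the angle brackets
  • It matches either 4 or 5 digits between the angle brackets; use \d{4} if you want exactly four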
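To make the first note concrete, here is a small sketch comparing the two patterns with findall. The sample string and variable names are made up for illustration and are not from the question.

```python
import re

sample = "id <1234> and <56789>"  # hypothetical sample text

# Asker's pattern: the trailing ? makes the whole group optional, so the
# regex also matches the empty string wherever no code is present.
optional = re.compile(r"(<(\d{4,5})>)?")
# Simplified pattern: one capture group holding just the digits.
required = re.compile(r"<(\d{4,5})>")

optional_hits = optional.findall(sample)
required_hits = required.findall(sample)

print(optional_hits)  # real hits mixed in with ('', '') entries
print(required_hits)  # only the captured digit strings
```

With the optional version you would have to filter out the blank tuples yourself, which is why dropping the ? is usually the simpler route.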

Update: It's important to understand that what a regex matches and what it captures can be quite different. The regex in my snippet matches the whole pattern, angle brackets included, but captures only the number inside them: match.group() returns the full match, while match.group(1) returns just the digits.
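A minimal sketch of that difference, using a made-up input line (the string is an assumption, not taken from the asker's file):

```python
import re

pattern = re.compile(r"<(\d{4,5})>")
match = pattern.search("code: <1234>")  # hypothetical input line

full_match = match.group()    # the whole match, brackets included
digits_only = match.group(1)  # just the first capture group

print(full_match)
print(digits_only)
```

If the goal is a list of plain digit strings, collecting match.group(1) in the loop is probably what you want.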

More about regex in Python can be found here: Regular Expression HOWTO

Eli Bendersky answered Oct 06 '22 01:10


Doing it in one bulk read:

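import re

filename = 'test.txt'  # placeholder path; substitute your own file

# The with block closes the file even if read() raises.
with open(filename, 'r') as textfile:
    filetext = textfile.read()

# No trailing ? here: an optional group would also match the empty
# string and fill the result with blank entries.
matches = re.findall(r"<(\d{4,5})>", filetext)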

Line by line:

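import re

filename = 'test.txt'  # placeholder path; substitute your own file
matches = []
reg = re.compile(r"<(\d{4,5})>")
with open(filename, 'r') as textfile:
    for line in textfile:
        # findall returns the captured digits for each hit on this line
        matches.extend(reg.findall(line))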

Note that the matches returned this way carry no position information, so they are of little use beyond counting unless you also track an offset:

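import re

filename = 'test.txt'  # placeholder path; substitute your own file
matches = []
offset = 0  # character offset of the current line's start within the file
reg = re.compile(r"<(\d{4,5})>")
with open(filename, 'r') as textfile:
    for line in textfile:
        # store this line's hits together with where the line begins
        matches.append((reg.findall(line), offset))
        offset += len(line)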

But it still just makes more sense to read the whole file in at once.
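One reason the whole-file approach can be simpler: re.finditer yields match objects whose start() already gives the offset within the text, so no manual counter is needed. The sketch below assumes the simplified pattern without the trailing ?; the helper name find_codes and the sample text are hypothetical.

```python
import re

CODE_RE = re.compile(r"<(\d{4,5})>")

def find_codes(text):
    """Return (code, offset) pairs for every bracketed code in text."""
    return [(m.group(1), m.start()) for m in CODE_RE.finditer(text)]

# Hypothetical file contents; in practice this would be the string
# returned by reading the file in one go.
filetext = "first <1234>\nsecond line\nthird <56789> <0001>\n"
hits = find_codes(filetext)
print(hits)
```

Each pair holds the digits as a string plus the position of the opening bracket, which covers both storing the hits and locating them later.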

Josiah answered Oct 06 '22 01:10