So essentially I'm looking for specifically a 4 digit code within two angle brackets within a text file. I know that I need to open the text file and then parse line by line, but I am not sure the best way to go about structuring my code after checking "for line in file".
I think I can either somehow split it, strip it, or partition, but I also wrote a regex which I used compile on and so if that returns a match object I don't think I can use that with those string based operations. Also I'm not sure whether my regex is greedy enough or not...
I'd like to store all instances of those found hits as strings within either a tuple or a list.
Here is my regex:
regex = re.compile("(<(\d{4,5})>)?")
I don't think I need to include all that much code considering its fairly basic so far.
Use re. finditer() to find patterns in a text file compile(pattern) to return a regular expression object. Use the for-loop syntax for item in iterable with iterable as an opened file using open(file) with the file name as file to loop over each line of the file.
Most characters, including all letters ( a-z and A-Z ) and digits ( 0-9 ), match itself. For example, the regex x matches substring "x" ; z matches "z" ; and 9 matches "9" . Non-alphanumeric characters without special meaning in regex also matches itself. For example, = matches "=" ; @ matches "@" .
import re
pattern = re.compile("<(\d{4,5})>")
for i, line in enumerate(open('test.txt')):
for match in re.finditer(pattern, line):
print 'Found on line %s: %s' % (i+1, match.group())
A couple of notes about the regex:
?
at the end and the outer (...)
if you don't want to match the number with the angle brackets, but only want the number itselfUpdate: It's important to understand that the match and capture in a regex can be quite different. The regex in my snippet above matches the pattern with angle brackets, but I ask to capture only the internal number, without the angle brackets.
More about regex in python can be found here : Regular Expression HOWTO
Doing it in one bulk read:
import re
textfile = open(filename, 'r')
filetext = textfile.read()
textfile.close()
matches = re.findall("(<(\d{4,5})>)?", filetext)
Line by line:
import re
textfile = open(filename, 'r')
matches = []
reg = re.compile("(<(\d{4,5})>)?")
for line in textfile:
matches += reg.findall(line)
textfile.close()
But again, the matches that returns will not be useful for anything except counting unless you added an offset counter:
import re
textfile = open(filename, 'r')
matches = []
offset = 0
reg = re.compile("(<(\d{4,5})>)?")
for line in textfile:
matches += [(reg.findall(line),offset)]
offset += len(line)
textfile.close()
But it still just makes more sense to read the whole file in at once.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With