I have a file with lots of sections in this format:
section_name_1 <attribute_1:value> <attribute_2:value> ... <attribute_n:value> {
field_1 finish_num:start_num some_text ;
field_2 finish_num:start_num some_text ;
...
field_n finish_num:start_num some_text;
};
section_name_2 ...
... and so on
The file can be hundreds of thousands of lines long. The number of attributes and fields for each section can be different. I'd like to build a few dictionaries to hold some of these values. I have a separate dictionary already which holds all the possible 'attribute' values.
import os, re
from collections import defaultdict

def mapFile(myFile, attributeMap_d):
    valueMap_d = {}
    fieldMap_d = defaultdict(dict)
    for attributeName in attributeMap_d:
        valueMap_d[attributeName] = {}
    count = 0
    with open(myFile, "r") as fh:
        for line in fh:
            # only look for lines with <
            if '<' in line:
                # match all attribute:value pairs inside <> brackets
                attributeAllMatch = re.findall(r'<(\S+):(\S+)>', line)
                attributeAllMatchLen = len(attributeAllMatch)
                count = 0
                sectionNameMatch = re.match(r'(\S+)\s+<', line)
                # store each section name and its associated attribute and value into dict
                for attributeName in attributeMap_d:
                    for element in attributeAllMatch:
                        if element[0] == attributeName:
                            valueMap_d[attributeName][sectionNameMatch.group(1).rstrip()] = element[1].rstrip()
                            count += 1
                    # stop searching if all attributes in section already matched
                    if count == attributeAllMatchLen: break
                nextLine = next(fh)
                # in between each pair of curly brackets, store all the field names and start/stop nums into dict
                # this while loop is very slow...
                while not "};" in nextLine:
                    fieldMatch = re.search(r'(\S+)\s+(\d+):(\d+)', nextLine)
                    if fieldMatch:
                        fieldMap_d[sectionNameMatch.group(1)][fieldMatch.group(1)] = [fieldMatch.group(2), fieldMatch.group(3)]
                    nextLine = next(fh)
    return valueMap_d, fieldMap_d
My problem is that the while loop that matches all the field values is noticeably slower than the rest of the code: according to cProfile, the function takes 2.2 s, but only 0.5 s if I remove the while loop. I'm wondering what I can do to speed it up.
Regex is great when you need fancy pattern matching, but when you don't need that it can be faster to parse text using str methods. Here's some code that compares the timing of doing the field parsing using your regex vs doing it with str.split.
First I create some fake test data, which I store in the rows list. Doing this makes my demo code simpler than if I were reading the data from a file, but more importantly, it eliminates the overhead of file reading, so we can more accurately compare the parsing speed.
BTW, you should save sectionNameMatch.group(1) outside the field parsing loop, rather than making that call on every field line.
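Something like this (an untested sketch based on your snippet, using the fieldMap_d defaultdict from your code):

sectionName = sectionNameMatch.group(1)
sectionFields = fieldMap_d[sectionName]   # defaultdict(dict) creates the inner dict on first access
nextLine = next(fh)
while "};" not in nextLine:
    fieldMatch = re.search(r'(\S+)\s+(\d+):(\d+)', nextLine)
    if fieldMatch:
        sectionFields[fieldMatch.group(1)] = [fieldMatch.group(2), fieldMatch.group(3)]
    nextLine = next(fh)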
Firstly, I'll illustrate that my code parses the data correctly. :)
import re
from pprint import pprint
from time import perf_counter

# Make some test data
num = 10
rows = []
for i in range(1, num):
    j = 100 * i
    rows.append(' field_{:03} {}:{} some_text here ;'.format(i, j, j - 50))
rows.append('};')

print('\n'.join(rows))

# Select whether to use regex to do the parsing or `str.split`
use_regex = True
print('Testing {}'.format(('str.split', 'regex')[use_regex]))

fh = iter(rows)
fieldMap = {}
nextLine = next(fh)

start = perf_counter()
if use_regex:
    while not "};" in nextLine:
        fieldMatch = re.search(r'(\S+)\s+(\d+):(\d+)', nextLine)
        if fieldMatch:
            fieldMap[fieldMatch.group(1)] = [fieldMatch.group(2), fieldMatch.group(3)]
        nextLine = next(fh)
else:
    while not "};" in nextLine:
        if nextLine:
            data = nextLine.split(maxsplit=2)
            fieldMap[data[0]] = data[1].split(':')
        nextLine = next(fh)

print('time: {:.6f}'.format(perf_counter() - start))
pprint(fieldMap)
output
field_001 100:50 some_text here ;
field_002 200:150 some_text here ;
field_003 300:250 some_text here ;
field_004 400:350 some_text here ;
field_005 500:450 some_text here ;
field_006 600:550 some_text here ;
field_007 700:650 some_text here ;
field_008 800:750 some_text here ;
field_009 900:850 some_text here ;
};
Testing regex
time: 0.001946
{'field_001': ['100', '50'],
'field_002': ['200', '150'],
'field_003': ['300', '250'],
'field_004': ['400', '350'],
'field_005': ['500', '450'],
'field_006': ['600', '550'],
'field_007': ['700', '650'],
'field_008': ['800', '750'],
'field_009': ['900', '850']}
Here's the output with use_regex = False; I won't bother re-printing the input data.
Testing str.split
time: 0.000100
{'field_001': ['100', '50'],
'field_002': ['200', '150'],
'field_003': ['300', '250'],
'field_004': ['400', '350'],
'field_005': ['500', '450'],
'field_006': ['600', '550'],
'field_007': ['700', '650'],
'field_008': ['800', '750'],
'field_009': ['900', '850']}
Now for the real test. I'll set num = 200000 and comment out the lines that print the input & output data.
Testing regex
time: 3.640832
Testing str.split
time: 2.480094
As you can see, the regex version is around 50% slower.
Those timings were obtained on my ancient 2GHz 32-bit machine running Python 3.6.0, so your speeds may be different. ;) If your Python doesn't have time.perf_counter, you can use time.time instead.
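For completeness, here's a rough sketch of how the str.split approach could slot into your mapFile inner loop. It's untested, and it assumes every field line really does look like "field_name finish_num:start_num some_text ;" (lines that don't are simply skipped):

sectionName = sectionNameMatch.group(1)
sectionFields = fieldMap_d[sectionName]
nextLine = next(fh)
while "};" not in nextLine:
    data = nextLine.split(maxsplit=2)
    # only keep lines of the form "name finish:start ..."
    if len(data) >= 2 and ':' in data[1]:
        sectionFields[data[0]] = data[1].split(':')
    nextLine = next(fh)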