How to iterate through this text file faster?

I have a file with lots of sections in this format:

section_name_1 <attribute_1:value> <attribute_2:value> ... <attribute_n:value> {
    field_1 finish_num:start_num some_text ;
    field_2 finish_num:start_num some_text ;
    ...
    field_n finish_num:start_num some_text;
};

section_name_2 ...
... and so on

The file can be hundreds of thousands of lines long. The number of attributes and fields for each section can be different. I'd like to build a few dictionaries to hold some of these values. I already have a separate dictionary that holds all the possible attribute values.

import re
from collections import defaultdict

def mapFile(myFile, attributeMap_d):
    valueMap_d = {}
    fieldMap_d = defaultdict(dict)

    for attributeName in attributeMap_d:
        valueMap_d[attributeName] = {}

    # open in text mode so `'<' in line` and the regexes operate on str, not bytes
    with open(myFile, "r") as fh:
        for line in fh:
            # only section header lines contain <
            if '<' in line:
                # match all attribute:value pairs inside <> brackets
                attributeAllMatch = re.findall(r'<(\S+):(\S+)>', line)
                attributeAllMatchLen = len(attributeAllMatch)
                count = 0

                sectionNameMatch = re.match(r'(\S+)\s+<', line)

                # store each section name and its associated attribute and value into dict
                for attributeName in attributeMap_d:
                    for element in attributeAllMatch:
                        if element[0] == attributeName:
                            valueMap_d[attributeName][sectionNameMatch.group(1).rstrip()] = element[1].rstrip()
                            count += 1
                    # stop searching if all attributes in section already matched
                    if count == attributeAllMatchLen:
                        break

                nextLine = next(fh)

                # in between each pair of squiggly brackets, store all the field names
                # and start/stop nums into dict
                # this while loop is very slow...
                while "};" not in nextLine:
                    fieldMatch = re.search(r'(\S+)\s+(\d+):(\d+)', nextLine)
                    if fieldMatch:
                        fieldMap_d[sectionNameMatch.group(1)][fieldMatch.group(1)] = [fieldMatch.group(2), fieldMatch.group(3)]
                    nextLine = next(fh)

    return valueMap_d, fieldMap_d  # return fieldMap_d too; it was built but never returned

My problem is that the while loop that matches the field values is noticeably slower than the rest of the code: according to cProfile the run takes 2.2s, but only 0.5s if I remove the while loop. I'm wondering what I can do to speed it up.

asked Nov 25 '17 by Colin
1 Answer

Regex is great when you need fancy pattern matching, but when you don't, it can be faster to parse text with plain str methods. Here's some code that compares the timing of the field parsing done with your regex versus with str.split.

First I create some fake test data which I store in the rows list. Doing this makes my demo code simpler than if I were reading the data from a file, but more importantly, it eliminates the overhead of file reading, so we can more accurately compare the parsing speed.

BTW, you should save sectionNameMatch.group(1) to a local variable outside the field parsing loop, rather than making that method call on every field line.
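
For example, a minimal sketch of that change against the loop from your question (same names as your code):

sectionName = sectionNameMatch.group(1)  # one .group() call per section
nextLine = next(fh)
while "};" not in nextLine:
    fieldMatch = re.search(r'(\S+)\s+(\d+):(\d+)', nextLine)
    if fieldMatch:
        fieldMap_d[sectionName][fieldMatch.group(1)] = [fieldMatch.group(2), fieldMatch.group(3)]
    nextLine = next(fh)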

Firstly, I'll illustrate that my code parses the data correctly. :)

import re
from pprint import pprint
from time import perf_counter

# Make some test data
num = 10
rows = []
for i in range(1, num):
    j = 100 * i
    rows.append(' field_{:03} {}:{} some_text here ;'.format(i, j, j - 50))
rows.append('};')
print('\n'.join(rows))

# Select whether to use regex to do the parsing or `str.split`
use_regex = True
print('Testing {}'.format(('str.split', 'regex')[use_regex]))

fh = iter(rows)
fieldMap = {}

nextLine = next(fh)
start = perf_counter()
if use_regex:
    while "};" not in nextLine:
        fieldMatch = re.search(r'(\S+)\s+(\d+):(\d+)', nextLine)
        if fieldMatch:
            fieldMap[fieldMatch.group(1)] = [fieldMatch.group(2), fieldMatch.group(3)]
        nextLine = next(fh)
else:
    while "};" not in nextLine:
        if nextLine:
            # maxsplit=2 splits off the field name and the num pair,
            # leaving the trailing text in one piece
            data = nextLine.split(maxsplit=2)
            fieldMap[data[0]] = data[1].split(':')
        nextLine = next(fh)

print('time: {:.6f}'.format(perf_counter() - start))
pprint(fieldMap)

output

 field_001 100:50 some_text here ;
 field_002 200:150 some_text here ;
 field_003 300:250 some_text here ;
 field_004 400:350 some_text here ;
 field_005 500:450 some_text here ;
 field_006 600:550 some_text here ;
 field_007 700:650 some_text here ;
 field_008 800:750 some_text here ;
 field_009 900:850 some_text here ;
};
Testing regex
time: 0.001946
{'field_001': ['100', '50'],
 'field_002': ['200', '150'],
 'field_003': ['300', '250'],
 'field_004': ['400', '350'],
 'field_005': ['500', '450'],
 'field_006': ['600', '550'],
 'field_007': ['700', '650'],
 'field_008': ['800', '750'],
 'field_009': ['900', '850']}

Here's the output with use_regex = False; I won't bother re-printing the input data.

Testing str.split
time: 0.000100
{'field_001': ['100', '50'],
 'field_002': ['200', '150'],
 'field_003': ['300', '250'],
 'field_004': ['400', '350'],
 'field_005': ['500', '450'],
 'field_006': ['600', '550'],
 'field_007': ['700', '650'],
 'field_008': ['800', '750'],
 'field_009': ['900', '850']}

Now for the real test. I'll set num = 200000 and comment out the lines that print the input & output data.

Testing regex
time: 3.640832

Testing str.split
time: 2.480094

As you can see, the regex version is around 50% slower.
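
Plugged back into the while loop of your mapFile, the str.split version would look something like this (a sketch; it assumes each field line starts with the field name followed by the finish_num:start_num pair, as in your sample data):

sectionName = sectionNameMatch.group(1)
nextLine = next(fh)
while "};" not in nextLine:
    # field lines look like: field_name finish_num:start_num some_text ;
    data = nextLine.split(maxsplit=2)
    if len(data) >= 2 and ':' in data[1]:
        fieldMap_d[sectionName][data[0]] = data[1].split(':')
    nextLine = next(fh)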

Those timings were obtained on my ancient 2GHz 32-bit machine running Python 3.6.0, so your speeds may be different. ;) If your Python doesn't have time.perf_counter, you can use time.time instead.
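
For example, a small compatibility shim (perf_counter was added in Python 3.3):

try:
    from time import perf_counter
except ImportError:  # Python < 3.3
    from time import time as perf_counter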

answered by PM 2Ring