
Fastest way to grep multiple values from file in python

Tags: python, grep

  • I have a file of 300m lines (inputFile), all with 2 columns separated by a tab.
  • I also have a list of 1000 unique items (vals).

I want to create a dictionary with column 1 as key and column 2 as value for all lines in inputFile where the first column occurs in vals. A few items in vals do not occur in the file; these values have to be saved in a new list. I can use up to 20 threads to speed up this process.

What is the fastest way to achieve this?

My best try till now:

import subprocess

newDict = {}
foundVals = []
# Build one grep pattern with an anchored alternative per value
cmd = "grep \"^" + vals[0] + "[[:space:]]"
for val in vals[1:]:
    cmd = cmd + "\|^" + val + "[[:space:]]"
cmd = cmd + "\" " + self.inputFile
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in iter(p.stdout.readline, ''):
    info = line.split()
    foundVals.append(info[0])
    newDict[info[0]] = info[1]
p.wait()
found = set(foundVals)
notFound = [x for x in vals if x not in found]

Example inputFile:

2       9913
3       9913
4       9646
...
594592886       32630
594592888       32630
594592890       32630

vals:

[1,2,594592888]

wanted dictionary:

{2:9913,594592888:32630}

And in notFound:

[1]
asked Mar 18 '14 by Jetse

1 Answer

You clarified in a comment that each key occurs at most once in the data. It follows from that and the fact that there are only 1000 keys that the amount of work being done in Python is trivial; almost all your time is spent waiting for output from grep. Which is fine; your strategy of delegating line extraction to a specialized utility remains sound. But it means that performance gains have to be found on the line-extraction side.

You can speed things up some by optimizing your regex. E.g., instead of

^266[[:space:]]\|^801[[:space:]]\|^810[[:space:]]

you could use:

^\(266\|801\|810\)[[:space:]]

so that the anchor doesn't have to be separately matched for each alternative. I see about a 15% improvement on test data (10 million rows, 25 keys) with that change.
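
Building that grouped pattern from vals in Python is a one-liner. A minimal sketch (assuming vals holds plain strings with no regex metacharacters; inputFile is the path from the question):

import subprocess

# One anchor, one group, one alternative per key: ^\(266\|801\|810\)[[:space:]]
pattern = r"^\(" + r"\|".join(vals) + r"\)[[:space:]]"
# Passing an argument list (no shell=True) sidesteps shell quoting entirely.
p = subprocess.Popen(["grep", pattern, inputFile], stdout=subprocess.PIPE)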

A further optimization is to unify common prefixes in the alternation: 266\|801\|810 can be replaced with the equivalent 266\|8\(01\|10\). Rewriting the 25-key regex in this way gives close to a 50% speedup on test data.
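
If you would rather automate that factoring than rewrite patterns by hand, one way is to build a character trie of the keys and emit a basic regular expression from it. A rough sketch (hypothetical helpers; assumes no key is a prefix of another and keys contain no regex metacharacters):

def build_trie(keys):
    # Character trie, e.g. {'2': {'6': {'6': {}}}, '8': {'0': {'1': {}}, '1': {'0': {}}}}
    trie = {}
    for key in keys:
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
    return trie

def trie_to_bre(node):
    # Emit a grep BRE that factors shared prefixes,
    # e.g. ['266', '801', '810'] -> \(266\|8\(01\|10\)\)
    branches = []
    for ch, child in sorted(node.items()):
        branches.append(ch + trie_to_bre(child) if child else ch)
    if len(branches) == 1:
        return branches[0]
    return r"\(" + r"\|".join(branches) + r"\)"

pattern = "^" + trie_to_bre(build_trie(vals)) + "[[:space:]]"

For the keys 266, 801 and 810 this produces ^\(266\|8\(01\|10\)\)[[:space:]]; the extra outer group is harmless to grep.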

At this point grep is starting to show its limits. It seems that it's CPU-bound: iostat reveals that each successive improvement in the regex increases the number of IO requests per second while grep is running. And re-running grep with a warmed page cache and the --mmap option doesn't speed things up (as it likely would if file IO were a bottleneck). Greater speed therefore probably requires a utility with a faster regex engine.

One such is ag (the Silver Searcher), whose regex implementation also performs automatic optimization, so you needn't do much hand-tuning. While I haven't been able to get grep to process the test data in less than ~12s on my machine, ag finishes in ~0.5s for all of the regex variants described above.
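
If you end up calling ag from the same Python wrapper, the pipeline barely changes. A hedged sketch (ag uses PCRE-style syntax, so the alternation needs no backslash escaping; the --nofilename and --nonumbers flags are an assumption here to keep the output to bare matching lines, so check your version's --help):

import subprocess

# PCRE-style pattern: no \( \| \) escaping needed, unlike grep's basic regexes.
pattern = "^(" + "|".join(vals) + ")[[:space:]]"
# --nofilename / --nonumbers assumed to suppress filename and line-number prefixes.
p = subprocess.Popen(["ag", "--nofilename", "--nonumbers", pattern, inputFile],
                     stdout=subprocess.PIPE)

newDict = {}
for line in p.stdout:
    key, value = line.split()
    newDict[key] = value
p.wait()
notFound = [v for v in vals if v not in newDict]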

answered Sep 30 '22 by Alp