I want to create a dictionary with column 1 as key and column 2 as value for all lines in inputFile where the first column occurs in vals. A few items in vals do not occur in the file; these values have to be saved in a new list. I can use up to 20 threads to speed up this process.
What is the fastest way to achieve this?
My best attempt so far:
import subprocess

newDict = {}
foundVals = []

# Build one grep regex with an anchored alternative per wanted key
cmd = "grep \"^" + str(vals[0]) + "[[:space:]]"
for val in vals[1:]:
    cmd = cmd + "\|^" + str(val) + "[[:space:]]"
cmd = cmd + "\" " + self.inputFile

p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in iter(p.stdout.readline, ''):
    info = line.split()
    foundVals.append(info[0])
    newDict.update({info[0]: info[1]})
p.wait()

foundSet = set(foundVals)
notFound = [x for x in vals if str(x) not in foundSet]
Example inputFile:
2 9913
3 9913
4 9646
...
594592886 32630
594592888 32630
594592890 32630
vals:
[1,2,594592888]
wanted dictionary:
{2:9913,594592888:32630}
And in notFound:
[1]
You clarified in a comment that each key occurs at most once in the data. It follows from that, and from the fact that there are only 1000 keys, that the amount of work being done in Python is trivial; almost all of your time is spent waiting for output from grep. Which is fine; your strategy of delegating line extraction to a specialized utility remains sound. But it means that performance gains have to be found on the line-extraction side.
You can speed things up somewhat by optimizing your regex. For example, instead of
^266[[:space:]]\|^801[[:space:]]\|^810[[:space:]]
you could use:
^\(266\|801\|810\)[[:space:]]
so that the anchor doesn't have to be separately matched for each alternative. I see about a 15% improvement on test data (10 million rows, 25 keys) with that change.
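In Python, building that single-anchor form of the pattern from vals might look something like the sketch below (the helper name build_anchored_regex is just for illustration, not part of the original code):

# Sketch: build the single-anchor BRE alternation ^\(k1\|k2\|...\)[[:space:]]
def build_anchored_regex(keys):
    # grep's basic regexes use \( \) for grouping and \| for alternation
    return "^\\(" + "\\|".join(str(k) for k in keys) + "\\)[[:space:]]"

# build_anchored_regex([266, 801, 810]) -> ^\(266\|801\|810\)[[:space:]]
# cmd = "grep \"" + build_anchored_regex(vals) + "\" " + self.inputFile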
A further optimization is to unify common prefixes in the alternation: 266\|801\|810 can be replaced with the equivalent 266\|8\(01\|10\). Rewriting the 25-key regex in this way gives close to a 50% speedup on test data.
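That factoring can be automated. Here is a rough sketch that does one level of grouping on the leading character (a full trie-based rewrite would go further); it assumes one level already captures most of the win for keys like these:

from itertools import groupby

def factor_prefixes(keys):
    # One-level prefix factoring: keys sharing a first character collapse
    # into prefix\(rest1\|rest2\|...\) in grep's basic regex syntax.
    parts = []
    for first, group in groupby(sorted(str(k) for k in keys), key=lambda s: s[0]):
        rests = [s[1:] for s in group]
        if len(rests) == 1:
            parts.append(first + rests[0])
        else:
            parts.append(first + "\\(" + "\\|".join(rests) + "\\)")
    return "^\\(" + "\\|".join(parts) + "\\)[[:space:]]"

# factor_prefixes([266, 801, 810]) -> ^\(266\|8\(01\|10\)\)[[:space:]]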
At this point grep is starting to show its limits. It seems to be CPU-bound: iostat reveals that each successive improvement in the regex increases the number of IO requests per second while grep is running. And re-running grep with a warmed page cache and the --mmap option doesn't speed things up (as it likely would if file IO were the bottleneck). Greater speed therefore probably requires a utility with a faster regex engine.
One such is ag (source here), whose regex implementation also performs automatic optimization, so you needn't do much hand-tuning. While I haven't been able to get grep to process the test data in less than ~12s on my machine, ag finishes in ~0.5s for all of the regex variants described above.
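If you try ag, the Python side barely changes. A rough sketch of swapping it in for grep follows; the flags and the PCRE-style pattern are assumptions on my part, so check them against your installed ag version and its actual output format:

import subprocess

# Sketch only: --nocolor/--nonumbers/--nofilename are assumed here to keep the
# output down to the bare matching lines; verify with `ag --help`.
# ag uses PCRE-style regexes, so the alternation needs no backslash escaping.
pattern = "^(" + "|".join(str(v) for v in vals) + ")[[:space:]]"
cmd = ["ag", "--nocolor", "--nonumbers", "--nofilename", pattern, self.inputFile]

newDict = {}
foundVals = []
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
for line in p.stdout:
    key, value = line.split()[:2]
    foundVals.append(key)
    newDict[key] = value
p.wait()

found = set(foundVals)
notFound = [x for x in vals if str(x) not in found]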