I have a folder of files totalling around 50GB. Each file consists of line after line of JSON data, and in this JSON structure is a field for user_id.
I need to count the number of unique user IDs across all of the files (I only need the total count). What is the most memory-efficient and relatively quick way of counting these?
Of course, loading everything into a huge list probably isn't the best option. I tried pandas, but it took quite a while. I then tried simply writing the IDs to text files, but I thought I'd check whether I was missing something far simpler.
Since it was stated that the JSON context of user_id does not matter, we just treat the JSON files as the pure text files they are.
I'd not use Python at all for this, but rather rely on the tools provided by GNU, and pipes:
cat *.json | sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p' | sort -un --parallel=4 | wc -l
cat *.json: Output the contents of all files to stdout
sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p': Look for lines containing "user_id": "{number}" and print only the number to stdout
sort -un --parallel=4: Sort the output numerically, ignoring duplicates (i.e. output only unique values), using multiple (4) jobs, and output to stdout
wc -l: Count the number of lines and output to stdout

To determine whether the values are unique, we just sort them. You can speed up the sorting by specifying a higher number of parallel jobs, depending on your core count.
If you want to use Python nonetheless, I'd recommend using a set and re (regular expressions):
import fileinput
import re
r = re.compile(r'"user_id"\s*:\s*"([0-9]+)"')
s = set()
for line in fileinput.input():
    m = r.search(line)  # find the user_id anywhere in the line
    if m:
        s.add(m.group(1))  # the set keeps only unique ids
print(len(s))
Run this using python3 <scriptname>.py *.json.
Since you only need the user_ids, load a .json (as a data structure), extract any ids, then drop all references to that structure and any of its parts so that it gets garbage collected.
To speed up the process, you can do this in a few processes in parallel; take a look at multiprocessing.Pool.map. A sketch of this idea follows.
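Here is a minimal sketch of that approach, assuming one JSON object per line and input files matching *.json; the pool size of 4 and the ids_in_file helper are illustrative assumptions, not a definitive implementation:

import glob
import json
from multiprocessing import Pool

def ids_in_file(path):
    # Parse one file line by line and return the set of user_ids found in it.
    ids = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)  # assumes one JSON object per line
            except json.JSONDecodeError:
                continue  # skip malformed lines
            user_id = record.get("user_id")
            if user_id is not None:
                ids.add(user_id)
    return ids  # only the ids survive; the parsed structures get garbage collected

if __name__ == "__main__":
    files = glob.glob("*.json")
    unique_ids = set()
    with Pool(processes=4) as pool:
        # Each worker handles whole files; only the (much smaller) id sets
        # are sent back to the parent process, which unions them.
        for id_set in pool.map(ids_in_file, files):
            unique_ids |= id_set
    print(len(unique_ids))

Only the per-file id sets travel between processes, so memory use is bounded by the number of unique ids rather than by the 50GB of raw data.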