I have a very large data file (255G; 3,192,563,934 lines). Unfortunately I only have 204G of free space on the device (and no other devices I can use). I did a random sample and found that in a given, say, 100K lines, there are about 10K unique lines... but the file isn't sorted.
Normally I would use, say:
pv myfile.data | sort | uniq > myfile.data.uniq
and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.
I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?
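For instance, something along these lines is roughly what I have in mind, using GNU split's --filter option so each chunk gets deduplicated as it is written instead of being stored raw first (an untested sketch; the 500K chunk size and the chunk./myfile.uniq names are just placeholders):

split -l 500000 --filter='sort -u > $FILE' myfile.data chunk.
sort -m -u chunk.* > myfile.uniq

The second command would merge the already-sorted chunks and drop the duplicates that appear across chunk boundaries, but I don't know if this is the sensible way to go about it.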
I thought I might be able to do something like
tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data
but I couldn't figure out a way to truncate the file properly.
Use sort -u instead of sort | uniq.

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
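For example, as a drop-in replacement for your original pipeline (assuming GNU coreutils; keep pv if you want the progress bar):

pv myfile.data | sort -u > myfile.data.uniq

Because duplicates are dropped while sort builds and merges its temporary runs, rather than only after the full sort finishes, the temporary files only need to hold data that is still unique within each run, so they should come out much smaller than the full 255G input.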