
How to get unique lines from a very large file in Linux?

I have a very large data file (255G; 3,192,563,934 lines). Unfortunately, I only have 204G of free space on the device (and no other devices I can use). I took a random sample and found that in any given 100K lines there are only about 10K unique lines... but the file isn't sorted.

Normally I would use, say:

pv myfile.data | sort | uniq > myfile.data.uniq

and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.

I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?
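
Something like this is roughly what I had in mind (an untested sketch; the chunk file names are made up, and it still needs room for the chunk files, so I'm not sure it actually solves the space problem):

# split into 500K-line chunks, deduplicate each chunk, then merge the
# already-sorted chunks while dropping duplicates across chunks (-m -u)
split -l 500000 myfile.data chunk.
for f in chunk.*; do
    sort -u "$f" > "$f.sorted" && rm "$f"
done
sort -m -u chunk.*.sorted > myfile.uniq
rm chunk.*.sorted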

I thought I might be able to do something like

tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data

but I couldn't figure out a way to truncate the file properly.
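
The closest I got was measuring the tail in bytes and shrinking the file with GNU truncate's relative size option, but I'm not confident it's safe (untested sketch):

# how many bytes the last 100000 lines occupy
tail_bytes=$(tail -n 100000 myfile.data | wc -c)
# deduplicate that chunk and append it to the output
tail -n 100000 myfile.data | sort -u >> myfile.uniq
# GNU truncate: a leading '-' reduces the file by that many bytes
truncate -s -"$tail_bytes" myfile.data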

asked Dec 18 '22 by Sir Robert


1 Answer

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
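
For example (pv is optional; the buffer-size and compression flags are GNU sort extensions that can shrink the temporary files further, which may help when space is this tight):

# -u drops duplicates during the sort itself
# -S 8G gives sort a larger in-memory buffer, so it writes fewer temp files
# --compress-program=gzip compresses the temp files it does write
pv myfile.data | sort -u -S 8G --compress-program=gzip > myfile.data.uniq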

answered Dec 31 '22 by that other guy