I have to remove duplicate strings from an extremely big text file (100 GB+).
Since in-memory duplicate removal is hopeless due to the size of the data, I have tried a Bloom filter, but it was of no use beyond something like 50 million strings.
The total number of strings is 1 trillion+.
I want to know what the ways to solve this problem are.
My initial attempt is to divide the file into a number of sub-files, sort each file, and then merge all the files together.
If you have a better solution than this, please let me know.
Thanks.
Remove duplicate lines with uniq. If you don't need to preserve the order of the lines in the file, using the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order. The uniq command ensures that sequential identical lines are reduced to one.
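For a file this size, GNU sort already does an external merge sort using temporary files on disk, so a single pipeline can handle it. As a sketch (the buffer size, temporary directory, and file names here are illustrative assumptions, not taken from the original answer):

    LC_ALL=C sort -S 4G -T /mnt/scratch input.txt | uniq > deduped.txt

sort -u input.txt > deduped.txt gives the same result in one step. Make sure the temporary directory has roughly as much free space as the input file, since the sorted runs are written there.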
The key concept you are looking for here is external sorting. You should be able to merge sort the whole file using the techniques described in that article and then run through it sequentially to remove duplicates.
If the article is not clear enough, have a look at the referenced implementations, such as this one.
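To make that concrete, here is a minimal sketch of the external sort plus sequential deduplication in Python; the chunk size, file handling, and function name are illustrative assumptions rather than part of the referenced article:

    import heapq
    import itertools
    import os
    import tempfile

    CHUNK_LINES = 5_000_000   # tune so that one chunk fits comfortably in RAM

    def dedup_external(src_path, dst_path, tmp_dir=None):
        # Phase 1: split the input into sorted runs on disk.
        run_paths = []
        with open(src_path, "r", encoding="utf-8", errors="replace") as src:
            while True:
                chunk = list(itertools.islice(src, CHUNK_LINES))
                if not chunk:
                    break
                # Normalise a possibly missing trailing newline so comparisons stay consistent.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort()
                fd, run_path = tempfile.mkstemp(dir=tmp_dir)
                with os.fdopen(fd, "w", encoding="utf-8") as run:
                    run.writelines(chunk)
                run_paths.append(run_path)

        # Phase 2: k-way merge the sorted runs; duplicates are now adjacent,
        # so keep a line only when it differs from the previous one.
        runs = [open(p, "r", encoding="utf-8") for p in run_paths]
        try:
            with open(dst_path, "w", encoding="utf-8") as dst:
                previous = None
                for line in heapq.merge(*runs):
                    if line != previous:
                        dst.write(line)
                        previous = line
        finally:
            for f in runs:
                f.close()
            for p in run_paths:
                os.remove(p)

With a trillion lines the number of runs will exceed the open-file limit, so in practice you would merge them in several passes (or raise the limit), but the structure of the algorithm stays the same.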
You can make a second file which contains records; each record is a 64-bit CRC plus the offset of the string, and the file should be indexed for fast search. Something like this:
ReadFromSourceAndSort()
{
    offset = 0;
    while(!EOF)
    {
        string = ReadFromFile();
        crc64 = crc64(string);
        if(lookUpInCache(crc64))
        {
            // this CRC was seen before -> the string is treated as a duplicate, skip it
        }
        else
        {
            WriteToCacheFile(crc64, offset);   // remember the CRC and where the string starts
            WriteToOutput(string);             // first occurrence -> keep it
        }
        offset += string.length;               // advance to the start of the next string
    }
}
How to make a good cache file? It should be sorted by CRC64 to make searching fast. So you should structure this file like a binary search tree, but with fast insertion of new items that does not move existing items in the file. To improve speed you need to use memory-mapped files.
Possible answer:
memory  = ReserveMemory(100 Mb);
mapfile = MapMemoryToFile(memory, "\\temp\\map.tmp");   // the file can be bigger; the mapping is just a window
currentWindowNumber = 0;
while(!EndOfFile)
{
    ReadFromSourceAndSort();   // but only for the first 100 Mb in memory
    currentWindowNumber++;
    MoveMapping(currentWindowNumber);
}
The lookup function should not use the mapping (because each window switch saves 100 MB to the HDD and loads the 100 MB of the next window). It just seeks within the 100 MB trees of CRC64 values, and if the CRC64 is found, the string is already stored.
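As a rough sketch of that lookup (assuming, purely for illustration, that each window is stored as fixed-size (CRC64, offset) records sorted by CRC64, rather than an actual on-disk tree), a binary search over one window file could look like this in Python:

    import struct

    RECORD = struct.Struct("<QQ")   # one record = (crc64, offset), 16 bytes, little-endian

    def crc_in_window(window_file, crc64):
        # Binary-search one sorted window file (opened in binary mode) for a CRC64 value.
        window_file.seek(0, 2)                      # jump to the end to learn the file size
        count = window_file.tell() // RECORD.size   # number of records in this window
        lo, hi = 0, count - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            window_file.seek(mid * RECORD.size)
            crc, _offset = RECORD.unpack(window_file.read(RECORD.size))
            if crc == crc64:
                return True        # CRC seen before -> treat the string as a duplicate
            if crc < crc64:
                lo = mid + 1
            else:
                hi = mid - 1
        return False

A complete lookup would run this over every window written so far, and ideally re-read the string at the stored offset to rule out CRC64 collisions before a line is discarded.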