 

Removing duplicate strings from a very big text file

I have to remove duplicate strings from an extremely big text file (100 GB+).

Since in-memory duplicate removal is hopeless due to the size of the data, I have tried a Bloom filter, but it was of no use beyond something like 50 million strings.

The total number of strings is something like 1 trillion+.

I want to know what the possible ways are to solve this problem.

My initial attempt is to divide the file into a number of sub-files, sort each file, and then merge all the files together.

If you have a better solution than this, please let me know.

Thanks.

Shivraj asked Mar 22 '12


People also ask

How do I remove duplicate lines in files?

If you don't need to preserve the order of the lines in the file, using the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order, and the uniq command ensures that sequential identical lines are reduced to one.
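As a rough illustration of the same idea, here is a minimal Python sketch that sorts the lines and then keeps only the first line of each identical run, the way uniq does. The file names are made up, and it assumes the file fits in memory, unlike the 100 GB case above:

# Sort all lines, then collapse runs of identical lines (what uniq does).
with open("input.txt") as f:
    lines = sorted(f)

with open("output.txt", "w") as out:
    previous = None
    for line in lines:
        if line != previous:      # uniq only removes sequential duplicates
            out.write(line)
        previous = line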

How do I find duplicates in a text file?

To start your duplicate search, go to File -> Find Duplicates or click the Find Duplicates button on the main toolbar. The Find Duplicates dialog will open; it is intuitive and easy to use. The first step is to specify which folders should be searched for duplicates.

How do I remove duplicate rows and keep the highest value only?

1. If you want to remove all duplicates but keep the highest values, you can apply the formula =MAX(IF($A$2:$A$12=D2,$B$2:$B$12)); remember to press Shift + Ctrl + Enter to enter it as an array formula. 2. In the above formula, A2:A12 is the original list you need to remove duplicates from.

How do I remove duplicates in BBEdit?

(TextWrangler is the free version of BBEdit. It's a great little text editor that can do all kinds of tricks. You can download it here.) Make sure 'Delete Duplicate Lines' is selected in the next dialog box, then hit 'Process' and all the duplicate lines will be removed!


2 Answers

The key concept you are looking for here is external sorting. You should be able to merge sort the whole file using the techniques described in that article and then run through it sequentially to remove duplicates.

If the article is not clear enough, have a look at the referenced implementations, such as this one.
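To make the approach concrete, here is a minimal Python sketch of an external merge sort with duplicate removal: it writes sorted chunks to temporary files and then k-way merges them, keeping only the first line of each identical run. It is not the implementation linked above; the function name, chunk size, and file handling are simplified for illustration:

import heapq
import itertools
import os
import tempfile

def external_sort_dedupe(src_path, dst_path, chunk_lines=1_000_000):
    # Phase 1: split the input into sorted chunks small enough to fit in memory.
    chunks = []
    with open(src_path) as src:
        while True:
            lines = list(itertools.islice(src, chunk_lines))
            if not lines:
                break
            lines.sort()
            tmp = tempfile.NamedTemporaryFile("w+", delete=False)
            tmp.writelines(lines)
            tmp.seek(0)
            chunks.append(tmp)

    # Phase 2: k-way merge of the sorted chunks. Identical lines now arrive
    # next to each other, so keeping the first of each run removes duplicates.
    with open(dst_path, "w") as dst:
        previous = None
        for line in heapq.merge(*chunks):
            if line != previous:
                dst.write(line)
            previous = line

    for tmp in chunks:
        tmp.close()
        os.unlink(tmp.name)

For a 100 GB input a real implementation would also cap how many chunk files are merged at once, but the overall structure stays the same.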

Slugart answered Sep 21 '22


You can make a second file which contains records; each record is a 64-bit CRC plus the offset of the string, and the file should be indexed for fast searching. Something like this:

ReadFromSourceAndSort()
{
   offset = 0;
   while(!EOF)
   {
      string = ReadFromFile();
      crc = Crc64(string);             // 64-bit CRC of the string
      if(LookUpInCache(crc))
      {
         skip;                         // already seen: drop the duplicate
      } else {
         WriteToCacheFile(crc, offset);
         WriteToOutput(string);
      }
      offset += SizeOf(string);        // advance to the next record's offset
   }
}

How do you make a good cache file? It should be sorted by CRC64 so it can be searched quickly. So you should give the file the structure of a binary search tree, but one that allows fast insertion of new items without moving the existing ones in the file. To improve speed you need to use memory-mapped files.

Possible answer:

memory = ReserveMemory(100 MB);
mapfile = MapMemoryToFile(memory, "\\temp\\map.tmp");   // the file can be bigger; the mapping is just a window
currentWindowNumber = 0;

while(!EndOfFile)
{
  ReadFromSourceAndSort();   // but only for the first 100 MB held in memory
  currentWindowNumber++;
  MoveMapping(currentWindowNumber);
}

And the lookup function should not use the mapping (because each window switch writes 100 MB to the HDD and loads the next 100 MB window). It just seeks within the 100 MB trees of CRC64 values, and if the CRC64 is found, the string is already stored.
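As a small illustration of the hashing side of this approach, here is a Python sketch. The standard library has no CRC64, so it uses blake2b truncated to 8 bytes as a stand-in 64-bit fingerprint, and it keeps the seen fingerprints in an in-memory set rather than in the memory-mapped cache file described above; the file names are invented:

import hashlib

def fingerprint(line):
    # 8-byte (64-bit) digest standing in for CRC64, which the stdlib lacks.
    return hashlib.blake2b(line, digest_size=8).digest()

def dedupe_by_hash(src_path, dst_path):
    seen = set()   # stands in for the sorted, memory-mapped cache file
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            key = fingerprint(line)
            if key in seen:
                continue          # fingerprint already cached: treat as duplicate
            seen.add(key)
            dst.write(line)

Note that a bare 64-bit hash match is treated as a duplicate here; the offset stored in the cache records is what would let a real implementation go back to the source file and verify the actual string on a collision.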

Alexus answered Sep 20 '22