I have to remove duplicate strings from an extremely big text file (100 GB+).
Since in-memory duplicate removal is hopeless due to the size of the data, I have tried a Bloom filter, but it was of no use beyond something like 50 million strings.
The total number of strings is 1 trillion+.
I want to know what the ways to solve this problem are.
My initial attempt is to divide the file into a number of sub-files, sort each file, and then merge all the files together.
If you have a better solution than this, please let me know.
Thanks.
Remove duplicate lines with uniq. If you don't need to preserve the order of the lines in the file, using the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order. The uniq command ensures that sequential identical lines are reduced to one.
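For a file this size, GNU sort already does an external merge sort using temporary files on disk, so a single pipeline can handle it. As a sketch (the buffer size, temporary directory, and file names here are illustrative assumptions, not taken from the original answer):

    LC_ALL=C sort -S 4G -T /mnt/scratch input.txt | uniq > deduped.txt

sort -u input.txt > deduped.txt gives the same result in one step. Make sure the temporary directory has roughly as much free space as the input file, since the sorted runs are written there.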
The key concept you are looking for here is external sorting. You should be able to merge sort the whole file using the techniques described in that article and then run through it sequentially to remove duplicates.
If the article is not clear enough, have a look at the referenced implementations, such as this one.
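To make that concrete, here is a minimal sketch of the external sort plus sequential deduplication in Python; the chunk size, file handling, and function name are illustrative assumptions rather than part of the referenced article:

    import heapq
    import itertools
    import os
    import tempfile

    CHUNK_LINES = 5_000_000   # tune so that one chunk fits comfortably in RAM

    def dedup_external(src_path, dst_path, tmp_dir=None):
        # Phase 1: split the input into sorted runs on disk.
        run_paths = []
        with open(src_path, "r", encoding="utf-8", errors="replace") as src:
            while True:
                chunk = list(itertools.islice(src, CHUNK_LINES))
                if not chunk:
                    break
                # Normalise a possibly missing trailing newline so comparisons stay consistent.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort()
                fd, run_path = tempfile.mkstemp(dir=tmp_dir)
                with os.fdopen(fd, "w", encoding="utf-8") as run:
                    run.writelines(chunk)
                run_paths.append(run_path)

        # Phase 2: k-way merge the sorted runs; duplicates are now adjacent,
        # so keep a line only when it differs from the previous one.
        runs = [open(p, "r", encoding="utf-8") for p in run_paths]
        try:
            with open(dst_path, "w", encoding="utf-8") as dst:
                previous = None
                for line in heapq.merge(*runs):
                    if line != previous:
                        dst.write(line)
                        previous = line
        finally:
            for f in runs:
                f.close()
            for p in run_paths:
                os.remove(p)

With a trillion lines the number of runs will exceed the open-file limit, so in practice you would merge them in several passes (or raise the limit), but the structure of the algorithm stays the same.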
You can make a second file which contains records; each record is a 64-bit CRC plus the offset of the string, and the file should be indexed for fast search. Something like this:
ReadFromSourceAndSort()
{
    offset = 0;
    while(!EOF)
    {
        string = ReadFromFile();
        crc64 = crc64(string);
        if(lookUpInCache(crc64))
        {
            // this CRC was seen before -> the string is treated as a duplicate, skip it
        }
        else
        {
            WriteToCacheFile(crc64, offset);   // remember the CRC and where the string starts
            WriteToOutput(string);             // first occurrence -> keep it
        }
        offset += string.length;               // advance to the start of the next string
    }
}
How to make a good cache file? It should be sorted by CRC64 to make searching fast. So you should structure this file like a binary search tree, but with fast insertion of new items that does not move existing items in the file. To improve speed you need to use memory-mapped files.
Possible answer:
memory  = ReserveMemory(100 Mb);
mapfile = MapMemoryToFile(memory, "\\temp\\map.tmp");   // the file can be bigger; the mapping is just a window
currentWindowNumber = 0;
while(!EndOfFile)
{
    ReadFromSourceAndSort();   // but only for the first 100 Mb in memory
    currentWindowNumber++;
    MoveMapping(currentWindowNumber);
}
The lookup function should not use the mapping (because each window switch saves 100 MB to the HDD and loads the 100 MB of the next window). It just seeks within the 100 MB trees of CRC64 values, and if the CRC64 is found, the string is already stored.
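As a rough sketch of that lookup (assuming, purely for illustration, that each window is stored as fixed-size (CRC64, offset) records sorted by CRC64, rather than an actual on-disk tree), a binary search over one window file could look like this in Python:

    import struct

    RECORD = struct.Struct("<QQ")   # one record = (crc64, offset), 16 bytes, little-endian

    def crc_in_window(window_file, crc64):
        # Binary-search one sorted window file (opened in binary mode) for a CRC64 value.
        window_file.seek(0, 2)                      # jump to the end to learn the file size
        count = window_file.tell() // RECORD.size   # number of records in this window
        lo, hi = 0, count - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            window_file.seek(mid * RECORD.size)
            crc, _offset = RECORD.unpack(window_file.read(RECORD.size))
            if crc == crc64:
                return True        # CRC seen before -> treat the string as a duplicate
            if crc < crc64:
                lo = mid + 1
            else:
                hi = mid - 1
        return False

A complete lookup would run this over every window written so far, and ideally re-read the string at the stored offset to rule out CRC64 collisions before a line is discarded.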