 

Memory efficient way to remove duplicate lines in a text file using C++

Tags: c++, file, io, file-io

What is the most memory efficient way to remove duplicate lines in a large text file using C++?

Let me clarify: I'm not asking for code, just the best method. The duplicate lines are not guaranteed to be adjacent. I realize that an approach optimized for minimal memory usage will be slower, but that is my constraint, as the files are far too large.

asked Mar 18 '10 by codinggoose

2 Answers

I would hash each line, then seek back to the lines whose hashes are not unique and compare them individually (or in a buffered fashion). This works well on files with a relatively low occurrence of duplicates.

When you use a hash, you can cap the memory used at a constant amount (i.e., you could have a tiny hash table with just 256 slots, or something larger; either way, the table itself can be restricted to any constant size). The values in the table are the offsets of the lines with that hash, so you only need line_count * sizeof(int) plus a constant to maintain the hash table.

Even simpler (but much slower) would be to scan the entire file for each line; that is the most memory-efficient option possible, since you would only need to store two offsets and two bytes to do the comparison. But I prefer the first option.
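
A minimal sketch of the approach described in this answer (not the answerer's own code), assuming a fixed bucket count and illustrative names; colliding lines are re-read from disk so that at most two lines are held in memory at a time:

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: dedup <file>\n"; return 1; }

    const std::size_t kBuckets = 256;  // constant; tune to taste
    std::vector<std::vector<std::streamoff>> buckets(kBuckets);

    std::ifstream in(argv[1]);
    if (!in) { std::cerr << "cannot open " << argv[1] << '\n'; return 1; }

    // Pass 1: record the starting offset of every line, grouped by line hash.
    std::string line;
    std::streamoff offset = in.tellg();
    while (std::getline(in, line)) {
        std::size_t h = std::hash<std::string>{}(line) % kBuckets;
        buckets[h].push_back(offset);
        offset = in.tellg();
    }
    in.clear();  // clear EOF so we can seek back

    // Pass 2: only lines whose hashes collide can be duplicates, so compare
    // those by re-reading them from the file.
    std::vector<std::streamoff> unique_offsets;
    for (const auto& bucket : buckets) {
        std::vector<std::streamoff> kept;  // offsets of distinct lines seen so far
        for (std::streamoff cand : bucket) {
            in.seekg(cand);
            std::getline(in, line);
            bool duplicate = false;
            std::string other;
            for (std::streamoff prev : kept) {
                in.seekg(prev);
                std::getline(in, other);
                if (other == line) { duplicate = true; break; }
            }
            if (!duplicate) kept.push_back(cand);
        }
        unique_offsets.insert(unique_offsets.end(), kept.begin(), kept.end());
    }

    // Emit the surviving lines in their original order.
    std::sort(unique_offsets.begin(), unique_offsets.end());
    for (std::streamoff off : unique_offsets) {
        in.seekg(off);
        std::getline(in, line);
        std::cout << line << '\n';
    }
    return 0;
}
```

Beyond the per-line offsets, memory use here is bounded by the bucket table plus the two lines currently being compared, at the cost of extra seeks whenever hashes collide.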

answered Oct 17 '22 by user262976


To minimize memory usage:

If you have unlimited (or very fast) disk I/O, you could write each line to its own file, with the filename being the hash plus some identifier indicating order (or no identifier, if order is irrelevant). In this manner, you use the file system as extended memory. This should be much faster than re-scanning the whole file for each line.

In addition to what the other answer says, if you expect a high duplicate rate, you could keep some of the hashes in memory (up to a threshold) as well as in files; this gives much better results when the duplicate rate is high. Since the file is so large, I doubt O(n^2) processing time is acceptable. This approach is O(n) in processing time and O(1) in memory; it does, however, use O(n) disk space during the run, which the other solutions do not.
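
A rough sketch of the filesystem-as-extended-memory idea, written as a bucketed variant: lines are appended to one scratch file per hash bucket rather than one file per line. The bucket count, scratch directory, and output filename are illustrative assumptions, it requires C++17 `<filesystem>`, it does not preserve the original line order (the "order is irrelevant" case above), and the in-memory hash threshold refinement is omitted for brevity:

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

namespace fs = std::filesystem;

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: dedup <file>\n"; return 1; }

    const std::size_t kBuckets = 1024;  // tunable; more buckets -> smaller bucket files
    const fs::path scratch = fs::temp_directory_path() / "dedup_buckets";
    fs::create_directories(scratch);

    // Pass 1: append each line to a scratch file chosen by the line's hash.
    // (Opening the bucket file per line keeps the sketch short; a real
    // implementation would keep a small pool of open streams.)
    {
        std::ifstream in(argv[1]);
        if (!in) { std::cerr << "cannot open " << argv[1] << '\n'; return 1; }
        std::string line;
        while (std::getline(in, line)) {
            std::size_t h = std::hash<std::string>{}(line) % kBuckets;
            std::ofstream bucket(scratch / std::to_string(h), std::ios::app);
            bucket << line << '\n';
        }
    }

    // Pass 2: identical lines always land in the same bucket, and each bucket
    // is small, so an in-memory set per bucket stays cheap.
    std::ofstream out("deduped.txt");
    for (std::size_t h = 0; h < kBuckets; ++h) {
        const fs::path p = scratch / std::to_string(h);
        if (!fs::exists(p)) continue;
        std::ifstream bucket(p);
        std::unordered_set<std::string> seen;
        std::string line;
        while (std::getline(bucket, line))
            if (seen.insert(line).second)  // keep only the first occurrence
                out << line << '\n';
    }
    fs::remove_all(scratch);  // drop the scratch space when done
    return 0;
}
```

The scratch files are the O(n) disk cost mentioned above; in exchange, each bucket can be de-duplicated with a small, bounded amount of memory.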

It sounds like you might be running on limited hardware with varied specs, so you'll want to test a number of duplicate-removal algorithms and profile them before deciding which is best for a long-term implementation.

answered Oct 17 '22 by Stefan Kendall