Find duplicate records in large text file

I'm on a Linux machine (Red Hat) and I have an 11 GB text file. Each line in the file contains the data for a single record, and the first n characters of the line contain a unique identifier for that record. The file contains a little over 27 million records.

I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file so any solution that requires loading the entire file into memory would not be practical.

asked May 02 '13 by Justin Kredible



2 Answers

Read the file line-by-line, so you don't have to load it all into memory.

For each line (record) create a SHA-256 hash (32 bytes), unless your identifier is shorter, in which case you can use the identifier itself.

Store the hashes/identifiers in a numpy.array. That is probably the most compact way to store them. 27 million records times 32 bytes/hash is 864 MB. That should fit into the memory of a decent machine these days.
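
For illustration, a minimal sketch of this numpy variant (assuming the input file is called bigdata.txt, as in the code further below, and that one 32-byte digest per record fits in RAM):

import hashlib
import numpy as np

digests = bytearray()
with open('bigdata.txt', 'rb') as datafile:
    for line in datafile:
        digests += hashlib.sha256(line).digest()

# One row of 32 bytes per record; np.unique(axis=0) (NumPy >= 1.13)
# reports how often each distinct digest occurs.
arr = np.frombuffer(digests, dtype=np.uint8).reshape(-1, 32)
uniq, counts = np.unique(arr, axis=0, return_counts=True)
print((counts > 1).sum(), "duplicated identifier(s) found")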

To speed up access you could use e.g. the first 2 bytes of the hash as the key of a collections.defaultdict and put the rest of each hash in a list as the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain a list of around 400 entries on average. It would mean faster searching than a numpy array, but it would use more memory.

import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:  # read as bytes; hashlib needs bytes
    for line in datafile:
        digest = hashlib.sha256(line).digest()
        # Or, if the identifier itself is short enough: digest = line[:n]
        k = digest[:2]   # first 2 bytes pick one of 65536 buckets
        v = digest[2:]   # remaining 30 bytes go into that bucket's list
        if v in d[k]:
            print("double found:", digest.hex())
        else:
            d[k].append(v)
answered by Roland Smith


Right tool for the job: put your records into a database. Unless you already have a Postgres or MySQL installation handy, I'd use SQLite.

$ sqlite3 uniqueness.sqlite
create table chk (
  ident char(n), -- n as in first n characters
  lineno integer -- for convenience
);
^D

Then I'd insert the unique identifier and line number into that table, possibly using a Python script like this:

import sqlite3  # part of the Python standard library

n = ...  # how many chars are in the key part
lineno = 0

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
with open("giant-file") as infile:
  for line in infile:
    lineno += 1
    ident = line[:n]
    cur.execute("insert into chk(ident, lineno) values(?, ?)", [ident, lineno])
conn.commit()  # commit, or the inserts are rolled back when the connection closes
cur.close()
conn.close()

After this, you can index the table and use SQL:

$ sqlite3 uniqueness.sqlite
create index x_ident on chk(ident); -- may take a bit of time

-- quickly find duplicates, if any
select ident, count(ident) as how_many
from chk
group by ident
having count(ident) > 1;

-- find lines of specific violations, if needed
select lineno 
from chk
where ident = ...; -- insert a duplicate ident

Yes, I tried most of this code, it should work :)
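
For convenience, a minimal sketch of running the same duplicate query from Python (assuming the uniqueness.sqlite database created above):

import sqlite3

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
cur.execute("""
    select ident, count(ident) as how_many
    from chk
    group by ident
    having count(ident) > 1
""")
for ident, how_many in cur.fetchall():
    print(ident, "occurs", how_many, "times")
cur.close()
conn.close()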

answered by 9000