 

Faster way to remove duplicates from a very large text file in Python?

I have a very large text file with duplicate entries which I want to eliminate. I do not care about the order of the entries because the file will later be sorted.

Here is what I have so far:

unique_lines = set()
outfile = open("UniqueMasterList.txt", "w", encoding = "latin-1")

with open("MasterList.txt", "r", encoding = "latin-1") as infile:
    for line in infile:
        if line not in unique_lines:
            outfile.write(line)
            unique_lines.add(line)

outfile.close()

It has been running for 30 minutes and has not finished. I need it to be faster. What is a faster approach in Python?


2 Answers

Look for the corresponding system command. In Linux/UNIX, you would use

uniq MasterList.txt > UniqueMasterList.txt

The OS generally knows the best way to do these things.


Edit, after comments:

@Mark Ransom reminded me that uniq only removes duplicates that are adjacent in the file. The simplest way to make matching lines adjacent is to sort the file first:

sort MasterList.txt | uniq > UniqueMasterList.txt
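
If you want to stay inside a Python script, you can still hand the work to the system tools. Here is a minimal sketch, assuming the filenames from the question and that a standard Linux/UNIX sort is on the PATH; sort -u folds the uniq step into sort itself:

import subprocess

# Sketch: shell out to the system sort, which dedupes with -u.
# Assumes MasterList.txt / UniqueMasterList.txt as in the question.
with open("UniqueMasterList.txt", "w") as outfile:
    subprocess.run(["sort", "-u", "MasterList.txt"],
                   stdout=outfile, check=True)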
— Prune

To use the same technique as uniq, in Python:

import itertools

with open("MasterList.txt", "r", encoding="latin-1") as infile:
    sorted_file = sorted(infile)  # read every line and sort them in memory

with open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    # groupby collapses each run of equal (now adjacent) lines to one item
    for line, _ in itertools.groupby(sorted_file):
        outfile.write(line)

This presumes that the entire file will fit into memory, roughly twice over. Alternatively, if the file is already sorted, you can skip the sort step.
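
If the file is already sorted on disk, the same uniq-style idea works without loading anything into memory. A minimal sketch (not part of the answer above, just the adjacent-duplicates technique applied line by line):

# Sketch: assumes MasterList.txt is already sorted, so duplicates are adjacent.
# Compare each line only to the previous one; memory use stays constant.
with open("MasterList.txt", "r", encoding="latin-1") as infile, \
     open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    previous = None
    for line in infile:
        if line != previous:
            outfile.write(line)
            previous = line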

— Mark Ransom

