
How might I remove duplicate lines from a file?



On Unix/Linux, use the uniq command, as per David Locke's answer, or sort, as per William Pursell's comment.

If you need a Python script:

lines_seen = set() # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen: # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

Update: The sort/uniq combination will remove duplicates but return a file with the lines sorted, which may or may not be what you want. The Python script above won't reorder lines, but just drop duplicates. Of course, to get the script above to sort as well, just leave out the outfile.write(line) and instead, immediately after the loop, do outfile.writelines(sorted(lines_seen)).
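For reference, here is a minimal sketch of that sorted variant (same assumptions as above: infilename and outfilename hold the file paths):

lines_seen = set()  # holds every distinct line
for line in open(infilename, "r"):
    lines_seen.add(line)
outfile = open(outfilename, "w")
outfile.writelines(sorted(lines_seen))  # write the unique lines in sorted order
outfile.close()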


If you're on *nix, try running the following command:

sort <file name> | uniq
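Note that uniq on its own only collapses runs of adjacent identical lines, which is why the input is sorted first; sort -u <file name> does both steps in a single command.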

uniqlines = set(open('/tmp/foo').readlines())

This gives you the set of unique lines (readlines() keeps each line's trailing newline, so they can be written back verbatim).

Writing that back to some file is as easy as:

bar = open('/tmp/bar', 'w')
bar.writelines(uniqlines)
bar.close()

You can do:

import os
os.system("awk '!x[$0]++' /path/to/file > /path/to/rem-dups")

Here you are shelling out to awk from Python :)
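If you'd rather not go through a shell, here is a sketch of the same awk call via the standard-library subprocess module (assuming Python 3.5+ for subprocess.run; the paths are placeholders):

import subprocess

# awk '!x[$0]++' prints each line only the first time it is seen,
# so duplicates are dropped while the original order is preserved
with open('/path/to/rem-dups', 'w') as out:
    subprocess.run(['awk', '!x[$0]++', '/path/to/file'], stdout=out, check=True)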

There is also another way:

with open('/tmp/result.txt') as result:
    uniqlines = set(result.readlines())  # note: a set does not preserve line order
    with open('/tmp/rmdup.txt', 'w') as rmdup:
        rmdup.writelines(uniqlines)

Get all your lines into a list, make a set of those lines, and you are done. For example,

>>> x = ["line1","line2","line3","line2","line1"]
>>> list(set(x))
['line3', 'line2', 'line1']
>>>

If you need to preserve the ordering of lines (a set is an unordered collection), try this:

y = []
for l in x:
    if l not in y:
        y.append(l)

and write the content back to the file.
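On Python 3.7+, where plain dicts preserve insertion order, dict.fromkeys gives the same order-preserving deduplication more compactly; a minimal sketch, with infilename and outfilename as placeholder paths:

with open(infilename) as infile:
    unique = dict.fromkeys(infile)   # keeps the first occurrence of each line, in order
with open(outfilename, 'w') as outfile:
    outfile.writelines(unique)       # iterating a dict yields its keys, i.e. the lines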


It's a rehash of what's already been said here; here's what I use.

import optparse
import sys

def removeDups(inputfile, outputfile):
    with open(inputfile, 'r') as infile:
        lines_set = set(infile.readlines())
    with open(outputfile, 'w') as out:
        for line in lines_set:
            out.write(line)

def main():
    parser = optparse.OptionParser('usage %prog ' +
                                   '-i <inputfile> -o <outputfile>')
    parser.add_option('-i', dest='inputfile', type='string',
                      help='specify your input file')
    parser.add_option('-o', dest='outputfile', type='string',
                      help='specify your output file')
    (options, args) = parser.parse_args()
    inputfile = options.inputfile
    outputfile = options.outputfile
    if (inputfile is None) or (outputfile is None):
        print(parser.usage)
        sys.exit(1)
    else:
        removeDups(inputfile, outputfile)

if __name__ == '__main__':
    main()
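Assuming the script is saved as removedups.py (a name picked just for illustration), you would run it as python removedups.py -i input.txt -o output.txt. As with the other set-based answers, the order of the output lines is arbitrary.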

A Python one-liner (note that, like sort | uniq, it returns the lines sorted):

python -c "import sys; sys.stdout.write(''.join(sorted(set(sys.stdin.readlines()))))" < InputFile > OutputFile