Removing duplicate rows from a CSV file using a Python script

I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my phone created them.

I want to get rid of the duplicates.

Technical specification:

Windows XP SP 3
Python 2.7
CSV file with 400 contacts
asked Apr 01 '13 by IcyFlame




3 Answers

UPDATE: 2016

If you are happy to use the helpful more_itertools external library:

from more_itertools import unique_everseen

# unique_everseen yields lines in their original order,
# skipping any line that has been seen before
with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
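
If your duplicates differ only in letter case or trailing whitespace, unique_everseen also accepts a key callable. A minimal sketch (the normalisation applied here is my assumption, not part of the original answer):

from more_itertools import unique_everseen

# Treat lines as duplicates when they match after lowercasing and
# stripping trailing whitespace (assumed normalisation)
with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f, key=lambda line: line.rstrip().lower()))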

A more efficient version of @IcyFlame's solution:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set gives O(1) amortized membership checks
    for line in in_file:
        if line in seen:
            continue  # skip duplicate lines
        seen.add(line)
        out_file.write(line)

To edit the same file in place, you can use this (old Python 2 code):

import fileinput

seen = set()  # set gives O(1) amortized membership checks
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen:
        continue  # skip duplicate lines
    seen.add(line)
    print line,  # standard output is now redirected into the file
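
For Python 3, a minimal port of the same in-place approach (this version is mine, not part of the original answer):

import fileinput

seen = set()  # set gives O(1) amortized membership checks
for line in fileinput.input('1.csv', inplace=True):
    if line in seen:
        continue  # skip duplicate lines
    seen.add(line)
    # while inplace=True, fileinput redirects stdout into the file
    print(line, end='')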
answered Oct 16 '22 by jamylak


You can efficiently remove duplicates using Pandas, which can be installed with pip or comes preinstalled with the Anaconda distribution of Python.

See pandas.DataFrame.drop_duplicates

pip install pandas

The code:

import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

df = pd.read_csv(file_name, sep=",")  # use sep="\t" instead for a tab-separated file

# Notes:
# - `subset=None` means that every column is used
#   to determine if two rows are duplicates; to change that,
#   pass the relevant column names as a list
# - `inplace=True` means that the DataFrame is modified in place and
#   the duplicate rows are dropped
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
df.to_csv(file_name_output, index=False)
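
For example, to drop contacts that share an e-mail address while keeping the first occurrence (the column name "Email" is hypothetical; substitute the header your export actually uses):

# "Email" is a hypothetical column name for illustration
df.drop_duplicates(subset=["Email"], keep="first", inplace=True)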

If you run into encoding issues, pass encoding=... to read_csv with the appropriate name from Python's Standard Encodings.
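
For instance, a Latin-1 encoded export could be read like this (the encoding shown is an assumption; match it to your file):

# encoding="latin-1" is an assumed example; pick the one your file uses
df = pd.read_csv(file_name, sep=",", encoding="latin-1")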

See Import CSV file as a pandas DataFrame for more details about pd.read_csv

answered Oct 16 '22 by Andrei Sura


You can use the following script:

Pre-conditions:

  1. 1.csv is the file that contains the duplicates
  2. 2.csv is the output file that will be devoid of the duplicates once this script is executed.

Code:

inFile = open('1.csv', 'r')
outFile = open('2.csv', 'w')

listLines = []  # lines that have already been written

for line in inFile:
    if line in listLines:
        continue  # duplicate: skip it
    outFile.write(line)
    listLines.append(line)

outFile.close()
inFile.close()

Algorithm Explanation

Here is what the script does:

  1. Open a file in read mode. This is the file that has the duplicates.
  2. Then, in a loop that runs until the file is exhausted, check whether the line has already been encountered.
  3. If it has been encountered, don't write it to the output file.
  4. If not, write it to the output file and add it to the list of records that have already been encountered. (A variant that compares parsed CSV rows instead of raw lines is sketched below.)
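
This script compares raw text lines, so two rows that differ only in quoting or trailing whitespace count as different. A minimal Python 3 sketch using the csv module to compare parsed fields instead (this variant is mine, not part of the original answer):

import csv

# Python 3: open CSV files with newline='' as the csv module recommends
with open('1.csv', 'r', newline='') as in_file, \
     open('2.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    seen = set()  # parsed rows that have already been written
    for row in csv.reader(in_file):
        key = tuple(row)  # lists aren't hashable, so convert to a tuple
        if key in seen:
            continue  # duplicate row: skip it
        seen.add(key)
        writer.writerow(row)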
answered Oct 16 '22 by IcyFlame