I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my phone created them.
I want to get rid of the duplicates.
Technical specification:
Windows XP SP3
Python 2.7
CSV file with 400 contacts
Use DataFrame.drop_duplicates() to drop duplicate rows and keep the first occurrence. Calling DataFrame.drop_duplicates() without any arguments drops rows whose values match on all columns.
Method 1: Read the CSV file into a DataFrame. Then identify the duplicate rows using the duplicated() function. Finally, print the duplicate rows to inspect them.
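A minimal sketch of that method. The column names below are hypothetical (the question only says the file holds contacts), and an in-memory buffer stands in for the real file so the example is self-contained:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv('1.csv'); column names are made up for illustration
csv_data = io.StringIO(
    "name,email\n"
    "Alice,alice@example.com\n"
    "Bob,bob@example.com\n"
    "Alice,alice@example.com\n"
)
df = pd.read_csv(csv_data)

# duplicated() marks every repeat of an earlier row as True
print(df[df.duplicated()])
```

Only the second Alice row is flagged; the first occurrence of each row stays False.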
UPDATE: 2016
If you are happy to use the helpful more_itertools external library:
from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
A more efficient version of @IcyFlame's solution
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
To edit the same file in-place you could use this (old Python 2 code):

import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print line,  # standard output is now redirected to the file
You can efficiently remove duplicates using Pandas, which can be installed with pip, or comes installed with the Anaconda distribution of Python.
See pandas.DataFrame.drop_duplicates
pip install pandas
The code
import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"
# Use the file's actual delimiter: "," for comma-separated, "\t" for tab-separated
df = pd.read_csv(file_name, sep=",")
# Notes:
# - the `subset=None` means that every column is used
# to determine if two rows are different; to change that specify
# the columns as an array
# - the `inplace=True` means that the data structure is changed and
# the duplicate rows are gone
df.drop_duplicates(subset=None, inplace=True)
# Write the results to a different file
df.to_csv(file_name_output, index=False)
For encoding issues, set encoding=... with the appropriate codec from Python's Standard Encodings.
See Import CSV file as a pandas DataFrame for more details about pd.read_csv
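If the contacts should be deduplicated on one column rather than on whole rows, drop_duplicates accepts a subset argument. The 'email' column below is a hypothetical name for illustration, and an in-memory buffer stands in for the real file:

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "name,email\n"
    "Alice,alice@example.com\n"
    "A. Smith,alice@example.com\n"  # same address, different name
    "Bob,bob@example.com\n"
)
df = pd.read_csv(csv_data)

# Keep only the first row seen for each email address
deduped = df.drop_duplicates(subset=["email"], keep="first")
print(deduped)
```

Only the Alice and Bob rows remain; the second row with the duplicate address is dropped.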
You can use the following script:
Pre-condition:
1.csv is the file that contains the duplicates.
2.csv is the output file that will be devoid of the duplicates once this script is executed.
Code:
inFile = open('1.csv', 'r')
outFile = open('2.csv', 'w')
listLines = []

for line in inFile:
    if line in listLines:
        continue
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()
Algorithm Explanation
Here, what I am doing is: reading each line of the input file and checking whether it already appears in listLines. If it does, I skip it; otherwise I write it to the output file and append it to the list so any later copies are skipped. Note that membership lookup in a list is O(n) per line, so for larger files the set-based version above will be noticeably faster.