I have a CSV file that is too big to load into memory. I need to drop the duplicated rows of the file, so I tried this approach:
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
for chunk in chunker:
    chunk.drop_duplicates(['Author ID'])
But if duplicated rows are distributed across different chunks, it seems the script above can't produce the expected result.
Is there a better way?
You could try something like this.
First, create your chunker.
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
Now create a set of ids:
ids = set()
Now iterate over the chunks:
for chunk in chunker:
    chunk = chunk.drop_duplicates(['Author ID'])
Then, still within the body of the loop, also drop the ids that are already in the set of known ids:
    chunk = chunk[~chunk['Author ID'].isin(ids)]
Finally, still within the body of the loop, add the new ids to the set:
    ids.update(chunk['Author ID'].values)
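Putting those pieces together, here is a minimal sketch of the whole loop. It assumes AUTHORS_PATH points at the tab-separated input, and it writes the surviving rows to OUTPUT_PATH, a hypothetical output file that the steps above don't mention:

import pandas as pd

OUTPUT_PATH = 'authors_deduped.tsv'  # hypothetical destination for the unique rows

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

ids = set()
first = True
for chunk in chunker:
    # Drop duplicates within the chunk, then drop ids seen in earlier chunks.
    chunk = chunk.drop_duplicates(['Author ID'])
    chunk = chunk[~chunk['Author ID'].isin(ids)]
    ids.update(chunk['Author ID'].values)
    # Append the surviving rows; write the header only for the first chunk.
    chunk.to_csv(OUTPUT_PATH, sep='\t', index=False,
                 mode='w' if first else 'a', header=first)
    first = False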
If ids is too large to fit into main memory, you might need to use some disk-based database.
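As a rough illustration of that disk-based route, here is a sketch that keeps the seen ids in a SQLite table instead of an in-memory set. The database file name and table name are hypothetical, and the row-by-row lookups are only meant to show the idea, not to be fast:

import sqlite3
import pandas as pd

conn = sqlite3.connect('seen_ids.db')  # hypothetical on-disk store for seen ids
conn.execute('CREATE TABLE IF NOT EXISTS seen (author_id TEXT PRIMARY KEY)')

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    chunk = chunk.drop_duplicates(['Author ID'])
    keep = []
    for author_id in chunk['Author ID'].astype(str):
        # Keep the row only if this id has not been recorded yet.
        is_new = conn.execute('SELECT 1 FROM seen WHERE author_id = ?',
                              (author_id,)).fetchone() is None
        keep.append(is_new)
        if is_new:
            conn.execute('INSERT INTO seen VALUES (?)', (author_id,))
    chunk = chunk[keep]
    conn.commit()
    # ... write `chunk` out as in the sketch above ...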