 

How to drop duplicated rows using pandas in a big data file?

I have a csv file that is too big to load into memory. I need to drop the duplicated rows from the file, so I am doing it this way:

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    chunk.drop_duplicates(['Author ID'])

But if duplicated rows are spread across different chunks, the script above does not seem to give the expected result.

Is there any better way?

You Gakukou asked Sep 07 '16


People also ask

How do I get rid of duplicate rows in Pandas?

You can set keep=False in the drop_duplicates() function to remove all duplicate rows, including the first occurrence. For example: df.drop_duplicates(keep=False).
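A minimal sketch of keep=False (the tiny DataFrame is just illustrative):

import pandas as pd

df = pd.DataFrame({'Author ID': [1, 1, 2], 'Author name': ['a', 'a', 'b']})

# keep=False drops every duplicated row, including its first occurrence,
# so only the row with Author ID 2 survives here
print(df.drop_duplicates(keep=False))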

How do I remove duplicate rows from a dataset in Python?

The pandas drop_duplicates() method removes duplicate rows from a DataFrame in Python.
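By default the method keeps the first occurrence of each duplicated row; a small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Author ID': [1, 1, 2], 'Author name': ['a', 'a', 'b']})

# The default is keep='first', so one copy of each duplicated row remains
print(df.drop_duplicates())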

How do I drop duplicates in Pandas DataFrame?

The drop_duplicates() method removes duplicate rows. Use the subset parameter if only some specified columns should be considered when looking for duplicates.
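A small sketch of subset (the column names and data are just illustrative):

import pandas as pd

df = pd.DataFrame({'Author ID': [1, 1, 2],
                   'Author name': ['Alice', 'Alice B.', 'Bob']})

# Only 'Author ID' is compared, so the second row is dropped
# even though its 'Author name' differs
print(df.drop_duplicates(subset=['Author ID']))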

How do I delete multiple rows in Pandas DataFrame?

To delete rows and columns from a DataFrame, pandas uses the drop() function. To delete a column, or multiple columns, pass the name of the column(s) and specify axis=1. Alternatively, the columns parameter cuts out the need for axis, as in the sketch below.
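A short sketch of both spellings (made-up data):

import pandas as pd

df = pd.DataFrame({'Author ID': [1, 2], 'Author name': ['Alice', 'Bob']})

# Drop a column by name with axis=1 ...
print(df.drop('Author name', axis=1))

# ... or with the columns= parameter, which avoids specifying axis
print(df.drop(columns=['Author name']))

# Rows are dropped by index label; axis=0 is the default
print(df.drop([0]))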


1 Answer

You could try something like this.

First, create your chunker.

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

Now create a set of ids:

ids = set()

Now iterate over the chunks:

for chunk in chunker:
    # assign the result back; drop_duplicates does not modify the chunk in place
    chunk = chunk.drop_duplicates(['Author ID'])

Now, still within the body of the loop, also drop the rows whose ids are already in the set of known ids:

    chunk = chunk[~chunk['Author ID'].isin(ids)]

Finally, still within the body of the loop, add the newly seen ids:

    ids.update(chunk['Author ID'].values)
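Putting the steps together, a minimal sketch might look like this (the OUTPUT_PATH placeholder and the appending to_csv calls are assumptions for illustration, not part of the original answer):

import pandas as pd

AUTHORS_PATH = 'authors.csv'            # placeholder input path
OUTPUT_PATH = 'authors_deduped.csv'     # placeholder output path (an assumption)

ids = set()
first_chunk = True

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    # drop duplicates within the chunk (drop_duplicates returns a new frame)
    chunk = chunk.drop_duplicates(['Author ID'])
    # drop rows whose ids were already seen in earlier chunks
    chunk = chunk[~chunk['Author ID'].isin(ids)]
    # remember the ids kept from this chunk
    ids.update(chunk['Author ID'].values)
    # append the deduplicated chunk to the output file
    chunk.to_csv(OUTPUT_PATH, mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False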

If ids is too large to fit into main memory, you might need to use some disk-based database.
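One possibility in that case (just a sketch, not part of the original answer) is to keep the seen ids in an on-disk SQLite table with a primary key and query it instead of the in-memory set:

import sqlite3

# on-disk store of already-seen Author IDs (the file name is a placeholder)
conn = sqlite3.connect('seen_ids.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)')

def filter_unseen(chunk):
    # keep only rows whose 'Author ID' has not been recorded before
    mask = []
    for author_id in chunk['Author ID']:
        cur = conn.execute('SELECT 1 FROM seen WHERE id = ?', (str(author_id),))
        if cur.fetchone() is None:
            conn.execute('INSERT INTO seen (id) VALUES (?)', (str(author_id),))
            mask.append(True)
        else:
            mask.append(False)
    conn.commit()
    return chunk[mask]

# inside the chunk loop this would replace the set-based filtering:
#     chunk = filter_unseen(chunk.drop_duplicates(['Author ID']))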

Ami Tavory answered Nov 01 '22