Using Pandas how do I deduplicate a file being read in chunks?

I have a large fixed width file being read into pandas in chunks of 10000 lines. This works great for everything except removing duplicates from the data because the duplicates can obviously be in different chunks. The file is being read in chunks because it is too large to fit into memory in its entirety.

My first attempt at deduplicating the file was to bring in just the two columns needed to deduplicate it and make a list of rows to not read. Reading in just those two columns (out of about 500) easily fits in memory and I was able to use the id column to find duplicates and an eligibility column to decide which of the two or three with the same id to keep. I then used the skiprows flag of the read_fwf() command to skip those rows.

The problem I ran into is that the Pandas fixed width file reader doesn't work with skiprows = [list] and iterator = True at the same time.

So, how do I deduplicate a file being processed in chunks?
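
A sketch of the attempt described above, with placeholder column positions, names, and file name:

import pandas as pd

# Placeholder layout: positions of just the 'id' and 'eligibility' columns.
colspecs = [(0, 10), (10, 12)]
cols = pd.read_fwf('data.fwf', colspecs=colspecs,
                   names=['id', 'eligibility'], header=None)

# Placeholder logic: keep the highest-eligibility row per id,
# every other row number goes on the skip list.
cols = cols.sort_values('eligibility')
rows_to_skip = cols.index[cols.duplicated(subset=['id'], keep='last')].tolist()

# The combination reported not to work: a skiprows list together with iterator=True.
# (In practice the full set of colspecs for all ~500 columns would go here.)
reader = pd.read_fwf('data.fwf', colspecs=colspecs, names=['id', 'eligibility'],
                     header=None, skiprows=rows_to_skip,
                     iterator=True, chunksize=10000)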

asked Jun 04 '15 by Gregory Arenius


People also ask

How do you read data in chunks with Pandas?

To read large CSV files in chunks in Pandas, use the read_csv() method and specify the chunksize parameter. This is particularly useful if you are facing a MemoryError when trying to read in the whole DataFrame at once.
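
A minimal sketch, with a placeholder file name:

import pandas as pd

# Process the file 10,000 rows at a time instead of loading it all at once.
row_count = 0
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    row_count += len(chunk)   # replace with whatever per-chunk processing is needed
print(row_count)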

How does Pandas handle duplicates?

We can use the built-in Pandas method drop_duplicates() to drop duplicate rows. By default, this method returns a new DataFrame with the duplicate rows removed. We can set the argument inplace=True to remove the duplicates from the original DataFrame instead.
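
For example, with a small made-up frame:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 3], 'value': ['a', 'b', 'c', 'd', 'e']})

deduped = df.drop_duplicates(subset=['id'])      # new DataFrame; keeps the first of each id
df.drop_duplicates(subset=['id'], inplace=True)  # or drop them from the original in place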

Can Pandas have duplicate index values?

Yes, an index can contain duplicate values. Index.duplicated() indicates duplicate index values as True in the resulting array, and its keep argument controls whether all duplicates, all except the first, or all except the last occurrence are marked.
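
A short illustration of Index.duplicated() and its keep argument, using a made-up index:

import pandas as pd

idx = pd.Index(['a', 'b', 'a', 'c', 'a'])

idx.duplicated()              # [False, False, True, False, True]  -> all but the first 'a'
idx.duplicated(keep='last')   # [True, False, True, False, False]  -> all but the last 'a'
idx.duplicated(keep=False)    # [True, False, True, False, True]   -> every duplicated value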

What does Chunksize do in Pandas?

Sometimes, we use the chunksize parameter while reading large datasets to divide the dataset into chunks of data. We specify the size of these chunks with the chunksize parameter. This saves computational memory and improves the efficiency of the code.
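
For instance, a per-category count can be accumulated one chunk at a time; the file and column names below are placeholders:

import pandas as pd

# Only one chunk of 100,000 rows is held in memory at any moment.
counts = pd.Series(dtype='int64')
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    counts = counts.add(chunk['category'].value_counts(), fill_value=0)
print(counts.sort_values(ascending=False))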


1 Answer

My solution was to bring in just the columns needed to find the duplicates I want to drop and to build a bitmask from that information. Then, since I know the chunksize and which chunk I'm on, I reindex each chunk so that it matches the position it occupies in the bitmask. Finally I filter the chunk through the bitmask and the duplicate rows are dropped.

Bring in the entire column to deduplicate on, in this case 'id'. Then create a bitmask of the rows that AREN'T duplicates. DataFrame.duplicated() returns True for the rows that are duplicates, and the ~ inverts that. Now we have our 'dupemask'.
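
Reading just that column might look like this with read_fwf; the column positions and file name are placeholders:

import pandas as pd

# Read only the 'id' column of the fixed-width file (positions are hypothetical).
df = pd.read_fwf('data.fwf', colspecs=[(0, 10)], names=['id'], header=None)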

dupemask = ~df.duplicated(subset=['id'])

Then create an iterator to read the file in chunks (for example, read_fwf with a chunksize). Loop over the iterator and create a new index for each chunk. This new index matches the small chunk DataFrame with its position in the 'dupemask' bitmask, which can then be used to keep only the rows that aren't duplicates.

for i, df in enumerate(chunked_data_iterator):
    # Give the chunk the row numbers it occupies in the full file
    df.index = range(i * chunksize, i * chunksize + len(df.index))
    # Keep only the rows whose global position is marked True in 'dupemask'
    df = df[dupemask.loc[df.index]]

This approach only works in this case because the data is large on account of being so wide; it still has to read one column in its entirety in order to work.
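
Putting the pieces together, a minimal end-to-end sketch; the file name, column positions, column names, and chunk-by-chunk CSV output are assumed for illustration:

import pandas as pd

chunksize = 10000
colspecs = [(0, 10), (10, 25)]   # placeholder positions: 'id' plus one data column
names = ['id', 'value']

# Pass 1: read only the 'id' column and build the bitmask of rows to keep.
ids = pd.read_fwf('data.fwf', colspecs=[colspecs[0]], names=['id'], header=None)
dupemask = ~ids.duplicated(subset=['id'])

# Pass 2: read the full file in chunks, give each chunk its global row numbers,
# and keep only the rows the bitmask marks as non-duplicates.
reader = pd.read_fwf('data.fwf', colspecs=colspecs, names=names,
                     header=None, chunksize=chunksize)

for i, df in enumerate(reader):
    df.index = range(i * chunksize, i * chunksize + len(df))
    df = df[dupemask.loc[df.index]]
    # Write each filtered chunk out as we go so the result never sits in memory whole.
    df.to_csv('deduped.csv', mode='a', header=(i == 0), index=False)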

answered Sep 28 '22 by Gregory Arenius