Pandas Processing Large CSV Data

Question

I am processing a Large Data Set with at least 8GB in size using pandas.

I've encountered a problem in reading the whole set so I read the file chunk by chunk.

In my understanding, chunking the whole file will create many different dataframes. So using my existing routine, this only removes the duplicate values on that certain dataframe and not the duplicates on the whole file.

I need to remove the duplicates on this whole data set based on the ['Unique Keys'] column.

I tried to use the pd.concat but I also encountered a problem with the memory so I tried to write the file on a csv file and append all the results of the dataframes on it.

After running the code, the file doesn't reduce much so I think my assumption is right that the current routine is not removing all the duplicates based on the whole data set.

I'm a newbie in Python so it would really help if someone can point me in the right direction.

def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                                      low_memory=False)
    # new_df = pd.DataFrame()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'), mode='w', index=False, encoding='utf8')

FBruzzesi · Accepted Answer

If you can fit in memory the set of unique keys:

def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, 
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # create a set of (unique) ids
    all_ids = set()

    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        # Filter rows with key in all_ids
        df = df.loc[~df['Unique Keys'].isin(all_ids)]

        # Add new keys to the set
        all_ids = all_ids.union(set(df['Unique Keys'].unique()))

Pandas Processing Large CSV Data

Tags:

python

pandas

dataframe

BLitE.exe

1 Answers

FBruzzesi

Recent Activity

Donate For Us

Pandas Processing Large CSV Data

Tags:

python

pandas

dataframe

BLitE.exe

1 Answers

FBruzzesi

Related questions

Recent Activity

Donate For Us