Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Processing Large CSV Data

I am processing a Large Data Set with at least 8GB in size using pandas.

I've encountered a problem in reading the whole set so I read the file chunk by chunk.

In my understanding, chunking the whole file will create many different dataframes. So using my existing routine, this only removes the duplicate values on that certain dataframe and not the duplicates on the whole file.

I need to remove the duplicates on this whole data set based on the ['Unique Keys'] column.

I tried to use the pd.concat but I also encountered a problem with the memory so I tried to write the file on a csv file and append all the results of the dataframes on it.

After running the code, the file doesn't reduce much so I think my assumption is right that the current routine is not removing all the duplicates based on the whole data set.

I'm a newbie in Python so it would really help if someone can point me in the right direction.

def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                                      low_memory=False)
    # new_df = pd.DataFrame()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'), mode='w', index=False, encoding='utf8')
like image 580
BLitE.exe Avatar asked May 27 '26 08:05

BLitE.exe


1 Answers

If you can fit in memory the set of unique keys:

def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, 
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # create a set of (unique) ids
    all_ids = set()

    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        # Filter rows with key in all_ids
        df = df.loc[~df['Unique Keys'].isin(all_ids)]

        # Add new keys to the set
        all_ids = all_ids.union(set(df['Unique Keys'].unique()))
like image 51
FBruzzesi Avatar answered Jun 01 '26 14:06

FBruzzesi



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!