Pandas append performance: concat/append using "larger" DataFrames

Tags: python, pandas

The problem: I have data stored in csv files with the columns date/id/value. I have 15 files, each containing around 10-20 million rows. Each csv file covers a distinct period, so the time indexes are non-overlapping, but the columns do overlap (new ids enter from time to time, old ones disappear). What I originally did was run the script without the pivot call, but then I ran into memory issues on my local machine (only 8GB). Since there is a lot of redundancy in each file, pivot at first seemed a nice way out (roughly 2/3 less data), but now performance kicks in. If I run the following script, the concat function will run "forever" (so far I have always interrupted it manually after some time, >2h). Do concat/append have limitations in terms of size (I have roughly 10000-20000 columns), or am I missing something here? Any suggestions?

import pandas as pd
path = 'D:\\'
data = pd.DataFrame()
# loop through the list of raw file names (raw_files is defined elsewhere)
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')

    data = pd.concat([data, data_tmp])
    del data_tmp

EDIT I: To clarify, each csv file has about 10-20 million rows and three columns; after the pivot is applied, this reduces to about 2000 rows but leads to roughly 10000 columns.

I can solve the memory issue by simply splitting the full set of ids into subsets and running the needed calculations on each subset, as they are independent for each id. I know this makes me reload the same files n times, where n is the number of subsets used, but it is still reasonably fast. I still wonder why append is not performing.
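For reference, here is a minimal sketch of that subset workaround. The helper function and the `id_subsets` list (a partition of the full set of ids) are placeholders, not part of the original script; each pass re-reads all files but keeps only one subset of ids, so the pivoted frame stays narrow enough to fit in memory.

import pandas as pd

path = 'D:\\'

def process_subset(raw_files, id_subset):
    # read all files, keep only one subset of ids, pivot and concatenate
    frames = []
    for file in raw_files:
        tmp = pd.read_csv(path + file, engine='c',
                          compression='gzip',
                          usecols=['date', 'Value', 'ID'])
        tmp = tmp[tmp['ID'].isin(id_subset)]   # drop ids outside this subset
        frames.append(tmp.pivot(index='date', columns='ID', values='Value'))
    return pd.concat(frames)

# results = [process_subset(raw_files, ids) for ids in id_subsets]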

EDIT II: I have tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope it is clear; I didn't spend too much time minimizing simulation time, but it runs reasonably fast on my machine.

import string
import random
import pandas as pd
import numpy as np
import math

# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------

# Simulation column IDs
id_list = []
# ensure the pool of unique elements is larger than num_ids
for x in range(num_ids + round(num_ids*0.1)):
    id_list.append(''.join(
        random.choice(string.ascii_uppercase + string.digits) for _
        in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]

time_index = pd.bdate_range(start_date, end_date, freq='D')
chunk_size = math.ceil(len(time_index)/num_files)

data = []
#  Simulate files
for file in range(0, run_to_file):
    tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
    # TODO: not all cases covered, make sure ints are obtained
    tmp_ids = id_list[file * id_interval:
        start_ids + (file + 1) * id_interval]

    tmp_data = pd.DataFrame(np.random.standard_normal(
        (len(tmp_time), len(tmp_ids))), index=tmp_time,
        columns=tmp_ids)

    tmp_file = tmp_data.stack().sort_index(level=1).reset_index()  # sortlevel(1) in older pandas
    # final simulated data structure of the parsed csv file
    tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
                                        'ID', 0: 'Value'})

    # comment/uncomment if pivot takes place on aggregate level or not
    tmp_file = tmp_file.pivot(index='Date', columns='ID',
                              values='Value')
    data.append(tmp_file)

data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')
asked by MMCM_


1 Answer

Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatenating is very fast:

In [94]: df1, df2 = data[0], data[1]

In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop

In [99]: %%timeit
   ....: df1b, df2b = df1.align(df2, axis=1)
   ....: pd.concat([df1b, df2b])
   ....:
1 loops, best of 3: 686 ms per loop

The result of both approaches is the same.
The aligning is equivalent to:

common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)

So this is probably the easier approach when you have to deal with a full list of dataframes, as sketched below.
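A sketch of that, assuming `frames` is the list of pivoted dataframes collected in the loop (a placeholder name): build the union of all column indexes first, reindex every frame against it, and then do a single concat.

import pandas as pd

# `frames` is the list of pivoted dataframes collected in the loop (placeholder name)
all_columns = frames[0].columns
for df in frames[1:]:
    all_columns = all_columns.union(df.columns)

# reindex each frame to the common columns, then concatenate once
aligned = [df.reindex(columns=all_columns) for df in frames]
data = pd.concat(aligned)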

The reason that pd.concat is slower is that it does more work. E.g. when the column names are not equal, it checks for every column whether the dtype has to be upcast to hold the NaN values that get introduced by aligning the column names. By aligning yourself, you skip this. But in this case, where you are sure all columns have the same dtype, it is no problem.
That it is so much slower surprises me as well, but I will raise an issue about it.
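To make the upcasting point concrete, here is a small illustration (not from the original answer): concatenating two frames with disjoint integer columns forces both columns to float64, because NaNs are introduced for the missing values.

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

res = pd.concat([a, b])
print(res.dtypes)   # x and y both become float64 to hold the introduced NaNs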

answered by joris