Is there a fast way to convert a Pandas dataframe of columns to a list of strings?

This is somewhat the reverse of what most people want to do when converting between lists and dataframes.

I am looking to convert a large dataframe (10M+ rows, 20+ columns) to a list of strings, where each entry is a string representation of each row in the dataframe. I can do this using pandas' to_csv() method, but I'm wondering if there is a faster way as this is proving to be a bottleneck in my code.

Minimum working example:

import numpy as np
import pandas as pd

# Create the initial dataframe.
size = 10000000
cols = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)

# Convert to the required list structure
ret_val = df.to_csv(index=False, header=False).split("\n")[:-1]

The conversion step in the above code takes around 90 seconds for a dataframe of 10,000,000 rows on a single thread of my Core i9, and is CPU-bound. I'd love to reduce that by an order of magnitude if at all possible.

EDIT: I'm not looking to save the data to a .csv or to a file. I'm just looking to convert the dataframe to an array of strings.

EDIT: Input/Output example with only 5 columns:

In  [1]: df.head(10)
Out [1]:    a       b       c       d       e
         0  a_0     b_0     c_0     d_0     e_0
         1  a_1     b_1     c_1     d_1     e_1
         2  a_2     b_2     c_2     d_2     e_2
         3  a_3     b_3     c_3     d_3     e_3
         4  a_4     b_4     c_4     d_4     e_4
         5  a_5     b_5     c_5     d_5     e_5
         6  a_6     b_6     c_6     d_6     e_6
         7  a_7     b_7     c_7     d_7     e_7
         8  a_8     b_8     c_8     d_8     e_8
         9  a_9     b_9     c_9     d_9     e_9

In  [2]: ret_val[:10]
Out [2]: ['a_0,b_0,c_0,d_0,e_0',
          'a_1,b_1,c_1,d_1,e_1',
          'a_2,b_2,c_2,d_2,e_2',
          'a_3,b_3,c_3,d_3,e_3',
          'a_4,b_4,c_4,d_4,e_4',
          'a_5,b_5,c_5,d_5,e_5',
          'a_6,b_6,c_6,d_6,e_6',
          'a_7,b_7,c_7,d_7,e_7',
          'a_8,b_8,c_8,d_8,e_8',
          'a_9,b_9,c_9,d_9,e_9']
asked Dec 20 '25 by MikeFenton


2 Answers

I get roughly a 2.5× speedup with multiprocessing...

import multiprocessing

# df from OPs above code available in global scope

def fn(i):
    return df[i:i+1000].to_csv(index=False, header=False).split('\n')[:-1]

with multiprocessing.Pool() as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), 1000)):
        result.extend(a)

This reduces the overall time for 1M rows from 6.8 s to 2.8 s on my laptop, so it would hopefully scale further with the extra cores in an i9 CPU.

This relies on Unix fork semantics to share the dataframe with the child processes, and obviously does a bit more work, but it might help...
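
If you're not sure which start method your platform uses, here's a quick check:

import multiprocessing

# 'fork' lets child processes inherit the module-level `df` without pickling;
# it is the default on Linux. Windows (and macOS since Python 3.8) default to
# 'spawn', in which case the initializer pattern shown further down is needed.
print(multiprocessing.get_start_method())  # e.g. 'fork' on Linux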

Using the numpy.savetxt suggestion from Massifox together with multiprocessing takes this down to 2.0 seconds; just map the following function instead:

from io import StringIO

N = 1000  # chunk size, matching the 1000-row chunks used above

def fn2(i):
    with StringIO() as fd:
        np.savetxt(fd, df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

The result is otherwise basically the same.
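
For completeness, a minimal sketch of wiring fn2 into the same Pool.map loop as before (N is just the chunk size defined above, tune it to taste):

with multiprocessing.Pool() as pool:
    result = []
    for chunk in pool.map(fn2, range(0, len(df), N)):
        result.extend(chunk)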

Your comment that "the dataframe is a variable within a class" can be handled in a variety of ways. A simple one is to pass the dataframe to the Pool initializer, at which point it won't be pickled (under Unix anyway), and stash a reference to it in a global variable. This reference can then be used by each worker process, e.g.:

def stash_df(df):
    global the_df
    the_df = df

def fn(i):
    with StringIO() as fd:
        np.savetxt(fd, the_df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

with multiprocessing.Pool(initializer=stash_df, initargs=(df,)) as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), N)):
        result.extend(a)

This will be fine as long as each Pool is used with a single dataframe.

answered Dec 23 '25 by Sam Mason


You can try different ways to speed up writing the data out:

  1. Writing a compressed file can speed up writing by up to 10x:

    df.to_csv('output.csv.gz', header=True, index=False,
              chunksize=100000, compression='gzip', encoding='utf-8')
    Choose the chunksize that works best for you.

  2. Switch to HDF format:

    df.to_hdf(r'output.h5', key='df', mode='w')

  3. Following krassowski's answer, use numpy. For example, with the following df:

    df = pd.DataFrame({'A': range(1000000)})
    df['B'] = df.A + 1.0
    df['C'] = df.A + 2.0
    df['D'] = df.A + 3.0

    Pandas to csv:

    df.to_csv('pandas_to_csv', index=False)
    On my computer, this takes 6.45 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each).

    Numpy to csv:

    np.savetxt('numpy_savetxt', df.values, fmt='%d,%.1f,%.1f,%.1f',
               header=','.join(df.columns), comments='')
    On my computer, this takes 3.38 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

  4. Use Pandaral·lel.
    It is a simple and efficient tool that parallelizes your pandas computations across all your CPUs (Linux & macOS only), often speeding things up significantly with only one line of code. Cool! (A short sketch follows after this list.)

  5. You can think about replacing the pandas dataframe with a Dask dataframe. Its CSV API is very similar to pandas' (see the sketch below).
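
To make points 4 and 5 concrete, here are minimal sketches of each; the row-joining lambdas and the partition count are assumptions on my part, so tune them to your data:

    from pandarallel import pandarallel

    pandarallel.initialize()  # uses all available cores by default

    # Build the comma-joined row strings in parallel across cores.
    rows = df.parallel_apply(lambda r: ','.join(map(str, r)), axis=1).tolist()

And with Dask:

    import dask.dataframe as dd

    # Split the dataframe into partitions that Dask can process in parallel.
    ddf = dd.from_pandas(df, npartitions=8)
    rows = (
        ddf.apply(lambda r: ','.join(map(str, r)), axis=1, meta=('row', 'object'))
           .compute()
           .tolist()
    )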

answered Dec 23 '25 by Massifox


