I want to create a multi-index DataFrame
by reading a text file. Is it faster to create the MultiIndex first and then assign data to it from the text file using df.loc[[],[]]
, or to concatenate rows to the DataFrame
and set its index
at the end? Or is it faster to use a list or dict
to store the data as it's read from the file, and then create a DataFrame
from that? Is there a more Pythonic or faster option?
Example text file:
A = 1
B = 1
C data
0 1
1 2
A = 1
B = 2
C data
1 3
2 4
A = 2
B = 1
C data
0 5
2 6
Output DataFrame:
A B C data
1 1 0 1
1 2
1 2 1 3
2 4
2 1 0 5
2 6
Update Jan 18: This is linked to How to parse complex text files using Python? I also wrote a blog article explaining how to parse complex files, aimed at beginners.
Element-by-element lookups in pandas are expensive, and so is aligning by index. I would read everything into arrays, create a DataFrame of values, and then set the hierarchical index directly. This is usually much faster if you can avoid append or per-element lookups.
Here is a sample result, assuming you have a 2-D dataset array with everything lumped in:
In [106]: dataset
Out[106]:
array([[1, 1, 0, 1],
[1, 1, 1, 2],
[1, 2, 1, 3],
[1, 2, 2, 4],
[2, 1, 0, 5],
[2, 1, 2, 6]])
In [107]: pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
...:
Out[107]:
data
A B C
1 1 0 1
1 2
2 1 3
2 4
2 1 0 5
2 6
In [108]: data_values = dataset[:, 3]
...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
...: pd.DataFrame(data_values, columns=['data'], index=data_index)
...:
Out[108]:
data
A B C
1 1 0 1
1 2
2 1 3
2 4
2 1 0 5
2 6
In [109]: %timeit pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
1000 loops, best of 3: 1.75 ms per loop
In [110]: %%timeit
...: data_values = dataset[:, 3]
...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
...: pd.DataFrame(data_values, columns=['data'], index=data_index)
...:
1000 loops, best of 3: 642 µs per loop
Parsing the text file will be the bulk of your processing overhead.
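For the format shown in the question, a minimal sketch of such a parser (the function name parse_blocks and the embedded SAMPLE string are my own, not from the question; the idea is to accumulate plain tuples and convert to an array once at the end, rather than appending to a DataFrame row by row):

```python
import io

import numpy as np
import pandas as pd

SAMPLE = """\
A = 1
B = 1
C data
0 1
1 2
A = 1
B = 2
C data
1 3
2 4
A = 2
B = 1
C data
0 5
2 6
"""

def parse_blocks(lines):
    """Collect (A, B, C, data) rows from the block-structured text."""
    rows = []
    a = b = None
    for line in lines:
        line = line.strip()
        if line.startswith('A ='):
            a = int(line.split('=')[1])      # remember current A
        elif line.startswith('B ='):
            b = int(line.split('=')[1])      # remember current B
        elif not line or line == 'C data':
            continue                         # blank line or column header
        else:
            c, value = (int(tok) for tok in line.split())
            rows.append((a, b, c, value))    # one flat row per data line
    return np.array(rows)

dataset = parse_blocks(io.StringIO(SAMPLE))
df = pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])
```

For a real file, pass `open(path)` instead of the `StringIO` wrapper; the loop only needs an iterable of lines.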
If speed is the main concern, I'd suggest using pickle or shelve to store the DataFrame object in a binary file, ready for use.
If you need the text file for any other reason, a separate module could be written for translating between the formats.
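A short sketch of the pickle round-trip via the pandas wrappers to_pickle/read_pickle (the file name parsed.pkl is illustrative; parse the text once, then later runs load the binary file directly):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# The same dataset as in the answer above.
dataset = np.array([[1, 1, 0, 1], [1, 1, 1, 2], [1, 2, 1, 3],
                    [1, 2, 2, 4], [2, 1, 0, 5], [2, 1, 2, 6]])
df = pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])

# Serialize once after the slow text parse ...
path = os.path.join(tempfile.gettempdir(), 'parsed.pkl')
df.to_pickle(path)

# ... and reload in later runs, skipping the parser entirely.
restored = pd.read_pickle(path)
```

The MultiIndex survives the round-trip, so no re-indexing is needed on load.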