
What's the fastest way of reading data from a text file and allocating it to a data frame?

I want to create a multi-index DataFrame by reading a text file. Is it faster to create the multi-index first and then allocate data to it from the text file using df.loc[[],[]], or to concatenate rows to the DataFrame and set the index at the end? Or is it faster to store the data in a list or dict as it's read from the file, and then create a DataFrame from that? Is there a more pythonic or faster option?

Example text file:

A = 1
 B = 1
  C data
  0 1
  1 2
A = 1
 B = 2
  C data
  1 3
  2 4
A = 2
 B = 1
  C data
  0 5
  2 6

Output DataFrame:

       data
A B C
1 1 0     1
    1     2
  2 1     3
    2     4
2 1 0     5
    2     6

Update Jan 18: This is linked to the question How to parse complex text files using Python? I also wrote a blog article explaining to beginners how to parse complex files.

asked Mar 16 '17 by bluprince13

2 Answers

Element-by-element lookups in pandas are expensive, and so is aligning by index. I would read everything into arrays, create a DataFrame of the values, and then set the hierarchical index directly. This is usually much faster if you can avoid append and per-element lookups.

Here is a sample result, assuming you have a 2-D array dataset with everything lumped in:

In [106]: dataset
Out[106]: 
array([[1, 1, 0, 1],
       [1, 1, 1, 2],
       [1, 2, 1, 3],
       [1, 2, 2, 4],
       [2, 1, 0, 5],
       [2, 1, 2, 6]])

In [107]: pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
     ...: 
Out[107]: 
       data
A B C      
1 1 0     1
    1     2
  2 1     3
    2     4
2 1 0     5
    2     6

In [108]: data_values = dataset[:, 3] 
     ...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
     ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
     ...: 
Out[108]: 
       data
A B C      
1 1 0     1
    1     2
  2 1     3
    2     4
2 1 0     5
    2     6

In [109]: %timeit pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
1000 loops, best of 3: 1.75 ms per loop

In [110]: %%timeit
     ...: data_values = dataset[:, 3] 
     ...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
     ...: pd.DataFrame(data_values, columns=['data'], index=data_index)
     ...: 
1000 loops, best of 3: 642 µs per loop
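Getting to the dataset array from the example file is the remaining step. A minimal parsing sketch, assuming the file always follows the "A =", "B =", "C data" nesting shown in the question (the parse helper and variable names are illustrative, not from the original answer):

```python
import io

import numpy as np
import pandas as pd

# Example input in the question's format, kept in a string for demonstration;
# in practice you would pass open('file.txt') instead of io.StringIO(text).
text = """A = 1
 B = 1
  C data
  0 1
  1 2
A = 1
 B = 2
  C data
  1 3
  2 4
A = 2
 B = 1
  C data
  0 5
  2 6
"""

def parse(lines):
    """Flatten the nested text format into rows of (A, B, C, data)."""
    rows, a, b = [], None, None
    for line in lines:
        line = line.strip()
        if line.startswith('A ='):
            a = int(line.split('=')[1])      # new A block
        elif line.startswith('B ='):
            b = int(line.split('=')[1])      # new B block under current A
        elif line.startswith('C'):
            continue                         # skip the "C data" header line
        elif line:
            c, d = map(int, line.split())    # data row under current A, B
            rows.append((a, b, c, d))
    return np.array(rows)

dataset = parse(io.StringIO(text))
df = (pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data'])
        .set_index(['A', 'B', 'C']))
```

This produces the same dataset array as In [106] above, so either of the two construction approaches can then be applied to it.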
answered Nov 11 '22 by clocker

Parsing the text file will be the bulk of your processing overhead.

If speed is the main concern I'd suggest using pickle or shelve to store the DataFrame object in a binary file ready for use.

If you need to use the text file for any reason, a separate module could be written for translating between formats.

answered Nov 11 '22 by Scott Schlechtleitner