I want to create a multi-index DataFrame
by reading a text file. Is it faster to create the MultiIndex first and then assign data to it from the text file using df.loc[[],[]]
, or to concatenate rows to the DataFrame
and set its index
at the end? Or is it faster to use a list or dict
to store the data as it's read from the file, and then create a DataFrame
from that? Is there a more Pythonic or faster option?
Example text file:
A = 1
B = 1
C data
0 1
1 2
A = 1
B = 2
C data
1 3
2 4
A = 2
B = 1
C data
0 5
2 6
Output DataFrame:
A B C data
1 1 0 1
1 2
1 2 1 3
2 4
2 1 0 5
2 6
Update Jan 18: This is linked to How to parse complex text files using Python? I also wrote a blog article explaining how to parse complex files, aimed at beginners.
Element-by-element lookups in pandas are expensive, and so is aligning by index. I would read everything into arrays, create a DataFrame of values, and then set the hierarchical index directly. This is usually much faster if you can avoid append or per-element lookups.
Here is a sample result, assuming you have a 2-D dataset array with everything lumped in:
In [106]: dataset
Out[106]:
array([[1, 1, 0, 1],
[1, 1, 1, 2],
[1, 2, 1, 3],
[1, 2, 2, 4],
[2, 1, 0, 5],
[2, 1, 2, 6]])
In [107]: pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
...:
Out[107]:
data
A B C
1 1 0 1
1 2
2 1 3
2 4
2 1 0 5
2 6
In [108]: data_values = dataset[:, 3]
...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
...: pd.DataFrame(data_values, columns=['data'], index=data_index)
...:
Out[108]:
data
A B C
1 1 0 1
1 2
2 1 3
2 4
2 1 0 5
2 6
In [109]: %timeit pd.DataFrame(dataset,columns=['A','B','C', 'data']).set_index(['A', 'B', 'C'])
1000 loops, best of 3: 1.75 ms per loop
In [110]: %%timeit
...: data_values = dataset[:, 3]
...: data_index = pd.MultiIndex.from_arrays( dataset[:,:3].T, names=list('ABC'))
...: pd.DataFrame(data_values, columns=['data'], index=data_index)
...:
1000 loops, best of 3: 642 µs per loop
Parsing the text file will be the bulk of your processing overhead.
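For the format shown in the question, a minimal sketch of such a parser (the function name parse_blocks and the embedded SAMPLE string are my own, not from the question; the idea is to accumulate plain tuples and convert to an array once at the end, rather than appending to a DataFrame row by row):

```python
import io

import numpy as np
import pandas as pd

SAMPLE = """\
A = 1
B = 1
C data
0 1
1 2
A = 1
B = 2
C data
1 3
2 4
A = 2
B = 1
C data
0 5
2 6
"""

def parse_blocks(lines):
    """Collect (A, B, C, data) rows from the block-structured text."""
    rows = []
    a = b = None
    for line in lines:
        line = line.strip()
        if line.startswith('A ='):
            a = int(line.split('=')[1])      # remember current A
        elif line.startswith('B ='):
            b = int(line.split('=')[1])      # remember current B
        elif not line or line == 'C data':
            continue                         # blank line or column header
        else:
            c, value = (int(tok) for tok in line.split())
            rows.append((a, b, c, value))    # one flat row per data line
    return np.array(rows)

dataset = parse_blocks(io.StringIO(SAMPLE))
df = pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])
```

For a real file, pass `open(path)` instead of the `StringIO` wrapper; the loop only needs an iterable of lines.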
If speed is the main concern, I'd suggest using pickle or shelve to store the DataFrame object in a binary file, ready for use.
If you need the text file for any other reason, a separate module could be written for translating between the formats.
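A short sketch of the pickle round-trip via the pandas wrappers to_pickle/read_pickle (the file name parsed.pkl is illustrative; parse the text once, then later runs load the binary file directly):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# The same dataset as in the answer above.
dataset = np.array([[1, 1, 0, 1], [1, 1, 1, 2], [1, 2, 1, 3],
                    [1, 2, 2, 4], [2, 1, 0, 5], [2, 1, 2, 6]])
df = pd.DataFrame(dataset, columns=['A', 'B', 'C', 'data']).set_index(['A', 'B', 'C'])

# Serialize once after the slow text parse ...
path = os.path.join(tempfile.gettempdir(), 'parsed.pkl')
df.to_pickle(path)

# ... and reload in later runs, skipping the parser entirely.
restored = pd.read_pickle(path)
```

The MultiIndex survives the round-trip, so no re-indexing is needed on load.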