Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to store the name of rows and column index in pandas DataFrame?

Tags:

python

pandas

I have a DataFrame with named rows and columns indexes:

import numpy as np
import pandas as pd

I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
                  index=I,
                  columns=C)

I have tried to store it in several formats (Excel, CSV) but when re-reading the file, the names are lost (maybe I have missed some options). Msgpack works but it is marked as experimental so I would prefer to avoid it for now. I would also prefer to avoid pickle. Is there any way (format and option) to store the name of the 2 indexes?

EDIT: I know how to write and read CSV with pandas. The problem is to save the name of the column index and of the row index.

like image 527
Mathieu Dubois Avatar asked Jan 27 '16 20:01

Mathieu Dubois


2 Answers

You can use hdf.

import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="columns")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)
print(df)

columns      col0      col1      col2
rows                                 
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661

df.to_hdf('test.hdf', 'test')
print(pd.read_hdf('test.hdf'))

columns      col0      col1      col2
rows                                 
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661
like image 112
Stop harming Monica Avatar answered Sep 28 '22 02:09

Stop harming Monica


You can export the DataFrame to a csv-file using .to_csv() and read it back in using .read_csv(). I extended the code you already had as follows:

#!/usr/bin/env python3
# coding: utf-8

import numpy as np
import pandas as pd

I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)

# export DataFrame to csv
df.to_csv('out.csv')

# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)

So the DataFrame df looks like this:

       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628

The csv-file out.csv looks like this:

,col0,col1,col2
a,0.5900160748408918,0.8340332218911729,0.5353103406507513
b,0.42158899389955884,0.8973015040807538,0.029500416731096046
c,0.37357951184145965,0.10900495955642386,0.2391805787788026
d,0.47387186813644167,0.07591794371425187,0.7516279365972057

Reading the data back in leads to the DataFrame df_in as follows:

       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628

So df2 is exactly the same as df which shows that export and the desired import is working as expected.

EDIT to export column and index names:

df.to_csv('out.csv', index_label=[df.index.name, df.columns.name])

However, this makes re-importing a bit difficult since the columns name is added as a additional column. Normally, this is useful for multi-indexed data, but leads to an additional empty column here.

So I would suggest to export the index name, only:

# export DataFrame to csv
df.to_csv('out.csv', index_label=df.index.name)

# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)

which leads to df_in as:

          col0      col1      col2
rows                              
a     0.442467  0.959260  0.626502
b     0.639044  0.989795  0.853002
c     0.576137  0.350260  0.532920
d     0.235698  0.095978  0.194151

I do not know why you need to export the names of both index and colums. If you simply want to access the row or column names you can get their label like this:

column_labels = df.columns.get_values()
>>> array(['col0', 'col1', 'col2'], dtype=object)

index_labels = df.index.get_values()
>>> array(['a', 'b', 'c', 'd'], dtype=object)
like image 32
albert Avatar answered Sep 28 '22 02:09

albert