Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas - How to save only selected columns of a DataFrame to HDF5

I'm reading a csv sample file and store it on .h5 database. The .csv is structured as follows:

User_ID;Longitude;Latitude;Year;Month;String
267261661;-3.86580025;40.32170825;2013;12;hello world
171255468;-3.83879575;40.05035005;2013;12;hello world
343588169;-3.70759531;40.4055946;2014;2;hello world
908779052;-3.8356385;40.1249459;2013;8;hello world
289540518;-3.6723114;40.3801642;2013;11;hello world
635876313;-3.8323166;40.3379393;2012;10;hello world
175160914;-3.53687933;40.35101274;2013;12;hello world 
155029860;-3.68555076;40.47688417;2013;11;hello world

I've putting it on a .h5 store with the pandas to_hdf, selecting to pass to the .h5 only a couple of columns:

import pandas as pd

df = pd.read_csv(filename + '.csv', sep=';')

df.to_hdf('test.h5','key1',format='table',data_columns=['User_ID','Year'])

I've obtained different results in the columns stored in the .h5 file using HDFStore and read_hdf, in particular:

store = pd.HDFStore('test.h5')
>>> store
>>> <class 'pandas.io.pytables.HDFStore'>
File path: /test.h5
/key1            frame_table  (typ->appendable,nrows->8,ncols->6,indexers->[index],dc->[User_ID,Year])

which is what I expect (only the 'User_ID' and 'Year' columns stored in the database), althought the ncols->6 means that actually all the columns have been stored in the .h5 file.

If I try reading the file with pd.read_hdf:

hdf = pd.read_hdf('test.h5','key1')

and asking for the keys:

hdf.keys()
>>> Index([u'User_ID', u'Longitude', u'Latitude', u'Year', u'Month', u'String'], dtype='object')

which is not what I'm expected since all columns of the original .csv file are still in the .h5 database. How can I store only a selection of columns in the .h5 in order to reduce the size of the database?

Thanks for your help.

like image 671
Fabio Lamanna Avatar asked Jan 10 '15 17:01

Fabio Lamanna


People also ask

How do I select only certain columns in a DataFrame?

To select a single column, use square brackets [] with the column name of the column of interest.

How do I get certain columns from a data frame?

You can use the loc and iloc functions to access columns in a Pandas DataFrame. Let's see how. If we wanted to access a certain column in our DataFrame, for example the Grades column, we could simply use the loc function and specify the name of the column in order to retrieve it.

How do I select columns to keep in pandas?

Selecting columns based on their name This is the most basic way to select a single column from a dataframe, just put the string name of the column in brackets. Returns a pandas series. Passing a list in the brackets lets you select multiple columns at the same time.


1 Answers

just select out the columns as you write to the file.

cols_to_keep = ['User_ID', 'Year']
df.loc[:, cols_to_keep].to_hdf(...)
like image 66
Paul H Avatar answered Nov 14 '22 21:11

Paul H