I'm reading a csv sample file and store it on .h5 database. The .csv is structured as follows:
User_ID;Longitude;Latitude;Year;Month;String
267261661;-3.86580025;40.32170825;2013;12;hello world
171255468;-3.83879575;40.05035005;2013;12;hello world
343588169;-3.70759531;40.4055946;2014;2;hello world
908779052;-3.8356385;40.1249459;2013;8;hello world
289540518;-3.6723114;40.3801642;2013;11;hello world
635876313;-3.8323166;40.3379393;2012;10;hello world
175160914;-3.53687933;40.35101274;2013;12;hello world
155029860;-3.68555076;40.47688417;2013;11;hello world
I've putting it on a .h5 store with the pandas to_hdf, selecting to pass to the .h5 only a couple of columns:
import pandas as pd
df = pd.read_csv(filename + '.csv', sep=';')
df.to_hdf('test.h5','key1',format='table',data_columns=['User_ID','Year'])
I've obtained different results in the columns stored in the .h5 file using HDFStore and read_hdf, in particular:
store = pd.HDFStore('test.h5')
>>> store
>>> <class 'pandas.io.pytables.HDFStore'>
File path: /test.h5
/key1 frame_table (typ->appendable,nrows->8,ncols->6,indexers->[index],dc->[User_ID,Year])
which is what I expect (only the 'User_ID' and 'Year' columns stored in the database), althought the ncols->6 means that actually all the columns have been stored in the .h5 file.
If I try reading the file with pd.read_hdf:
hdf = pd.read_hdf('test.h5','key1')
and asking for the keys:
hdf.keys()
>>> Index([u'User_ID', u'Longitude', u'Latitude', u'Year', u'Month', u'String'], dtype='object')
which is not what I'm expected since all columns of the original .csv file are still in the .h5 database. How can I store only a selection of columns in the .h5 in order to reduce the size of the database?
Thanks for your help.
To select a single column, use square brackets [] with the column name of the column of interest.
You can use the loc and iloc functions to access columns in a Pandas DataFrame. Let's see how. If we wanted to access a certain column in our DataFrame, for example the Grades column, we could simply use the loc function and specify the name of the column in order to retrieve it.
Selecting columns based on their name This is the most basic way to select a single column from a dataframe, just put the string name of the column in brackets. Returns a pandas series. Passing a list in the brackets lets you select multiple columns at the same time.
just select out the columns as you write to the file.
cols_to_keep = ['User_ID', 'Year']
df.loc[:, cols_to_keep].to_hdf(...)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With