I'm importing large amounts of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic thus far has been to read the parsed lines into a DataFrame, then store the DataFrame in the HDFStore. My goal is for the index to be unique within a single key in the HDFStore, but each DataFrame restarts its own index values. I was expecting HDFStore.append() to have some mechanism to ignore the DataFrame's index values and just keep adding to my HDFStore key's existing index values, but I cannot find one.

How do I import DataFrames and ignore the index values contained therein while having the HDFStore increment its existing index values? The sample code below batches every 10 lines; the real thing would naturally use larger batches.
if hd_file_name:
    """
    HDF5 output file specified.
    """
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print(hdf_output)

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result',
               'response_size', 'referrer', 'user_agent', 'response_time']

    source_name = str(log_file.name.rsplit('/')[-1])  # HDF5 Tables don't play nice with unicode so explicit str(). :(

    batch = []

    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)

        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)

        # Write out a full batch every 10 lines.
        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []

    # Write out whatever is left over after the last full batch.
    if (count % 10):
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)
You can do it like this. The only trick is that the first time through, the store table doesn't exist yet, so get_storer will raise.
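import pandas as pd
import numpy as np
import os

# Create two small CSV files to import.
files = ['test1.csv','test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10,2),columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

# HDFStore works as a context manager and closes the file on exit.
with pd.HDFStore(path) as store:
    for f in files:
        df = pd.read_csv(f,index_col=0)
        try:
            # Number of rows already stored under 'foo'.
            nrows = store.get_storer('foo').nrows
        except (KeyError, AttributeError):
            # First pass: the table does not exist yet.
            nrows = 0

        # Shift this frame's index past the rows already in the store.
        df.index = pd.Series(df.index) + nrows
        store.append('foo',df)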
In [10]: pd.read_hdf('test.h5','foo')
Out[10]:
           A         B
0   0.772017  0.153381
1   0.304131  0.368573
2   0.995465  0.799655
3  -0.326959  0.923280
4  -0.808376  0.449645
5  -1.336166  0.236968
6  -0.593523 -0.359080
7  -0.098482  0.037183
8   0.315627 -1.027162
9  -1.084545 -1.922288
10  0.412407 -0.270916
11  1.835381 -0.737411
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
15  1.181344  0.354411
16  0.501892 -0.358361
17  0.633256  0.419397
18  0.932354 -0.603932
19 -0.341135  2.453220
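For the batched log import in the question, the same offset idea could be applied without querying the store on every batch, by keeping a running row count instead. The following is only a minimal sketch under that assumption: `batched_frames` and `_frame` are hypothetical helper names, not part of pandas, and `start` would need to be set to the existing row count (for example from `get_storer(...).nrows` as above) when appending to a table that already holds data.

```python
import pandas as pd


def _frame(batch, columns, offset):
    # Hypothetical helper: build a frame whose index starts at `offset`.
    index = pd.RangeIndex(offset, offset + len(batch))
    return pd.DataFrame(batch, columns=columns, index=index)


def batched_frames(rows, columns, batch_size, start=0):
    """Yield DataFrames of up to `batch_size` rows with one continuous index.

    `start` is the index value of the first row, e.g. the number of rows
    already stored under the target key (0 for a new table).
    """
    offset = start
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield _frame(batch, columns, offset)
            offset += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield _frame(batch, columns, offset)


if __name__ == "__main__":
    rows = ([i, i * i] for i in range(25))
    for df in batched_frames(rows, ["n", "square"], batch_size=10):
        # In the question's code this is where
        # hdf_output.append(KEY_NAME, df) would go.
        print(df.index[0], df.index[-1])
```

Each yielded frame carries index values that continue from the previous batch, so the frames can be appended one after another without any of them restarting at zero.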
You don't necessarily need a globally unique index (unless you want one), as HDFStore (through PyTables) provides one by uniquely numbering rows. You can always add these selection parameters.
In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]:
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
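Because rows can be addressed by number through start and stop, a large table could also be read back in fixed-size row ranges. Here is a small sketch of generating those ranges. `row_ranges` is a hypothetical helper name, and the `read_hdf` call appears only in a comment because it depends on the `test.h5` file written earlier.

```python
def row_ranges(nrows, chunk_size):
    """Yield (start, stop) row-number pairs covering rows 0 .. nrows - 1."""
    for start in range(0, nrows, chunk_size):
        yield start, min(start + chunk_size, nrows)


# Hypothetical usage against the table written above (requires PyTables):
#   for start, stop in row_ranges(20, 8):
#       chunk = pd.read_hdf('test.h5', 'foo', start=start, stop=stop)
print(list(row_ranges(20, 8)))  # [(0, 8), (8, 16), (16, 20)]
```

The stop value is exclusive, matching the selection shown above, where start=12 and stop=15 returned rows 12 through 14.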