How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

I'm importing large volumes of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic so far has been to read the parsed lines into a DataFrame and then append that DataFrame to the HDFStore. My goal is for the index to be unique within a single key in the HDFStore, but each DataFrame restarts its own index at zero. I was expecting HDFStore.append() to have some mechanism to ignore the DataFrame's index values and keep adding to the existing index values of my HDFStore key, but I cannot find one. How do I import DataFrames and ignore the index values they contain while having the HDFStore continue incrementing its existing index values? The sample code below batches every 10 lines; naturally the real thing would use larger batches.

import pandas as pd

if hd_file_name:
    # HDF5 output file specified.
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print(hdf_output)

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result',
               'response_size', 'referrer', 'user_agent', 'response_time']

    # HDF5 Tables don't play nice with unicode so explicit str(). :(
    source_name = str(log_file.name.rsplit('/')[-1])

    batch = []

    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)

        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)

        # Flush a full batch; each new DataFrame's index restarts at 0.
        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []

    # Flush any leftover partial batch.
    if batch:
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)

asked Jun 08 '13 07:06

Ben Scherrey

1 Answer

You can do it like this. The only trick is that on the first pass the table does not exist yet, so the get_storer lookup will fail; treat that case as zero existing rows.

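import pandas as pd
import numpy as np
import os

# Create two sample CSV files, each with its own 0..9 index.
files = ['test1.csv','test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10,2),columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

# pd.get_store was removed from later pandas releases; pd.HDFStore is itself a context manager.
with pd.HDFStore(path) as store:
    for f in files:
        df = pd.read_csv(f,index_col=0)
        try:
            # Number of rows already stored under 'foo'.
            nrows = store.get_storer('foo').nrows
        except (KeyError, AttributeError):
            # First pass: the table does not exist yet.
            nrows = 0

        # Shift this frame's index so it continues after the stored rows.
        df.index = df.index + nrows
        store.append('foo',df)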


In [10]: pd.read_hdf('test.h5','foo')
Out[10]: 
           A         B
0   0.772017  0.153381
1   0.304131  0.368573
2   0.995465  0.799655
3  -0.326959  0.923280
4  -0.808376  0.449645
5  -1.336166  0.236968
6  -0.593523 -0.359080
7  -0.098482  0.037183
8   0.315627 -1.027162
9  -1.084545 -1.922288
10  0.412407 -0.270916
11  1.835381 -0.737411
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
15  1.181344  0.354411
16  0.501892 -0.358361
17  0.633256  0.419397
18  0.932354 -0.603932
19 -0.341135  2.453220
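
For the batched log import in the question, the same idea can be kept out of the parsing loop. The following is only a minimal sketch, not the answer's own code: iter_batches and append_rows are hypothetical helper names, and it assumes the rows arrive as lists that match the column names. The running offset is read from the store once and then tracked in Python rather than queried for every batch.

```python
import itertools

import pandas as pd


def iter_batches(rows, columns, batch_size, start=0):
    """Yield DataFrames of at most batch_size rows whose index continues from start."""
    rows = iter(rows)
    offset = start
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            return
        df = pd.DataFrame(batch, columns=columns)
        # Replace the default 0..n-1 index with one that continues the numbering.
        df.index = pd.RangeIndex(offset, offset + len(df))
        offset += len(df)
        yield df


def append_rows(path, key, rows, columns, batch_size=10):
    """Append rows to the table stored under key, continuing its row numbering."""
    with pd.HDFStore(path) as store:
        # Existing row count, or 0 when the table has not been created yet.
        start = store.get_storer(key).nrows if key in store else 0
        for df in iter_batches(rows, columns, batch_size, start):
            store.append(key, df)
```

Calling append_rows again on the same file and key would pick up from the stored row count, so the index stays unique across separate import runs as long as nothing else writes to that key in between.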

You don't necessarily need a globally unique index (unless you want one), because HDFStore (through PyTables) already numbers the rows of a table uniquely. You can always select by row number by passing the start and stop parameters.

In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]: 
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
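
To illustrate the difference between row numbers and index labels, here is a small sketch, assuming PyTables is installed. The helpers build_store, by_position and by_label are hypothetical names introduced for this example. The store is deliberately filled with batches whose index restarts at 0, as in the question, so start/stop still address each row uniquely while a query on the index labels matches rows from every batch.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def build_store(path, key="foo", batches=2, rows=10):
    """Append batches whose index restarts at 0 each time (the question's situation)."""
    with pd.HDFStore(path, mode="w") as store:
        for _ in range(batches):
            df = pd.DataFrame(np.random.randn(rows, 2), columns=list("AB"))
            store.append(key, df)  # default index 0..rows-1 for every batch


def by_position(path, key, start, stop):
    """Select by row number in the table, regardless of the index labels."""
    return pd.read_hdf(path, key, start=start, stop=stop)


def by_label(path, key, lo, hi):
    """Select by the stored index labels, which may repeat across batches."""
    return pd.read_hdf(path, key, where=f"(index >= {lo}) & (index < {hi})")
```

With two batches of ten rows, by_position(path, "foo", 12, 15) returns three rows from the second batch, whereas by_label(path, "foo", 2, 5) returns matching rows from both batches. If label-based queries need to be unambiguous, offsetting the index before appending is the safer choice.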

answered Oct 11 '22 16:10

Jeff

Tags: python, indexing, pandas, dataframe, hdfstore
