I have multiple huge tsv files that I'm trying to process using pandas. I want to group by 'col3' and 'col5'. I've tried this:
import pandas as pd

df = pd.read_csv('filename.txt', sep='\t')
g2 = df.drop_duplicates(['col3', 'col5'])
g3 = g2.groupby(['col3', 'col5']).size().sum(level=0)
print(g3)
It works fine so far and prints an output like this:
yes 2
no 2
I'd like to be able to aggregate the output from multiple files, i.e., to group by these two columns across all the files at once and print one combined output with the total number of occurrences of 'yes', 'no', or whatever the attribute may be. In other words, I'd now like to use groupby on multiple files at once. If a file doesn't have one of these columns, it should be skipped and the process should move on to the next file.
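Conceptually, something along these lines is what I have in mind (a rough sketch; the '*.txt' glob pattern is just a placeholder for my files):

import glob
import pandas as pd

total = None
for path in glob.glob('*.txt'):  # placeholder pattern for the input files
    df = pd.read_csv(path, sep='\t')
    if not {'col3', 'col5'}.issubset(df.columns):
        continue  # file lacks one of the columns, skip it
    counts = (df.drop_duplicates(['col3', 'col5'])
                .groupby(['col3', 'col5'])
                .size()
                .groupby(level=0)  # same result as .sum(level=0) above
                .sum())
    total = counts if total is None else total.add(counts, fill_value=0)
print(total)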
This is a nice use case for blaze.
Here's an example using a couple of reduced files from the nyctaxi dataset. I've purposely split a single large file into two files of 1,000,000 lines each:
In [16]: from blaze import Data, compute, by
In [17]: ls
trip10.csv trip11.csv
In [18]: d = Data('*.csv')
In [19]: expr = by(d[['passenger_count', 'medallion']], avg_time=d.trip_time_in_secs.mean())
In [20]: %time result = compute(expr)
CPU times: user 3.22 s, sys: 393 ms, total: 3.61 s
Wall time: 3.6 s
In [21]: !du -h *
194M trip10.csv
192M trip11.csv
In [22]: len(d)
Out[22]: 2000000
In [23]: result.head()
Out[23]:
passenger_count medallion avg_time
0 0 08538606A68B9A44756733917323CE4B 0
1 0 0BB9A21E40969D85C11E68A12FAD8DDA 15
2 0 9280082BB6EC79247F47EB181181D1A4 0
3 0 9F4C63E44A6C97DE0EF88E537954FC33 0
4 0 B9182BF4BE3E50250D3EAB3FD790D1C9 14
Note: This will perform the computation with pandas, using pandas' own chunked CSV reader. If your files are in the GB range you're better off converting to a format such as bcolz or PyTables, as these are binary formats designed for data analysis on huge files. CSVs are just blobs of text with conventions.
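For example, one way to do the CSV-to-PyTables conversion is with pandas' chunked reader and an HDFStore; this is only a sketch, with the file names and chunk size as placeholders:

import pandas as pd

with pd.HDFStore('trips.h5') as store:
    for chunk in pd.read_csv('trip10.csv', chunksize=500000):
        # append each chunk to an on-disk, queryable PyTables table
        store.append('trips', chunk, data_columns=True)

Depending on your string columns you may need to pass min_itemsize to append. Blaze can then be pointed at the resulting binary store instead of a glob of CSVs.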