pandas groupby for multiple data frames/files at once

I have multiple huge tsv files that I'm trying to process using pandas. I want to group by 'col3' and 'col5'. I've tried this:

import pandas as pd

df = pd.read_csv('filename.txt', sep='\t')
# keep one row per unique (col3, col5) pair, then count pairs per col3 value
g2 = df.drop_duplicates(['col3', 'col5'])
g3 = g2.groupby(['col3', 'col5']).size().sum(level=0)
print(g3)

It works fine so far and prints an output like this:

yes 2
no  2

I'd like to aggregate the output across multiple files, i.e., group by these two columns in all the files at once and print one combined output with the total number of occurrences of 'yes', 'no', or whatever values that attribute takes. In other words, I'd now like to run the groupby over many files at once. If a file doesn't have one of these columns, it should be skipped, and processing should move on to the next file.
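Conceptually, I'm after something like this plain-pandas sketch (the '*.txt' glob pattern, the tab separator, and the column check are my assumptions about how it might look, not code I've verified on the real files):

import glob
import pandas as pd

frames = []
for path in glob.glob('*.txt'):  # hypothetical pattern matching the tsv files
    df = pd.read_csv(path, sep='\t')
    if not {'col3', 'col5'}.issubset(df.columns):
        continue  # file lacks one of the grouping columns: skip it
    frames.append(df[['col3', 'col5']])

combined = pd.concat(frames, ignore_index=True)
g2 = combined.drop_duplicates(['col3', 'col5'])
print(g2.groupby(['col3', 'col5']).size().sum(level=0))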

asked Feb 11 '23 by pam

1 Answer

This is a nice use case for blaze.

Here's an example using a couple of reduced files from the nyctaxi dataset. I've purposely split a single large file into two files of 1,000,000 lines each:

In [16]: from blaze import Data, compute, by

In [17]: ls
trip10.csv  trip11.csv

In [18]: d = Data('*.csv')

In [19]: expr = by(d[['passenger_count', 'medallion']], avg_time=d.trip_time_in_secs.mean())

In [20]: %time result = compute(expr)
CPU times: user 3.22 s, sys: 393 ms, total: 3.61 s
Wall time: 3.6 s

In [21]: !du -h *
194M    trip10.csv
192M    trip11.csv

In [22]: len(d)
Out[22]: 2000000

In [23]: result.head()
Out[23]:
   passenger_count                         medallion  avg_time
0                0  08538606A68B9A44756733917323CE4B         0
1                0  0BB9A21E40969D85C11E68A12FAD8DDA        15
2                0  9280082BB6EC79247F47EB181181D1A4         0
3                0  9F4C63E44A6C97DE0EF88E537954FC33         0
4                0  B9182BF4BE3E50250D3EAB3FD790D1C9        14

Note: This will perform the computation with pandas, using pandas' own chunked CSV reader. If your files are in the GB range, you're better off converting to a format such as bcolz or PyTables, as these are binary formats designed for data analysis on huge files. CSVs are just blobs of text with conventions.
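If you do convert, the odo library (from the same ecosystem as blaze) can handle the migration in one call; this is a minimal sketch, assuming odo is installed and reusing the trip10.csv file from above:

from odo import odo

# one-time conversion: parse the CSV and write a bcolz ctable to disk
odo('trip10.csv', 'trip10.bcolz')

# later queries can point blaze at the binary ctable instead of re-parsing text
from blaze import Data
d = Data('trip10.bcolz')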

answered Feb 13 '23 by Phillip Cloud