I have multiple huge tsv files that I'm trying to process using pandas. I want to group by 'col3' and 'col5'. I've tried this:
import pandas as pd

df = pd.read_csv('filename.txt', sep='\t')
g2 = df.drop_duplicates(['col3', 'col5'])
g3 = g2.groupby(['col3', 'col5']).size().sum(level=0)
print(g3)
It works fine so far and prints an output like this:
yes 2
no 2
I'd like to be able to aggregate the output from multiple files, i.e., to group by these two columns across all the files at once and print one combined output with the total number of occurrences of 'yes', 'no', or whatever the attribute may be. In other words, I'd now like to use groupby on multiple files at once. If a file doesn't have one of these columns, it should be skipped and the process should move on to the next file.
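Conceptually, something along these lines is what I have in mind (a rough sketch; the '*.txt' glob pattern is just a placeholder for my files):

import glob
import pandas as pd

total = None
for path in glob.glob('*.txt'):  # placeholder pattern for the input files
    df = pd.read_csv(path, sep='\t')
    if not {'col3', 'col5'}.issubset(df.columns):
        continue  # file lacks one of the columns, skip it
    counts = (df.drop_duplicates(['col3', 'col5'])
                .groupby(['col3', 'col5'])
                .size()
                .groupby(level=0)  # same result as .sum(level=0) above
                .sum())
    total = counts if total is None else total.add(counts, fill_value=0)
print(total)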
This is a nice use case for blaze.
Here's an example using a couple of reduced files from the nyctaxi dataset. I've purposely split a single large file into two files of 1,000,000 lines each:
In [16]: from blaze import Data, compute, by
In [17]: ls
trip10.csv trip11.csv
In [18]: d = Data('*.csv')
In [19]: expr = by(d[['passenger_count', 'medallion']], avg_time=d.trip_time_in_secs.mean())
In [20]: %time result = compute(expr)
CPU times: user 3.22 s, sys: 393 ms, total: 3.61 s
Wall time: 3.6 s
In [21]: !du -h *
194M trip10.csv
192M trip11.csv
In [22]: len(d)
Out[22]: 2000000
In [23]: result.head()
Out[23]:
passenger_count medallion avg_time
0 0 08538606A68B9A44756733917323CE4B 0
1 0 0BB9A21E40969D85C11E68A12FAD8DDA 15
2 0 9280082BB6EC79247F47EB181181D1A4 0
3 0 9F4C63E44A6C97DE0EF88E537954FC33 0
4 0 B9182BF4BE3E50250D3EAB3FD790D1C9 14
Note: This will perform the computation with pandas, using pandas' own chunked CSV reader. If your files are in the GB range you're better off converting to a format such as bcolz or PyTables, as these are binary formats designed for data analysis on huge files. CSVs are just blobs of text with conventions.
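For example, one way to do the CSV-to-PyTables conversion is with pandas' chunked reader and an HDFStore; this is only a sketch, with the file names and chunk size as placeholders:

import pandas as pd

with pd.HDFStore('trips.h5') as store:
    for chunk in pd.read_csv('trip10.csv', chunksize=500000):
        # append each chunk to an on-disk, queryable PyTables table
        store.append('trips', chunk, data_columns=True)

Depending on your string columns you may need to pass min_itemsize to append. Blaze can then be pointed at the resulting binary store instead of a glob of CSVs.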