group_id, application_id, reading
and the data could look like:
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)
The file should be split based on group_id, so the output should be n files, where n is the number of distinct group_id values.
Output
File 1
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
and
File 2
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
and
File n
n, x, 0.3 (let's say)
How can I do this effectively?
Normally, CSV files use a comma to separate each specific data value. Here's what that structure looks like:
column 1 name,column 2 name,column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3
...
Notice how each piece of data is separated by a comma.
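As a quick sanity check, a file like that can be read with Python's csv module. Here is a minimal sketch; the filename foo.csv and the skipinitialspace option are assumptions (the latter simply drops the spaces that follow the commas in the sample data):

import csv

# Each row comes back as a list of strings, e.g. ["1", "a1", "0.1"].
with open("foo.csv", newline="") as f:
    for group_id, application_id, reading in csv.reader(f, skipinitialspace=True):
        print(group_id, application_id, reading)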
If the file is already sorted by group_id, you can do something like:
import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
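Note that groupby only batches consecutive rows that share the same key, which is why the file needs to be sorted by group_id first. If it isn't, a minimal sketch (assuming the whole file fits in memory) is to sort the rows before grouping:

import csv
from itertools import groupby

# Sort in memory so rows with the same group_id become consecutive,
# then split exactly as above.
with open("foo.csv", newline="") as f:
    rows = sorted(csv.reader(f), key=lambda row: row[0])

for key, group in groupby(rows, lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in group:
            output.write(",".join(row) + "\n")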
awk is capable of this too:
awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv
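For comparison, the same single-pass idea could be sketched in Python as well, keeping one open handle per group_id. This assumes the number of distinct groups stays well below the OS limit on open files; the FILE prefix just mirrors the awk example above:

import csv

# One pass over the input, appending each row to FILE<group_id>.
handles = {}
try:
    with open("HUGE.csv", newline="") as f:
        for row in csv.reader(f):
            key = row[0]
            if key not in handles:
                handles[key] = open("FILE" + key, "w")
            handles[key].write(",".join(row) + "\n")
finally:
    for h in handles.values():
        h.close()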
Sed one-liner:
sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile
The only downside is that you need to put in n -e statements (represented here by the ellipsis, which shouldn't appear in the final version), so this one-liner might be a pretty long line.
The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!
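If n is large, the -e clauses could be generated rather than typed by hand. Here is a rough sketch; the value of n and the assumption that groups are numbered consecutively 1..n are placeholders:

# Print a sed command with one -e clause per group (n is a placeholder value).
n = 10
clauses = " ".join("-e '/^%d,/wFile%d'" % (i, i) for i in range(1, n + 1))
print("sed " + clauses + " OriginalFile")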