Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split a huge csv file based on content of first column?

  • I have a 250MB+ huge csv file to upload
  • file format is group_id, application_id, reading and data could look like
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3(lets say)  
  • I want to divide the file based on group_id, so output should be n files where n=group_id

Output

File 1

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9

and

File2
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....

and

File n
n, x, 0.3(lets say)  

How can I do this effectively?

like image 379
daydreamer Avatar asked Feb 28 '12 20:02

daydreamer


People also ask

How does the CSV library separate the data within the columns?

Normally, CSV files use a comma to separate each specific data value. Here's what that structure looks like: column 1 name,column 2 name, column 3 name first row data 1,first row data 2,first row data 3 second row data 1,second row data 2,second row data 3 ... Notice how each piece of data is separated by a comma.


3 Answers

If the file is already sorted by group_id, you can do something like:

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
like image 141
Fred Foo Avatar answered Oct 18 '22 19:10

Fred Foo


awk is capable:

 awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv
like image 33
Zsolt Botykai Avatar answered Oct 18 '22 18:10

Zsolt Botykai


Sed one-liner:

sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile 

The only down-side is that you need to put in n -e statements (represented by the ellipsis, which shouldn't appear in the final version). So this one-liner might be a pretty long line.

The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no python is needed. Plus, it's a one-freaking-liner!

like image 34
Mike Avatar answered Oct 18 '22 18:10

Mike