group_id, application_id, reading
and the data could look like:
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)
The file should be split based on group_id, so the output should be n files, where n is the number of distinct group_id values.
Output
File 1
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
and
File 2
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
and
File n
n, x, 0.3 (let's say)
How can I do this effectively?
Normally, CSV files use a comma to separate each specific data value. Here's what that structure looks like:
column 1 name,column 2 name,column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3
...
Notice how each piece of data is separated by a comma.
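As a quick sanity check, a file like that can be read with Python's csv module. Here is a minimal sketch; the filename foo.csv and the skipinitialspace option are assumptions (the latter simply drops the spaces that follow the commas in the sample data):

import csv

# Each row comes back as a list of strings, e.g. ["1", "a1", "0.1"].
with open("foo.csv", newline="") as f:
    for group_id, application_id, reading in csv.reader(f, skipinitialspace=True):
        print(group_id, application_id, reading)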
If the file is already sorted by group_id, you can do something like:
import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
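Note that groupby only batches consecutive rows that share the same key, which is why the file needs to be sorted by group_id first. If it isn't, a minimal sketch (assuming the whole file fits in memory) is to sort the rows before grouping:

import csv
from itertools import groupby

# Sort in memory so rows with the same group_id become consecutive,
# then split exactly as above.
with open("foo.csv", newline="") as f:
    rows = sorted(csv.reader(f), key=lambda row: row[0])

for key, group in groupby(rows, lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in group:
            output.write(",".join(row) + "\n")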
awk is capable of this too:
awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv
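For comparison, the same single-pass idea could be sketched in Python as well, keeping one open handle per group_id. This assumes the number of distinct groups stays well below the OS limit on open files; the FILE prefix just mirrors the awk example above:

import csv

# One pass over the input, appending each row to FILE<group_id>.
handles = {}
try:
    with open("HUGE.csv", newline="") as f:
        for row in csv.reader(f):
            key = row[0]
            if key not in handles:
                handles[key] = open("FILE" + key, "w")
            handles[key].write(",".join(row) + "\n")
finally:
    for h in handles.values():
        h.close()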
Sed one-liner:
sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile
The only downside is that you need to put in n -e statements (represented here by the ellipsis, which shouldn't appear in the final version), so this one-liner might be a pretty long line.
The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!
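If n is large, the -e clauses could be generated rather than typed by hand. Here is a rough sketch; the value of n and the assumption that groups are numbered consecutively 1..n are placeholders:

# Print a sed command with one -e clause per group (n is a placeholder value).
n = 10
clauses = " ".join("-e '/^%d,/wFile%d'" % (i, i) for i in range(1, n + 1))
print("sed " + clauses + " OriginalFile")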