 

How do I slice a single CSV file into several smaller ones grouped by a field?

Tags:

python

csv

I have a large data set from the World Bank Millennium Development Goals as a CSV. The data looks like this:

Country Code   Country Name   Indicator
ABW            Aruba          % Forest coverage
ADO            Andorra        % Forest coverage
AFG            Afghanistan    % Forest coverage
...
ABW            Aruba          % Literacy rate
ADO            Andorra        % Literacy rate
AFG            Afghanistan    % Literacy rate
...
ABW            Aruba          % Another indicator
ADO            Andorra        % Another indicator
AFG            Afghanistan    % Another indicator

The file is currently 8.2 MB. I'm going to build a web interface for this data, and I'd like to slice the data by country so I can make an AJAX request to load an individual CSV for each country.

I'm lost on how to do this programmatically or with any tool. I don't necessarily need Python, but it's what I know best. I don't need a complete script; a general pointer on how to approach this problem is appreciated.

The original data source I'm working with is located here:

http://duopixel.com/stack/data.csv

asked Dec 22 '22 by methodofaction


2 Answers

One-liner:

awk -F "," 'NF>1 && NR>1 {print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

This creates new files named data_ABW.csv, etc., containing the appropriate rows. The NR>1 part skips the header line. Then, for each line, it appends the entire line ($0) to the file named data_$1.csv, where $1 is replaced with the text in the first column of that line. Finally, the close statement makes sure there aren't too many files open at once. If you didn't have so many countries, you could drop it and significantly speed up the command.

In answer to @Lenwood's comment below, to include the header in each output file, you can do this:

awk -F "," 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$1]) {print header >> ("data_" $1 ".csv"); files[$1]=1}; print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

(You may have to escape the exclamation point...) The first new part, NR==1 {header=$0};, just stores the first line of the input file in the variable header. The other new part, if(! files[$1]) ... files[$1]=1};, uses the associative array files to keep track of whether the header has already been written to a given output file, and writes it first if it hasn't.
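
Since the question mentions Python, here is a minimal sketch of the same idea in Python (2.7-style, to match the other answer). The data.csv input name, the data_<code>.csv output naming, and the assumption that the country code is the first column are all carried over from the awk examples, not guaranteed by the question:

import csv

# Plays the role of awk's files[] associative array: which country codes
# have already had the header written to their output file.
seen = set()

with open('data.csv', 'rb') as fin:
    reader = csv.reader(fin)
    header = next(reader)                     # NR==1 {header=$0}
    for row in reader:
        if not row:                           # skip blank lines (NF>1)
            continue
        out_name = 'data_{}.csv'.format(row[0])
        with open(out_name, 'ab') as fout:    # reopen per row, like awk's close()
            writer = csv.writer(fout)
            if row[0] not in seen:            # first time we see this code
                writer.writerow(header)
                seen.add(row[0])
            writer.writerow(row)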

Note that all of these append to the output files, so if the files already exist, rows will just be added to them. Therefore, if you get new data in your main file, you'll probably want to delete the per-country files before you run the command again.
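
If you want to script that cleanup too, something like this should do it (assuming all of the generated files match the data_*.csv pattern used above):

import glob, os

# Remove previously generated per-country files before regenerating them.
for old in glob.glob('data_*.csv'):
    os.remove(old)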

(In case it's not obvious, if you want the files to be named like data_Aruba.csv you can change $1 to $2.)

answered May 15 '23 by Mike


You can use Python's csv module and itertools.groupby.
The following example was tested on Python 2.7.1.
Edit: updated answer to account for new information added to the question.

import csv, itertools as it, operator as op

csv_contents = []
with open('yourfile.csv', 'rb') as fin:
  dict_reader = csv.DictReader(fin)   # default delimiter is comma
  fieldnames = dict_reader.fieldnames # save for writing
  for line in dict_reader:            # read in all of your data
    csv_contents.append(line)         # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value 
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('Country Name'))

for groupkey, groupdata in it.groupby(sorted_csv_contents, 
                                      key=op.itemgetter('Country Name')):
  with open('slice_{:s}.csv'.format(groupkey), 'wb') as fou:
    dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
    dict_writer.writeheader()         # new method in 2.7; use writerow() in 2.6-
    dict_writer.writerows(groupdata)

Other notes:

  • You could use a regular csv reader and writer, but the DictReader and DictWriter are nice because you can reference columns by name.
  • Always use the 'b' flag when reading or writing .csv files because on Windows that makes a difference in how line-endings are handled.
  • If anything isn't clear let me know and I'll explain further!
answered May 15 '23 by mechanical_meat