Split big csv file by the value of a column in python

Tags: python, pandas

I have a large CSV file that I cannot handle in memory with Python. I am splitting it into multiple chunks after grouping by the value of a specific column, using the following logic:

# assumes `import csv` and `from itertools import groupby` at module level
def splitDataFile(self, data_file):

    self.list_of_chunk_names = []
    csv_reader = csv.reader(open(data_file, "rb"), delimiter="|")
    columns = csv_reader.next()

    # groupby only groups *consecutive* rows, so the file is assumed to be
    # already sorted/grouped by the column in row[1]
    for key, rows in groupby(csv_reader, lambda row: row[1]):
        file_name = "data_chunk"+str(key)+".csv"
        self.list_of_chunk_names.append(file_name)

        with open(file_name, "w") as output:
            output.write("|".join(columns)+"\n")
            for row in rows:
                output.write("|".join(row)+"\n")

    print "message: list of chunks ", self.list_of_chunk_names

    return

The logic is working, but it's slow. I am wondering how I can optimize this, for instance with pandas?

Edit

Further explanation: I am not looking for a simple split into equal-size chunks (e.g. 1000 rows each). I want to split by the value of a column, which is why I am using groupby. For example, all rows whose grouping column holds one value should end up in one chunk file, and all rows with another value in a different one.

Asked by Mohamed Ali JAMAOUI on Nov 09 '15.

3 Answers

Use this Python 3 program:

 #!/usr/bin/env python3
 import binascii
 import csv
 import os.path
 import sys
 from tkinter.filedialog import askopenfilename, askdirectory
 from tkinter.simpledialog import askinteger

 def split_csv_file(f, dst_dir, keyfunc):
     # keep one csv.writer per key so each output file is opened only once
     csv_reader = csv.reader(f)
     csv_writers = {}
     for row in csv_reader:
         k = keyfunc(row)
         if k not in csv_writers:
             csv_writers[k] = csv.writer(open(os.path.join(dst_dir, k),
                                              mode='w', newline=''))
         csv_writers[k].writerow(row)

 def get_args_from_cli():
     input_filename = sys.argv[1]
     column = int(sys.argv[2])
     dst_dir = sys.argv[3]
     return (input_filename, column, dst_dir)

 def get_args_from_gui():
     input_filename = askopenfilename(
         filetypes=(('CSV', '.csv'),),
         title='Select CSV Input File')
     column = askinteger('Choose Table Column', 'Table column')
     dst_dir = askdirectory(title='Select Destination Directory')
     return (input_filename, column, dst_dir)

 if __name__ == '__main__':
     if len(sys.argv) == 1:
         input_filename, column, dst_dir = get_args_from_gui()
     elif len(sys.argv) == 4:
         input_filename, column, dst_dir = get_args_from_cli()
     else:
         raise Exception("Invalid number of arguments")
     with open(input_filename, mode='r', newline='') as f:
         split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')
         # if the column has funky values resulting in invalid filenames
         # replace the line from above with:
         # split_csv_file(f, dst_dir, lambda r: binascii.b2a_hex(r[column-1].encode('utf-8')).decode('utf-8')+'.csv')

Save it as split-csv.py and run it from Explorer or from the command line.

For example, to split data.csv based on column 1 and write the output files under dstdir, use:

 python split-csv.py data.csv 1 dstdir

If you run it without arguments, a Tkinter-based GUI will prompt you for the input file, the column (1-based index), and the destination directory.


Answered by Assem.

I ended up going with something like the following, where I iterate over the unique values of the column to split by and use them to filter the data in chunks:

import pandas as pd

def splitWithPandas(data_file, split_by_column):
    # read only the split column to get its unique values (pd.unique deduplicates)
    values_to_split_by = pd.read_csv(data_file, delimiter="|", usecols=[split_by_column])
    values_to_split_by = pd.unique(values_to_split_by.values.ravel())

    for i in values_to_split_by:
        # re-read the file in chunks and keep only the rows matching the current value
        iter_csv = pd.read_csv(data_file, delimiter="|", chunksize=100000)
        df = pd.concat([chunk[chunk[split_by_column] == i] for chunk in iter_csv])
        df.to_csv("data_chunk_" + str(i), sep="|", index=False)

Answered by Mohamed Ali JAMAOUI.


I suspect that your biggest bottleneck is opening and closing a file handle every time you process a new block of rows. A better approach, as long as the number of files you write to is not too large, is to keep all the files open. Here's an outline:

def splitDataFile(self, data_file):
    open_files = dict()
    input_file = open(data_file, "rb")
    try:
        ...
        csv_reader = csv.reader(input_file, ...)
        ...
        for key, rows in groupby(csv_reader, lambda row: (row[1])):
            ...
            try:
                output = open_files[key]
            except KeyError:
                # remember the new handle so later groups with the same key reuse it
                output = open_files[key] = open(file_name, "w")
            output.write(...)
            ...
    finally:
        for open_file in open_files.itervalues():
            open_file.close()
        input_file.close()

Of course, if you only have one group with any given key, this will not help. (Actually it may make things worse, because you wind up holding a bunch of files open unnecessarily.) The more often you wind up writing to a single file, the more of a benefit you'll get from this change.
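For reference, a complete version of this outline, adapted from the question's original code (same `|` delimiter, grouping on the second column, written for Python 3 here), might look roughly like this:

import csv
from itertools import groupby

def split_data_file(data_file):
    open_files = {}          # key -> output file handle, kept open for reuse
    list_of_chunk_names = []
    try:
        with open(data_file, newline="") as input_file:
            csv_reader = csv.reader(input_file, delimiter="|")
            columns = next(csv_reader)
            for key, rows in groupby(csv_reader, lambda row: row[1]):
                if key not in open_files:
                    # first time we see this key: open its file and write the header
                    file_name = "data_chunk" + str(key) + ".csv"
                    open_files[key] = open(file_name, "w")
                    open_files[key].write("|".join(columns) + "\n")
                    list_of_chunk_names.append(file_name)
                output = open_files[key]
                for row in rows:
                    output.write("|".join(row) + "\n")
    finally:
        for open_file in open_files.values():
            open_file.close()
    return list_of_chunk_names

Each output file is opened once, gets the header once, and is closed in the finally block even if something fails midway.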

You can combine this with pandas, if you want, and use the chunking features of read_csv or read_table to handle the input processing.
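A minimal sketch of that combination, assuming the same `|`-delimited input and a `split_by_column` column name as in the earlier answer, could read the file once in chunks and append each group to its own output file:

import pandas as pd

def split_with_pandas_chunks(data_file, split_by_column, chunksize=100000):
    seen = set()  # keys whose output file already exists and has a header
    for chunk in pd.read_csv(data_file, delimiter="|", chunksize=chunksize):
        for key, group in chunk.groupby(split_by_column):
            file_name = "data_chunk_" + str(key) + ".csv"
            first_time = key not in seen
            # overwrite with a header on first sight of a key, append without one afterwards
            group.to_csv(file_name, sep="|", index=False,
                         mode="w" if first_time else "a",
                         header=first_time)
            seen.add(key)

Unlike the approach that re-reads the whole file once per distinct value, this makes a single pass over the input, at the cost of appending to each output file multiple times.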

Answered by David Z.