Split big csv file by the value of a column in python

Tags:

python

pandas

I have a large CSV file that I cannot load into memory with Python. I am splitting it into multiple chunks, grouped by the value of a specific column, using the following logic:

def splitDataFile(self, data_file):

    self.list_of_chunk_names = []
    csv_reader = csv.reader(open(data_file, "rb"), delimiter="|")
    columns = csv_reader.next()

    for key,rows in groupby(csv_reader, lambda row: (row[1])):
        file_name = "data_chunk"+str(key)+".csv"
        self.list_of_chunk_names.append(file_name)

        with open(file_name, "w") as output:
            output.write("|".join(columns)+"\n")
            for row in rows:
                output.write("|".join(row)+"\n")

    print "message: list of chunks ", self.list_of_chunk_names

    return

The logic works, but it's slow. How can I optimize it? For instance, with pandas?

Edit

Further explanation: I am not looking for a simple split into equal-sized chunks (for example, 1000 rows each). I want to split by the value of a column, which is why I am using groupby.


asked Nov 09 '15 12:11

3 Answers

Use this Python 3 program:

 #!/usr/bin/env python3
 import binascii
 import csv
 import os.path
 import sys
 from tkinter.filedialog import askopenfilename, askdirectory
 from tkinter.simpledialog import askinteger

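 def split_csv_file(f, dst_dir, keyfunc):
     # Route each row to the output file named by keyfunc(row),
     # opening every output file once and keeping it open until the end.
     csv_reader = csv.reader(f)
     out_files = {}
     csv_writers = {}
     try:
         for row in csv_reader:
             k = keyfunc(row)
             if k not in csv_writers:
                 out_files[k] = open(os.path.join(dst_dir, k),
                                     mode='w', newline='')
                 csv_writers[k] = csv.writer(out_files[k])
             csv_writers[k].writerow(row)
     finally:
         for out_file in out_files.values():
             out_file.close()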

 def get_args_from_cli():
     input_filename = sys.argv[1]
     column = int(sys.argv[2])
     dst_dir = sys.argv[3]
     return (input_filename, column, dst_dir)

 def get_args_from_gui():
     input_filename = askopenfilename(
         filetypes=(('CSV', '.csv'),),
         title='Select CSV Input File')
     column = askinteger('Choose Table Column', 'Table column')
     dst_dir = askdirectory(title='Select Destination Directory')
     return (input_filename, column, dst_dir)

 if __name__ == '__main__':
     if len(sys.argv) == 1:
         input_filename, column, dst_dir = get_args_from_gui()
     elif len(sys.argv) == 4:
         input_filename, column, dst_dir = get_args_from_cli()
     else:
         raise Exception("Invalid number of arguments")
     with open(input_filename, mode='r', newline='') as f:
         split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')
         # if the column has funky values resulting in invalid filenames
         # replace the line from above with:
         # split_csv_file(f, dst_dir, lambda r: binascii.b2a_hex(r[column-1].encode('utf-8')).decode('utf-8')+'.csv')

Save it as split-csv.py and run it from Explorer or from the command line.

For example, to split data.csv on column 1 and write the output files under dstdir, use:

 python split-csv.py data.csv 1 dstdir

If you run it without arguments, a Tkinter-based GUI will prompt you to choose the input file, the column (1-based index) and the destination directory.
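The commented-out alternative in the script hex-encodes the column value so that characters such as "/" cannot produce an invalid filename. As a minimal sketch of that idea in isolation (the helper names `safe_chunk_name` and `original_value` are made up for illustration and are not part of the script):

```python
import binascii


def safe_chunk_name(value):
    """Hex-encode a column value so it is always usable as a filename."""
    return binascii.b2a_hex(value.encode("utf-8")).decode("utf-8") + ".csv"


def original_value(chunk_name):
    """Recover the column value from a name produced by safe_chunk_name."""
    hex_part = chunk_name[:-len(".csv")]
    return binascii.a2b_hex(hex_part).decode("utf-8")


print(safe_chunk_name("a/b"))        # 612f62.csv
print(original_value("612f62.csv"))  # a/b
```

The encoding is reversible, so the original column value can still be recovered from the filename, at the cost of names that are not human-readable.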



answered Oct 21 '22 06:10

I suspect that your biggest bottleneck is opening and closing a file handle every time you process a new block of rows. A better approach, as long as the number of files you write to is not too large, is to keep all the files open. Here's an outline:

def splitDataFile(self, data_file):
    open_files = dict()
    input_file = open(data_file, "rb")
    try:
        ...
        csv_reader = csv.reader(input_file, ...)
        ...
        for key, rows in groupby(csv_reader, lambda row: (row[1])):
            ...
            try:
                output = open_files[key]
            except KeyError:
                output = open(file_name, "w")
                # remember the handle so later groups with this key reuse it
                open_files[key] = output
            output.write(...)
            ...
    finally:
        for open_file in open_files.itervalues():
            open_file.close()
        input_file.close()

Of course, if you only have one group with any given key, this will not help. (Actually it may make things worse, because you wind up holding a bunch of files open unnecessarily.) The more often you wind up writing to a single file, the more of a benefit you'll get from this change.
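For reference, here is one way the outline might be filled in for Python 3. It is only a sketch under some assumptions: the pipe delimiter, the header row and the key in the second column are taken from the question, while the function name `split_keeping_files_open` and the parameters `key_index` and `out_prefix` are invented for this example.

```python
import csv
from itertools import groupby
from operator import itemgetter


def split_keeping_files_open(data_file, key_index=1, delimiter="|",
                             out_prefix="data_chunk"):
    """Write each run of rows to the file for its key, opening each file once."""
    open_files = {}  # key value -> open output file
    try:
        with open(data_file, newline="") as input_file:
            csv_reader = csv.reader(input_file, delimiter=delimiter)
            columns = next(csv_reader)  # assumes the first row is a header
            # groupby only groups *consecutive* rows, so the same key may
            # show up several times; the dict lets us reuse its file.
            for key, rows in groupby(csv_reader, key=itemgetter(key_index)):
                if key not in open_files:
                    output = open(out_prefix + str(key) + ".csv", "w", newline="")
                    open_files[key] = output
                    csv.writer(output, delimiter=delimiter).writerow(columns)
                writer = csv.writer(open_files[key], delimiter=delimiter)
                writer.writerows(rows)
    finally:
        for open_file in open_files.values():
            open_file.close()
    return sorted(f.name for f in open_files.values())
```

Because each file is opened exactly once in "w" mode, a key that reappears later in an unsorted input is appended to its existing file rather than overwriting it. The number of distinct keys is still bounded by the operating system's limit on open files.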

You can combine this with pandas, if you want, and use the chunking features of read_csv or read_table to handle the input processing.
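A rough sketch of what the pandas variant could look like, assuming pandas is installed and the file is pipe-delimited with a header row as in the question. The function name `split_with_pandas`, the `out_dir` parameter and the default chunk size are illustrative choices, not something the answer prescribes. Unlike the outline above, this version reopens an output file in append mode for each chunk instead of holding handles open.

```python
import os

import pandas as pd


def split_with_pandas(data_file, key_column, out_dir, sep="|", chunksize=100_000):
    """Read the input in chunks and append each group to its own file."""
    written = set()  # output paths that already have a header
    reader = pd.read_csv(data_file, sep=sep, chunksize=chunksize,
                         dtype=str, keep_default_na=False)
    for chunk in reader:
        # Only `chunksize` rows are held in memory at any time.
        for key, group in chunk.groupby(key_column, sort=False):
            path = os.path.join(out_dir, "data_chunk" + str(key) + ".csv")
            first_time = path not in written
            group.to_csv(path, sep=sep, index=False,
                         mode="w" if first_time else "a",
                         header=first_time)
            written.add(path)
    return sorted(written)
```

Reading everything as strings (`dtype=str`, `keep_default_na=False`) keeps the values as they appear in the input, so the output files are not altered by type inference. Whether this is faster than the pure `csv` approach depends on the data and is worth measuring.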

answered Oct 21 '22 07:10

David Z



Mohamed Ali JAMAOUI
