
Most efficient way to parse a large .csv in python?

Tags:

python

csv

I tried looking at other answers but I am still not sure of the right way to do this. I have a number of really large .csv files (each could be a gigabyte), and I want to first get their column labels, because they are not all the same, and then, according to user preference, extract some of those columns using some criteria. Before I start the extraction part I did a simple test to see what the fastest way to parse these files is, and here is my code:

import time
import mmap
import csv

def mmapUsage():
    start=time.time()
    with open("csvSample.csv", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mapInput = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        L=list()
        for s in iter(mapInput.readline, ""):
            L.append(s)
        print "List length: ", len(L)
        #print "Sample element: ", L[1]
        mapInput.close()
        end=time.time()
        print "Time for completion", end-start

def fileopenUsage():
    start=time.time()
    fileInput=open("csvSample.csv")
    M=list()
    for s in fileInput:
        M.append(s)
    print "List length: ",len(M)
    #print "Sample element: ",M[1]
    fileInput.close()
    end=time.time()
    print "Time for completion",end-start

def readAsCsv():
    X=list()
    start=time.time()
    spamReader = csv.reader(open('csvSample.csv', 'rb'))
    for row in spamReader:
        X.append(row)
    print "List length: ",len(X)
    #print "Sample element: ",X[1]
    end=time.time()
    print "Time for completion",end-start

And my results:

=======================
Populating list from Mmap
List length:  1181220
Time for completion 0.592000007629

=======================
Populating list from Fileopen
List length:  1181220
Time for completion 0.833999872208

=======================
Populating list by csv library
List length:  1181220
Time for completion 5.06700015068

So it seems that the csv library most people use is really a lot slower than the others. Maybe it will prove faster later, once I start extracting data from the csv file, but I cannot be sure of that yet. Any suggestions and tips before I start implementing? Thanks a lot!

asked Jan 31 '12 by spagi

People also ask

How do I read a large CSV dataset in Python?

read_csv(chunksize): one way to process large files is to read the entries in chunks of a reasonable size, so that each chunk is read into memory and processed before the next chunk is read. The chunksize parameter specifies the size of a chunk as a number of lines.
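
As a rough sketch of that approach (assuming pandas is what the snippet above means by read_csv, and picking an arbitrary chunk size of 100,000 rows):

import pandas as pd

# process the file 100,000 rows at a time instead of loading it all at once
total_rows = 0
for chunk in pd.read_csv("csvSample.csv", chunksize=100000):
    # each chunk is an ordinary DataFrame; work on it, then let it go
    total_rows += len(chunk)

print(total_rows)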


2 Answers

As pointed out several times already, the first two methods do no actual string parsing; they just read a line at a time without extracting fields. I imagine the majority of the speed difference you see with csv is due to that.

The csv module is invaluable if your data includes any text that may use more of the 'standard' CSV syntax than just commas, especially if you're reading an Excel-style format.

If you've just got lines like "1,2,3,4" you're probably fine with a simple split, but if you have lines like "1,2,'Hello, my name\'s fred'" you're going to go crazy trying to parse that without errors.

CSV will also transparently handle things like newlines in the middle of a quoted string. A simple for..in without CSV is going to have trouble with that.
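
For instance, a quick sketch of the difference on a single line whose quoted field contains a comma (csv.reader happily accepts any iterable of lines, so a one-element list is enough for a demo):

import csv

line = '1,2,"Hello, my name is fred",4'

# csv.reader parses the quoted field correctly
for row in csv.reader([line]):
    print(row)          # ['1', '2', 'Hello, my name is fred', '4']

# a naive split breaks the quoted field apart
print(line.split(','))  # ['1', '2', '"Hello', ' my name is fred"', '4']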

The CSV module has always worked fine for me reading unicode strings if I use it like so:

f = csv.reader(codecs.open(filename, 'rU'))
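
Spelled out a bit more, a minimal sketch of that pattern with its imports (assuming the same csvSample.csv from the question; 'rU' is just universal-newline reading):

import codecs
import csv

# open with universal newlines and hand the file object straight to csv.reader
rows = []
for row in csv.reader(codecs.open('csvSample.csv', 'rU')):
    # each row arrives as a list of already-unquoted field strings
    rows.append(row)

print(len(rows))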

It is plenty robust for importing multi-thousand-line files with unicode, quoted strings, newlines in the middle of quoted strings, lines with fields missing at the end, etc., all with reasonable read times.

I'd try using it first and only look for optimizations on top of it if you really need the extra speed.

answered by Michael Pratt


How much do you care about sanitization?

The csv module is really good at understanding different csv file dialects and ensuring that escaping happens properly, but it's definitely overkill and can often be way more trouble than it's worth (especially if you have unicode!).

A really naive implementation that properly handles backslash-escaped commas (\,) would be:

import re

def read_csv_naive(file_str):
    # split each line on commas that are not preceded by a backslash
    with open(file_str, 'r') as file_obj:
        return [re.split(r'(?<!\\),', line.rstrip('\n')) for line in file_obj]

If your data is simple this will work great. If you have data that might need more escaping, the csv module is probably your most stable bet.
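
For example (purely illustrative; sample.csv here is a made-up file containing a backslash-escaped comma):

# write a tiny file with a backslash-escaped comma, then read it back
# with read_csv_naive() defined above
with open('sample.csv', 'w') as f:
    f.write('1,2,Hello\\, world,4\n')

print(read_csv_naive('sample.csv'))
# [['1', '2', 'Hello\\, world', '4']]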

answered by mvanveen