
Python: Analysis on CSV files 100,000 lines x 40 columns

Tags: python, numpy

I have about 100 CSV files, each 100,000 rows by 40 columns. I'd like to do some statistical analysis on them: pull out some sample data, plot general trends, do variance and R-squared analysis, and plot some spectra diagrams. For now, I'm considering numpy for the analysis.

I was wondering what issues I should expect with such large files? I've already checked for erroneous data. What are your recommendations on doing the statistical analysis? Would it be better if I just split the files and did the whole thing in Excel?
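For concreteness, here is roughly the kind of per-file computation I have in mind (a rough numpy sketch; the file name and column choices are placeholders, and it assumes the files are purely numeric with no header row):

import numpy as np

# Load one file; assumes purely numeric data with no header row
data = np.loadtxt("file001.csv", delimiter=",")   # shape (100000, 40)

sample = data[::1000]           # pull out every 1000th row as sample data
variances = data.var(axis=0)    # per-column variance

# R-squared of a simple linear fit between two (placeholder) columns
x, y = data[:, 0], data[:, 1]
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()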

asked Jan 26 '10 by dassouki



2 Answers

I've found that Python + CSV is probably the fastest and simplest way to do some kinds of statistical processing.

We do a fair amount of reformatting and correcting for odd data errors, so Python helps us.

The availability of Python's functional programming features makes this particularly simple. You can do sampling with tools like this.

import csv

def someStatFunction(source):
    # Stream rows one at a time so the whole file never sits in memory
    for row in source:
        ...  # per-row statistical processing goes here

def someFilterFunction(source):
    # Generator that passes through only the rows matching some predicate
    for row in source:
        if someFunction(row):  # someFunction is a placeholder predicate
            yield row

# All rows
with open("someFile", newline="") as source:
    rdr = csv.reader(source)
    someStatFunction(rdr)

# Filtered by someFilterFunction applied to each row
with open("someFile", newline="") as source:
    rdr = csv.reader(source)
    someStatFunction(someFilterFunction(rdr))

I really like being able to compose more complex functions from simpler functions.
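For example, a systematic-sampling generator composes the same way (a sketch in the same vein; everyNth is an illustrative helper, not something from the standard library):

import csv

def everyNth(source, n):
    # Systematic sample: keep one row out of every n
    for i, row in enumerate(source):
        if i % n == 0:
            yield row

# Sample every 100th row among those that survive the filter
with open("someFile", newline="") as source:
    rdr = csv.reader(source)
    someStatFunction(everyNth(someFilterFunction(rdr), 100))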

answered Nov 15 '22 by S.Lott


For massive datasets you might be interested in ROOT. It can be used to analyze and store petabytes of data very efficiently. It also comes with some basic and more advanced statistics tools.

While it is written to be used with C++, there are also pretty complete Python bindings. They don't make it extremely easy to get direct access to the raw data (e.g. to use it in R or numpy), but it is definitely possible (I do it all the time).
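For example, here is a minimal sketch of the classic PyROOT pattern for loading one CSV file into a TTree (this assumes ROOT is installed with its Python bindings; the file, tree, and branch names are illustrative, and only two of the 40 columns are shown):

import csv
from array import array

import ROOT

f = ROOT.TFile("data.root", "RECREATE")
tree = ROOT.TTree("data", "rows from one CSV file")

# One double-valued branch per column; the real data would have 40
col0 = array("d", [0.0])
col1 = array("d", [0.0])
tree.Branch("col0", col0, "col0/D")
tree.Branch("col1", col1, "col1/D")

with open("someFile.csv", newline="") as source:
    for row in csv.reader(source):
        col0[0] = float(row[0])
        col1[0] = float(row[1])
        tree.Fill()

tree.Write()
f.Close()

Once the data is in a TTree, ROOT's histogramming and fitting tools (or tree.Draw) can work on it directly, and the compressed ROOT file is much smaller than the original CSV.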

answered Nov 15 '22 by Benjamin Bannier