Python: Analysis on CSV files 100,000 lines x 40 columns

Tags: python, numpy

I have about 100 CSV files, each 100,000 rows x 40 columns. I'd like to do some statistical analysis on them: pull out some sample data, plot general trends, do variance and R-squared analysis, and plot some spectra diagrams. For now, I'm considering numpy for the analysis.

What issues should I expect with such large files? I've already checked for erroneous data. What are your recommendations for doing the statistical analysis? Would it be better if I just split the files and did the whole thing in Excel?

asked Jan 26 '10 20:01

dassouki

2 Answers

I've found that Python + CSV is probably the fastest and simplest way to do some kinds of statistical processing.

We do a fair amount of reformatting and correcting for odd data errors, so Python helps us.

Python's functional programming features make this particularly simple. You can do sampling with tools like these.

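import csv

def someStatFunction( source ):
    # Consume rows one at a time, so the whole file is never held in memory.
    for row in source:
        pass  # ...some processing...

def someFilterFunction( source ):
    # Generator: pass along only the rows that someFunction accepts.
    # someFunction is a placeholder for your own row predicate.
    for row in source:
        if someFunction( row ):
            yield row

# All rows
# (Python 3: open in text mode with newline=""; Python 2 used "rb".)
with open( "someFile", newline="" ) as source:
    rdr = csv.reader( source )
    someStatFunction( rdr )

# Filtered by someFilterFunction applied to each row
with open( "someFile", newline="" ) as source:
    rdr = csv.reader( source )
    someStatFunction( someFilterFunction( rdr ) )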
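
As one concrete instance of the filter pattern above, here is a minimal sketch of a random-sampling filter. The name sample_rows, its fraction and seed parameters, and the in-memory stand-in file are all assumptions made for illustration; they are not part of the original answer.

```python
import csv
import io
import random

def sample_rows(source, fraction, seed=0):
    """Yield each row with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the same sample can be reproduced
    for row in source:
        if rng.random() < fraction:
            yield row

def make_fake_file():
    # Stand-in for a real CSV file: 1000 rows, 2 columns.
    # Real code would use open("someFile", newline="") instead.
    return io.StringIO("".join(f"{i},{i * 2}\n" for i in range(1000)))

# Keep roughly 10% of the rows without loading the whole file.
sampled = list(sample_rows(csv.reader(make_fake_file()), fraction=0.1))
print(len(sampled), "rows sampled out of 1000")
```

Because the filter is a generator, it can be dropped in wherever someFilterFunction is used, and only the sampled rows are ever kept in memory.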

I really like being able to compose more complex functions from simpler functions.
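
To illustrate that kind of composition for the variance analysis mentioned in the question, the sketch below chains a column extractor into a one-pass mean/variance calculation (Welford's algorithm, swapped in here as an assumption; the answer does not name a method). The helper names column_values and running_mean_variance and the sample data are hypothetical.

```python
import csv
import io

def column_values(rows, index):
    """Yield one column of each row, converted to float."""
    for row in rows:
        yield float(row[index])

def running_mean_variance(values):
    """One-pass (Welford) computation of count, mean and sample variance."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance

# Stand-in for a real CSV file; real code would use open("someFile", newline="").
fake_file = io.StringIO("1,10\n2,20\n3,30\n4,40\n")

# Compose: reader -> column extractor -> statistic, streaming row by row.
count, mean, variance = running_mean_variance(
    column_values(csv.reader(fake_file), 1)
)
print(count, mean, variance)
```

Each stage only sees one row at a time, so the same pipeline should work unchanged on a 100,000-row file, and a filter can be inserted between the reader and the column extractor.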

answered Nov 15 '22 17:11

S.Lott

For massive datasets you might be interested in ROOT. It can be used to analyze and very effectively store petabytes of data. It also comes with some basic and more advanced statistics tools.

While it is written to be used with C++, there are also pretty complete Python bindings. They don't make it extremely easy to get direct access to the raw data (e.g. to use it in R or numpy) -- but it is definitely possible (I do it all the time).

answered Nov 15 '22 18:11

Benjamin Bannier
