I have about 100 CSV files, each with 100,000 rows and 40 columns. I'd like to do some statistical analysis on them: pull out some sample data, plot general trends, do variance and R-squared analysis, and plot some spectra diagrams. For now, I'm considering NumPy for the analysis.
What issues should I expect with such large files? I've already checked for erroneous data. What are your recommendations for doing the statistical analysis? Would it be better to just split the files and do the whole thing in Excel?
To get the number of rows and columns of a pandas DataFrame df, use len(df.axes[0]) and len(df.axes[1]) respectively; df.shape returns both as a tuple.
Excel has a limit of 32,767 characters per cell, and of 1,048,576 rows and 16,384 columns per sheet. The CSV format itself imposes no such limits, so CSV files can hold many more rows. You can read more about these limits and others in the Microsoft support article on Excel specifications and limits.
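As a rough check before trying Excel, a file can be compared against those figures with the standard csv module. This is only a minimal sketch: `excel_limit_problems` is a hypothetical helper name, the constants assume the numbers quoted above are Excel's per-sheet and per-cell limits, and the in-memory sample stands in for a real file.

```python
import csv
import io

# Excel limits as quoted above (per sheet / per cell); treated as assumptions here.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384
EXCEL_MAX_CELL_CHARS = 32_767


def excel_limit_problems(source):
    """Return a list of reasons the CSV in `source` would not fit in one sheet."""
    rows = 0
    widest = 0
    longest_cell = 0
    # Stream the file row by row so memory use stays flat.
    for row in csv.reader(source):
        rows += 1
        widest = max(widest, len(row))
        longest_cell = max(longest_cell, max((len(cell) for cell in row), default=0))

    problems = []
    if rows > EXCEL_MAX_ROWS:
        problems.append(f"{rows} rows exceeds {EXCEL_MAX_ROWS}")
    if widest > EXCEL_MAX_COLS:
        problems.append(f"{widest} columns exceeds {EXCEL_MAX_COLS}")
    if longest_cell > EXCEL_MAX_CELL_CHARS:
        problems.append(f"a cell of {longest_cell} characters exceeds {EXCEL_MAX_CELL_CHARS}")
    return problems


# Hypothetical, scaled-down stand-in for one of the files in the question.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
print(excel_limit_problems(sample))
```

By these numbers a single 100,000 x 40 file fits in one sheet, but the 100 files concatenated (about 10,000,000 rows) would not.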
I've found that Python + CSV is probably the fastest and simplest way to do some kinds of statistical processing.
We do a fair amount of reformatting and correcting for odd data errors, so Python helps us.
The availability of Python's functional programming features makes this particularly simple. You can do sampling with tools like this.
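    import csv

    def someStatFunction(source):
        # Consume every row and accumulate whatever statistics are needed.
        for row in source:
            pass  # ...some processing...

    def someFilterFunction(source):
        # Generator: pass along only the rows that someFunction accepts.
        # someFunction is a placeholder for your own row predicate.
        for row in source:
            if someFunction(row):
                yield row

    # All rows (Python 3: open in text mode with newline="", not "rb")
    with open("someFile", newline="") as source:
        rdr = csv.reader(source)
        someStatFunction(rdr)

    # Filtered by someFilterFunction applied to each row
    with open("someFile", newline="") as source:
        rdr = csv.reader(source)
        someStatFunction(someFilterFunction(rdr))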
I really like being able to compose more complex functions from simpler functions.
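To make the composition idea concrete, here is a minimal, self-contained sketch of a sampling filter feeding a one-pass mean and variance. The function names (`sample_rows`, `column_as_float`, `mean_and_variance`) and the in-memory data are hypothetical, and the variance uses Welford's running update, which is one common choice rather than anything prescribed above.

```python
import csv
import io
import random


def sample_rows(source, fraction, rng):
    """Filter: yield each row independently with probability `fraction`."""
    for row in source:
        if rng.random() < fraction:
            yield row


def column_as_float(source, index):
    """Map: yield one column of each row, converted to float."""
    for row in source:
        yield float(row[index])


def mean_and_variance(values):
    """Reduce: one-pass (Welford) count, mean and sample variance."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# Hypothetical data standing in for one CSV file.
text = "1.0,10\n2.0,20\n3.0,30\n4.0,40\n"

# All rows, first column.
rdr = csv.reader(io.StringIO(text))
print(mean_and_variance(column_as_float(rdr, 0)))

# Roughly half the rows, chosen at random with a fixed seed.
rdr = csv.reader(io.StringIO(text))
sampled = sample_rows(rdr, 0.5, random.Random(0))
print(mean_and_variance(column_as_float(sampled, 0)))
```

Because every stage is a generator, only one row is held in memory at a time, so the same pipeline works unchanged on a 100,000-row file.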
For massive datasets you might be interested in ROOT. It can be used to analyze and very effectively store petabytes of data. It also comes with some basic and more advanced statistics tools.
While it is written to be used with C++, there are also pretty complete Python bindings. They don't make it extremely easy to get direct access to the raw data (e.g. to use them in R or NumPy), but it is definitely possible (I do it all the time).