I have a large amount of data (a few terabytes) and it is accumulating... It is contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging plus additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit large. Some candidate solutions are to (1) write the whole thing in C (or Fortran), (2) import the files into a relational database and then pull off chunks in R or Python, or (3) write the whole thing in Python.
Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case, since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations or further considerations? Thanks.
Edit: Thanks for your responses. There seem to be conflicting opinions about Hadoop, but in any case I don't have access to a cluster (though I can use several unnetworked machines)...
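For concreteness, here is a minimal sketch of what option (3) might look like: stream each tab-delimited file row by row, keep only rows matching a predicate, accumulate a per-group sum and count, and persist the result to SQLite. The file glob, column positions, and predicate below are all hypothetical placeholders, not part of the original question.

```python
import csv
import glob
import sqlite3
from collections import defaultdict

# Hypothetical layout: group key in column 0, a filter field in column 1,
# and a numeric value in column 2.
sums = defaultdict(float)
counts = defaultdict(int)

for path in glob.glob("data/*.txt"):  # assumed location of the ~30MB files
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row[1] == "some_condition":      # example predicate
                sums[row[0]] += float(row[2])
                counts[row[0]] += 1

# Persist the aggregates (sum and mean per group) to SQLite.
conn = sqlite3.connect("aggregates.db")
conn.execute("CREATE TABLE IF NOT EXISTS agg (grp TEXT, total REAL, mean REAL)")
conn.executemany(
    "INSERT INTO agg VALUES (?, ?, ?)",
    [(g, s, s / counts[g]) for g, s in sums.items()],
)
conn.commit()
conn.close()
```

Because only the running aggregates are held in memory, this scales to data far larger than RAM.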
Reading large text files in Python: you can use the file object itself as an iterator. The iterator returns one line at a time, so each line can be processed without loading the whole file into memory, which makes this approach suitable for large files.
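A minimal illustration of that pattern (the filename and the per-row processing are placeholders):

```python
# Iterating over the file object reads lines lazily, so memory use
# stays constant regardless of file size.
with open("observations.txt") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        # ... process one row at a time ...
```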
(3) is not necessarily a bad idea -- Python makes it easy to process "CSV" files (and despite the C standing for Comma, tab as a separator is just as easy to handle), and of course it gets just about as much bandwidth in I/O operations as any other language. As for other recommendations: numpy, besides fast computation (which you may not need, per your statements), provides very handy, flexible multi-dimensional arrays, which may be quite useful for your tasks; and the standard library module multiprocessing lets you exploit multiple cores for any task that's easy to parallelize (important since just about every machine these days has multiple cores ;-).
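A minimal sketch of that combination, assuming one worker process per file and hypothetical column positions (aggregate_file is an illustrative helper, not a library function):

```python
import csv
import glob
from collections import Counter
from multiprocessing import Pool

def aggregate_file(path):
    """Sum a numeric column per group key for one tab-delimited file."""
    totals = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            totals[row[0]] += float(row[2])   # hypothetical columns
    return totals

if __name__ == "__main__":
    paths = glob.glob("data/*.txt")           # assumed file location
    grand_total = Counter()
    with Pool() as pool:                      # one worker per core by default
        for partial in pool.map(aggregate_file, paths):
            grand_total.update(partial)       # merge per-file aggregates
    print(grand_total.most_common(5))
```

Since each file is only about 30MB, file-level parallelism is a natural unit of work here: each worker returns a small partial aggregate, and only the merge happens in the parent process.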