I have a large amount of data (a few terabytes) and it is accumulating... It is contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging plus additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit large. Some candidate solutions are to (1) write the whole thing in C (or Fortran), (2) import the files into a relational database and then pull off chunks in R or Python, or (3) write the whole thing in Python.
Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case, since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations or further considerations? Thanks.
Edit: Thanks for your responses. There seem to be conflicting opinions about Hadoop, but in any case I don't have access to a cluster (though I can use several unnetworked machines)...
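For concreteness, here is a minimal sketch of what option (3) might look like: stream each tab-delimited file row by row, keep only rows matching a predicate, accumulate a per-group sum and count, and persist the result to SQLite. The file glob, column positions, and predicate below are all hypothetical placeholders, not part of the original question.

```python
import csv
import glob
import sqlite3
from collections import defaultdict

# Hypothetical layout: group key in column 0, a filter field in column 1,
# and a numeric value in column 2.
sums = defaultdict(float)
counts = defaultdict(int)

for path in glob.glob("data/*.txt"):  # assumed location of the ~30MB files
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row[1] == "some_condition":      # example predicate
                sums[row[0]] += float(row[2])
                counts[row[0]] += 1

# Persist the aggregates (sum and mean per group) to SQLite.
conn = sqlite3.connect("aggregates.db")
conn.execute("CREATE TABLE IF NOT EXISTS agg (grp TEXT, total REAL, mean REAL)")
conn.executemany(
    "INSERT INTO agg VALUES (?, ?, ?)",
    [(g, s, s / counts[g]) for g, s in sums.items()],
)
conn.commit()
conn.close()
```

Because only the running aggregates are held in memory, this scales to data far larger than RAM.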
Reading large text files in Python: you can use the file object itself as an iterator. The iterator returns one line at a time, so each line can be processed without loading the whole file into memory, which makes this approach suitable for large files.
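A minimal illustration of that pattern (the filename and the per-row processing are placeholders):

```python
# Iterating over the file object reads lines lazily, so memory use
# stays constant regardless of file size.
with open("observations.txt") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        # ... process one row at a time ...
```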
(3) is not necessarily a bad idea -- Python makes it easy to process "CSV" files (and despite the C standing for Comma, tab as a separator is just as easy to handle), and of course it gets just about as much bandwidth in I/O operations as any other language. As for other recommendations: numpy, besides fast computation (which you may not need, per your statements), provides very handy, flexible multi-dimensional arrays, which may be quite useful for your tasks; and the standard library module multiprocessing lets you exploit multiple cores for any task that's easy to parallelize (important since just about every machine these days has multiple cores ;-).
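A minimal sketch of that combination, assuming one worker process per file and hypothetical column positions (aggregate_file is an illustrative helper, not a library function):

```python
import csv
import glob
from collections import Counter
from multiprocessing import Pool

def aggregate_file(path):
    """Sum a numeric column per group key for one tab-delimited file."""
    totals = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            totals[row[0]] += float(row[2])   # hypothetical columns
    return totals

if __name__ == "__main__":
    paths = glob.glob("data/*.txt")           # assumed file location
    grand_total = Counter()
    with Pool() as pool:                      # one worker per core by default
        for partial in pool.map(aggregate_file, paths):
            grand_total.update(partial)       # merge per-file aggregates
    print(grand_total.most_common(5))
```

Since each file is only about 30MB, file-level parallelism is a natural unit of work here: each worker returns a small partial aggregate, and only the merge happens in the parent process.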