large amount of data in many text files - how to process?

I have large amounts of data (a few terabytes) and they are accumulating... They are contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging plus additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit too large for it. Some candidate solutions are to

  1. write the whole thing in C (or Fortran)
  2. import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions)
  3. write the whole thing in Python

Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case, since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any further considerations or suggestions? Thanks
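
To make option (3) concrete, here is a minimal sketch of the kind of single streaming pass I have in mind -- the file glob, column positions, and predicate below are just placeholders, not my actual data:

    import csv
    import glob
    from collections import defaultdict

    # One streaming pass: read each ~30MB tab-delimited file row by row
    # and keep only running sums/counts, so memory use stays small no
    # matter how many terabytes accumulate.
    sums = defaultdict(float)
    counts = defaultdict(int)

    for path in glob.glob("data/*.txt"):        # placeholder file layout
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                key = row[0]                    # placeholder grouping column
                value = float(row[2])           # placeholder numeric column
                if value > 0:                   # placeholder predicate
                    sums[key] += value
                    counts[key] += 1

    means = {k: sums[k] / counts[k] for k in counts}
    print(len(means), "groups aggregated")

Since each row is discarded after it updates the running totals, memory stays flat regardless of how much data accumulates.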

Edit: Thanks for your responses. There seem to be conflicting opinions about Hadoop, but in any case I don't have access to a cluster (though I can use several non-networked machines)...

asked May 30 '10 by hatmatrix

1 Answer

(3) is not necessarily a bad idea -- Python makes it easy to process "CSV" files (and despite the C standing for Comma, a tab separator is just as easy to handle), and of course it gets just about as much bandwidth in I/O operations as any other language.

As for other recommendations: numpy, besides fast computation (which you may not need, per your statements), provides very handy, flexible multi-dimensional arrays that may be quite useful for your tasks; and the standard library module multiprocessing lets you exploit multiple cores for any task that's easy to parallelize (important, since just about every machine these days has multiple cores;-).
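
As a rough sketch of how those pieces could fit together -- tab-delimited csv plus a multiprocessing pool, with a made-up data/*.txt layout and a made-up count-by-first-column aggregate standing in for the real predicates:

    import csv
    import glob
    from collections import Counter
    from multiprocessing import Pool

    def summarize(path):
        # Per-file partial aggregate; each worker returns a small Counter
        # that the parent merges, so no process holds much data at once.
        total = Counter()
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):   # tab-separated "CSV"
                total[row[0]] += 1                       # made-up aggregate
        return total

    if __name__ == "__main__":
        merged = Counter()
        with Pool() as pool:        # defaults to one worker per core
            for partial in pool.imap_unordered(summarize, glob.glob("data/*.txt")):
                merged.update(partial)
        print(merged.most_common(10))

Each worker returns only a small partial result, so merging in the parent stays cheap; for purely numeric columns, numpy.loadtxt(path, delimiter="\t") could replace the inner loop and give each worker an array to sum or average directly.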

answered Sep 28 '22 by Alex Martelli