
Python: handling a large set of data. Scipy or Rpy? And how?

In my python environment, the Rpy and Scipy packages are already installed.

The problem I want to tackle is this:

1) A huge set of financial data is stored in a text file. Loading it into Excel is not possible.

2) I need to sum certain fields and get the totals.

3) I need to show the top 10 rows based on the totals.

Which package (Scipy or Rpy) is best suited for this task?

Could you also provide some pointers (e.g. documentation or online examples) that can help me implement a solution?

Speed is a concern. Ideally, Scipy and Rpy can handle large files even when the files are so large that they cannot fit into memory.

Anthony Kong asked Jul 28 '11 03:07



2 Answers

Neither Rpy nor Scipy is necessary, although numpy may make it a bit easier. This problem seems ideally suited to a line-by-line parser: open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums, and move on to the next line. Only one row is held in memory at a time, so the size of the file does not matter.
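The line-by-line approach described above might look roughly like the following. This is only a sketch under assumptions that are not in the question: the file is comma-separated, the fields to be summed are numeric, and the column indices are known in advance. The name sum_columns and the sample data are hypothetical, and plain float() parsing stands in for numpy.fromstring so the example runs without numpy.

```python
import io

def sum_columns(lines, columns, sep=","):
    """Accumulate running totals for the given column indices, one line at a time."""
    totals = [0.0] * len(columns)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        for i, col in enumerate(columns):
            totals[i] += float(fields[col])
    return totals

# Hypothetical in-memory sample standing in for the real data file;
# an open file object can be passed in exactly the same way.
sample = io.StringIO("1.5,2.0,10\n2.5,3.0,20\n")
column_totals = sum_columns(sample, columns=[0, 2])
print(column_totals)
```

Because the function only iterates over its input, passing it an open file object keeps memory use constant regardless of file size.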

so12311 answered Sep 24 '22 04:09



Python's file I/O performs reasonably well, so you can use the built-in file type directly. You can see which methods it offers by typing help(file) in the interactive interpreter. Opening a file is part of the core language and doesn't require importing anything.

Something like:

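# Raw string so the backslash in the Windows path is taken literally
with open(r"C:\BigScaryFinancialData.txt", "r") as f:
    # Iterating over the file object reads one line at a time,
    # so the whole file never has to fit in memory (readlines() would load it all)
    for line in f:
        # line is a string
        # do whatever you want to do on a per-line basis here, for example:
        print(len(line))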

Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.

I'll leave it to you to figure out how to compute the row sums and show the top 10 rows. Both can be done with simple program logic and no special libraries. Of course, if the rows have complicated formatting that makes the values difficult to parse out, you might want to use a module for parsing, such as re (type help(re) into the interactive interpreter).
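For what it's worth, one possible way to do the top-10 step without holding the whole file in memory is the standard library's heapq.nlargest, which keeps only the n best candidates while it consumes an iterator. The sketch below assumes comma-separated rows, numeric fields at known column indices, and that "total" means the per-row sum of those fields. The function name top_rows and the sample data are hypothetical.

```python
import heapq
import io

def top_rows(lines, columns, n=10, sep=","):
    """Return the n (total, row) pairs with the largest per-row totals."""
    def scored():
        for line in lines:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            fields = line.split(sep)
            total = sum(float(fields[c]) for c in columns)
            yield total, line
    # nlargest consumes the generator lazily and retains only n items
    return heapq.nlargest(n, scored(), key=lambda pair: pair[0])

# Hypothetical sample rows: a label followed by two numeric fields
sample = io.StringIO("a,1,2\nb,5,5\nc,1,0\n")
best = top_rows(sample, columns=[1, 2], n=2)
for total, row in best:
    print(total, row)
```

With a real file, the same call would take the open file object in place of the sample, e.g. top_rows(f, columns=[1, 2]).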

Gravity answered Sep 20 '22 04:09
