
Handle large data pools in Python

I'm working on an academic project aimed at studying people's behavior.

The project will be divided into three parts:

  1. A program to read the data from some remote sources, and build a local data pool with it.
  2. A program to validate this data pool and keep it coherent.
  3. A web interface to allow people to read/manipulate the data.

The data consists of a list of people, all with an ID #, and with several characteristics: height, weight, age, ...

I need to easily make groups out of this data (e.g. everyone of a given age, or within a range of heights). The data is several TB in size, but it can be reduced to smaller subsets of 2-3 GB.

I have a strong background in the theory behind the project, but I'm not a computer scientist. I know Java, C and Matlab, and now I'm learning Python.

I would like to use Python since it seems easy enough and is much less verbose than Java. The problem is that I'm not sure how to handle the data pool.

I'm no expert in databases, but I guess I need one here. What tools do you think I should use?

Keep in mind that the aim is to implement very advanced mathematical functions on sets of data, so we want to keep the source code as simple as possible. Speed is not an issue.

Mascarpone asked Jan 20 '23 02:01



2 Answers

It sounds like the main functionality you need can be found in PyTables and SciPy/NumPy.
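As a rough illustration of the grouping described in the question (everyone of a given age, or within a range of heights), here is a minimal NumPy sketch. The field names, units and sample values are assumptions made up for the example, not part of the original data.

```python
import numpy as np

# Hypothetical record layout; field names and units are assumptions.
person_dtype = np.dtype([
    ("id", np.int64),
    ("age", np.int32),
    ("height", np.float64),  # assumed to be in cm
    ("weight", np.float64),  # assumed to be in kg
])

# Tiny made-up sample standing in for one 2-3 GB subset.
people = np.array(
    [
        (1, 30, 172.0, 68.5),
        (2, 45, 181.5, 90.0),
        (3, 30, 165.0, 55.2),
        (4, 22, 178.0, 74.1),
    ],
    dtype=person_dtype,
)


def select_by_age(data, age):
    """Return all records whose age equals the given value."""
    return data[data["age"] == age]


def select_by_height_range(data, low, high):
    """Return all records with low <= height <= high."""
    mask = (data["height"] >= low) & (data["height"] <= high)
    return data[mask]


same_age = select_by_age(people, 30)
mid_height = select_by_height_range(people, 170.0, 180.0)
```

For subsets that do not fit in memory, PyTables keeps tables of this kind on disk in HDF5 files, and its `Table.read_where` method accepts a condition string such as `'(height >= 170) & (height <= 180)'`, so the same kind of selection should carry over.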

eat answered Jan 29 '23 20:01



Go with a NoSQL database like MongoDB, which makes handling data in a case like this much easier than having to learn SQL.
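To give an idea of what the queries look like: MongoDB filters are plain dictionaries in Python. The sketch below only builds the filter documents, so it runs without a server. The helper names, field names, and the database and collection names in the comments are hypothetical.

```python
def exact_match(field, value):
    """Build a filter matching documents where field equals value."""
    return {field: value}


def in_range(field, low, high):
    """Build a filter matching documents where low <= field <= high."""
    return {field: {"$gte": low, "$lte": high}}


age_filter = exact_match("age", 30)
height_filter = in_range("height", 170, 180)

# With pymongo installed and a MongoDB server running, the filters would be
# passed to find(). The database/collection names below are assumptions:
#
#   from pymongo import MongoClient
#   people = MongoClient()["study"]["people"]
#   for person in people.find(height_filter):
#       print(person["_id"], person["height"])
```

The `$gte` and `$lte` operators are standard MongoDB comparison operators, so a group such as "height between 170 and 180" is a single dictionary rather than an SQL statement.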

Andreas Jung answered Jan 29 '23 21:01
