
Python tools for out-of-core computation/data mining

I am interested in using Python to mine data sets that are too big to fit in RAM but fit on a single hard drive.

I understand that I can export the data as HDF5 files using PyTables. Also, numexpr allows for some basic out-of-core computation.

What would come next? Mini-batching when possible, and relying on linear algebra results to decompose the computation when mini-batching cannot be used?
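One way to read "relying on linear algebra results to decompose the computation" is ordinary least squares: X^T X and X^T y are sums over row blocks, so they can be accumulated one block at a time. The sketch below assumes that reading. The `make_chunks` generator is a hypothetical stand-in for reading blocks from disk and produces synthetic data.

```python
import numpy as np

def chunked_least_squares(chunks, n_features):
    """Solve min ||X w - y||^2 without holding all of X in memory.

    `chunks` is any iterable yielding (X_block, y_block) pairs.
    """
    xtx = np.zeros((n_features, n_features))
    xty = np.zeros(n_features)
    for X_block, y_block in chunks:
        # Each block contributes an additive term to the normal equations.
        xtx += X_block.T @ X_block
        xty += X_block.T @ y_block
    return np.linalg.solve(xtx, xty)

def make_chunks(n_chunks=20, rows=500, seed=0):
    # Hypothetical data source: synthetic blocks instead of blocks read from disk.
    rng = np.random.default_rng(seed)
    true_w = np.array([1.5, -2.0, 0.5])
    for _ in range(n_chunks):
        X = rng.normal(size=(rows, 3))
        y = X @ true_w + 0.01 * rng.normal(size=rows)
        yield X, y

w = chunked_least_squares(make_chunks(), n_features=3)
```

Only the small `n_features x n_features` accumulator stays in memory, so the approach works as long as the number of features is modest. Forming the normal equations can be numerically poor for ill-conditioned problems.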

Or are there some higher level tools I have missed?

Thanks for any insights.

user17375, asked Jan 23 '13 14:01


2 Answers

What exactly do you want to do? Can you give an example or two, please?

numpy.memmap is easy to use. From its documentation:

Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. Numpy's memmap's are array-like objects ...

See also numpy+memmap on SO.
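As a minimal sketch of how numpy.memmap might be used here: write an array to disk block by block, reopen it read-only, and reduce it in slices. The file name, shape, and block size are assumptions chosen for illustration.

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.bin")  # hypothetical file name
shape = (1000, 4)
block = 250

# Create the file on disk and fill it one block at a time.
out = np.memmap(path, dtype="float64", mode="w+", shape=shape)
for start in range(0, shape[0], block):
    out[start:start + block] = np.arange(start, start + block)[:, None]
out.flush()
del out

# Reopen read-only and compute column means from slices of the file.
data = np.memmap(path, dtype="float64", mode="r", shape=shape)
col_sum = np.zeros(shape[1])
for start in range(0, shape[0], block):
    col_sum += data[start:start + block].sum(axis=0)
col_mean = col_sum / shape[0]
```

The raw file stores no dtype or shape, so both have to be supplied again when it is reopened.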

The scikit-learn people are very knowledgeable, but prefer specific questions.

denis, answered Oct 23 '22 06:10


In sklearn 0.14 (to be released in the coming days) there is a full-fledged example of out-of-core classification of text documents.

I think it would be a good example to start with:

http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html

In the next release we'll extend this example with more classifiers and add documentation in the user guide.

NB: you can reproduce this example with 0.13 too; all the building blocks were already there.
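For a rough idea of the kind of building blocks involved, here is a minimal sketch, not the linked example itself. It assumes the usual pattern of a stateless `HashingVectorizer` feeding an estimator that supports `partial_fit`. The `stream_batches` generator and its toy documents are hypothetical stand-ins for reading batches of text from disk.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def stream_batches(n_batches=5):
    # Hypothetical data source: the same toy batch repeated,
    # standing in for batches of documents read from disk.
    texts = ["cheap pills buy now", "meeting agenda for monday",
             "win money fast", "quarterly report attached"]
    labels = [1, 0, 1, 0]
    for _ in range(n_batches):
        yield texts, labels

# Hashing needs no fitted vocabulary, so each batch can be transformed on its own.
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(random_state=0)

for texts, labels in stream_batches():
    X = vectorizer.transform(texts)
    # All classes must be declared up front, since a batch may not contain every label.
    clf.partial_fit(X, labels, classes=[0, 1])

pred = clf.predict(vectorizer.transform(["buy cheap pills", "agenda for the meeting"]))
```

Only one batch is held in memory at a time, so the full corpus never has to be loaded.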

oDDsKooL, answered Oct 23 '22 04:10