I am interested in using Python to mine data sets that are too big to fit in RAM but fit on a single hard drive.
I understand that I can export the data as HDF5 files using PyTables. Also, numexpr allows for some basic out-of-core computation.
What would come next? Mini-batching when possible, and relying on linear algebra results to decompose the computation when mini-batching cannot be used?
Or are there some higher level tools I have missed?
Thanks for insights,
Scikit-learn is a free software library for machine learning in Python that provides tools for data mining and data analysis. It offers features such as classification, regression, clustering, preprocessing, model selection and dimensionality reduction.
Python's ease of use, coupled with its many powerful modules, makes it a versatile tool for data mining and analysis.
Data mining is the process of discovering predictive information from the analysis of large databases. For a data scientist, data mining can be a vague and daunting task – it requires a diverse set of skills and knowledge of many data mining techniques to take raw data and successfully get insights from it.
What exactly do you want to do? Can you give an example or two?
numpy.memmap is easy to use. From its documentation:
Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. Numpy's memmap's are array-like objects ...
See also the numpy+memmap questions on SO.
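As a rough illustration of the memmap approach, here is a minimal sketch that computes per-column means of a raw binary matrix one block of rows at a time. The function name `chunked_column_means`, the file name `demo.dat`, the chunk size and the small demo matrix are all made up for this example; a real file would be far larger than RAM.

```python
import os
import tempfile

import numpy as np


def chunked_column_means(path, n_rows, n_cols, chunk_rows=10_000, dtype=np.float64):
    """Per-column means of a raw binary matrix, read one block of rows at a time."""
    # mode="r" maps the file read-only; nothing is loaded until it is sliced.
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows, n_cols))
    totals = np.zeros(n_cols, dtype=np.float64)
    for start in range(0, n_rows, chunk_rows):
        chunk = data[start:start + chunk_rows]  # only this slice is paged in
        totals += chunk.sum(axis=0)
    del data  # release the mapping
    return totals / n_rows


# Demo: write a small matrix to disk, then process it in chunks.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")  # hypothetical file name
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 3))
writer[:] = np.arange(3000, dtype=np.float64).reshape(1000, 3)
writer.flush()
del writer

means = chunked_column_means(path, n_rows=1000, n_cols=3, chunk_rows=128)
print(means)
```

The same loop works for any statistic that can be accumulated block by block (sums, counts, min/max); statistics that need all the data at once have to be decomposed first.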
The scikit-learn people are very knowledgeable, but prefer specific questions.
In sklearn 0.14 (to be released in the coming days) there is a full-fledged example of out-of-core classification of text documents.
I think it could be a great example to start with:
http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html
In the next release we'll extend this example with more classifiers and add documentation in the user guide.
NB: you can reproduce this example with 0.13 too, all the building blocks were already there.
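The general pattern behind that example can be sketched as follows: a stateless `HashingVectorizer` turns each mini-batch of text into features, and a classifier that supports `partial_fit` is updated batch by batch. This is only a sketch under assumptions, not the linked example itself: the toy documents, the labels and the helper `iter_minibatches` are invented here, and the in-memory list stands in for a stream read from disk.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_minibatches(docs, labels, batch_size):
    """Yield (docs, labels) slices; a real version would stream from disk."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]


# Toy data (made up for illustration): 1 = spam, 0 = not spam.
docs = [
    "cheap pills buy now", "meeting moved to monday",
    "win money fast buy now", "lunch on friday with the team",
    "buy cheap watches now", "project review meeting notes",
]
labels = [1, 0, 1, 0, 1, 0]
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

# HashingVectorizer keeps no vocabulary, so it never needs to see all the data.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(random_state=0)

for batch_docs, batch_labels in iter_minibatches(docs, labels, batch_size=2):
    X = vectorizer.transform(batch_docs)  # sparse matrix for this batch only
    clf.partial_fit(X, batch_labels, classes=all_classes)

prediction = clf.predict(vectorizer.transform(["buy cheap pills"]))
print(prediction)
```

Only one batch is held in memory at a time, so the size of the full data set is limited by disk rather than by RAM.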