What is the correct way to prepare dataset for machine learning? [closed]

First of all, thank you for reading this post.

I am a noob when it comes to machine learning and I am trying to use ML to classify some data. So far I have done some basic reading on supervised and unsupervised learning algorithms such as decision trees, clustering, neural networks, etc.

What I'm struggling to understand is the correct overall procedure for preparing datasets for an ML problem.

How do I prepare the dataset for ML so that I can measure the accuracy of the algorithms?

My current understanding is that, to assess accuracy, the algorithm should be fed pre-labelled examples (from a significant subset of the dataset?) so that the difference between the expected outcome and the algorithm's decision can be measured.

If this is correct then how does one go about pre-labelling large datasets? My dataset is quite big and manual labelling is not feasible.

Also, any tips on doing machine learning in Python would be much appreciated!

Thank you very much for your help in advance!

Best regards,

Mike

Mike asked Oct 14 '13 12:10



1 Answer

Preparing the data is the most important part of any machine learning project. You need to build your dataset, extract or construct features, and then scale and normalize them.

If you want to use a supervised learning algorithm, you need labeled data. There are several ways to get it:

  1. Label it by hand.
  2. Use an unsupervised learning algorithm to label the data.
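As a rough illustration of the second option, here is a minimal sketch that clusters unlabelled points with scikit-learn's KMeans and treats the cluster ids as provisional labels. The synthetic data from make_blobs and the choice of three clusters are assumptions made for the example; with real data you would have to pick the number of clusters yourself and check a few samples per cluster by hand, since the cluster ids carry no meaning on their own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for an unlabelled dataset (assumption: 3 natural groups).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Cluster the data; every sample receives a cluster id in {0, 1, 2}.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
provisional_labels = kmeans.fit_predict(X)

# Cluster ids are arbitrary, so list a few members of each cluster
# to inspect by hand and map the id to a real class name.
for cluster_id in range(3):
    members = np.where(provisional_labels == cluster_id)[0]
    print(cluster_id, len(members), members[:5])
```

Labels obtained this way are only as good as the clustering, so it is still worth hand-labelling a small sample to check them before training a supervised model on them.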

You will want a Python machine learning toolkit, for example scikit-learn. scikit-learn contains many useful tools for data wrangling, feature extraction and preprocessing. For example, it can vectorize your data with DictVectorizer. You can also fill in missing values and scale and normalize features using only scikit-learn.
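A small sketch of those preprocessing steps, assuming your raw data is a list of dicts. The records and their field names ("colour", "weight", "length") are made up for the example. It uses DictVectorizer to turn the dicts into a numeric array, SimpleImputer to fill in a missing value with the column mean, and StandardScaler to give every feature zero mean and unit variance.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical raw records; NaN marks a value that was not recorded.
records = [
    {"colour": "red", "weight": 1.2, "length": 10.0},
    {"colour": "blue", "weight": 0.8, "length": float("nan")},
    {"colour": "red", "weight": 1.5, "length": 14.0},
]

# 1. Vectorize: string values are one-hot encoded, numbers pass through.
vec = DictVectorizer(sparse=False)
X_raw = vec.fit_transform(records)
print(vec.feature_names_)

# 2. Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_raw)

# 3. Scale every column to zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(X_filled)
print(X)
```

On real data, fit the imputer and the scaler on the training portion only and reuse them with transform on the test portion, so that no information from the test set leaks into training.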

I recommend starting with the examples here: http://scikit-learn.org/stable/

Evgeny Lazin answered Nov 13 '22 04:11
