What is the correct way to prepare dataset for machine learning? [closed]

Question

First of all, thank you for reading this post.

I am a beginner when it comes to machine learning and I am trying to use ML to classify some data. I have done some basic reading on supervised and unsupervised learning algorithms such as decision trees, clustering, neural networks, etc.

What I'm struggling to understand is the correct overall procedure for preparing datasets for an ML problem.

How do I prepare the dataset for ML so that I can measure the accuracy of the algorithms?

My current understanding is that, to assess accuracy, the algorithm should be fed pre-labelled examples (from a significant subset of the dataset?) so that the difference between the expected outcome and the algorithm's decision can be measured.

If this is correct then how does one go about pre-labelling large datasets? My dataset is quite big and manual labelling is not feasible.

Also, any tips on doing machine learning in Python would be much appreciated!

Thank you very much for your help in advance!

Best regards,

Mike

Evgeny Lazin · Accepted Answer

Preparing the data is the most important part of any machine learning project. You need to build your dataset, extract or construct features from it, and then scale and normalize those features.

If you want to use a supervised learning algorithm, you need labeled data. There are several ways to get it:

Label it by hand.
Use an unsupervised learning algorithm to label the data.
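
As a rough illustration of the second option, here is a minimal sketch that uses k-means clustering from scikit-learn to assign provisional labels. The data is synthetic and the choice of two clusters is an assumption made for the example; cluster ids are arbitrary, so it is worth checking a few samples from each cluster by hand before treating them as real labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Cluster into two groups (assumed number of classes);
# the cluster index serves as a provisional label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(X)

# Summarize each cluster so a human can decide what it actually means.
for cluster_id in np.unique(pseudo_labels):
    members = X[pseudo_labels == cluster_id]
    print(cluster_id, len(members), members.mean(axis=0).round(2))
```

Labels obtained this way are only as good as the clustering, so a small hand-labeled sample is still useful for checking them.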

You will want a Python machine-learning toolkit, for example scikit-learn. scikit-learn contains many useful tools for data munging, feature extraction and preprocessing. For example, it can vectorize your data with DictVectorizer. You can fill in missing values, scale and normalize features using only scikit-learn.
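
To make this concrete, here is a minimal sketch of such a preprocessing chain, ending with an accuracy measurement on held-out labeled data, which is what the question asks about. The records, the feature names ("size", "colour") and the choice of LogisticRegression are all made up for illustration; only the scikit-learn classes themselves are real.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical labeled records: one numeric and one categorical feature.
rng = np.random.default_rng(0)
records, labels = [], []
for i in range(200):
    label = i % 2
    size = rng.normal(loc=10.0 + 5.0 * label, scale=1.0)
    if i % 15 == 0:
        size = float("nan")  # simulate a missing measurement
    red_probability = 0.8 if label else 0.2
    colour = "red" if rng.random() < red_probability else "blue"
    records.append({"size": size, "colour": colour})
    labels.append(label)

# Hold out 25% of the labeled data; accuracy is measured only on this part.
train_records, test_records, train_labels, test_labels = train_test_split(
    records, labels, test_size=0.25, stratify=labels, random_state=0
)

model = make_pipeline(
    DictVectorizer(sparse=False),     # dicts -> numeric matrix, one-hot for strings
    SimpleImputer(strategy="mean"),   # replace NaN with the column mean
    StandardScaler(),                 # zero mean, unit variance per feature
    LogisticRegression(),             # any classifier could go here
)
model.fit(train_records, train_labels)

accuracy = accuracy_score(test_labels, model.predict(test_records))
print(f"held-out accuracy: {accuracy:.2f}")
```

Putting the preprocessing steps inside the pipeline means the imputer and scaler are fitted on the training split only, so the held-out accuracy is not inflated by information leaking from the test data.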

I recommend starting with the examples at http://scikit-learn.org/stable/
