First of all, thank you for reading this post.
I am a noob when it comes to machine learning and I am trying to use ML to classify some data. I have done some basic reading on supervised and unsupervised learning algorithms such as decision trees, clustering, neural networks, etc.
What I'm struggling to understand is the correct overall procedure for preparing datasets for a ML problem.
How do I prepare the dataset for ML so that I can measure the accuracy of the algorithms?
My current understanding is that to assess accuracy, the algorithm should be fed with pre-labelled results (from a significant subset of the dataset?) so as to assess the difference between the expected outcome and the algorithm's decision?
If this is correct then how does one go about pre-labelling large datasets? My dataset is quite big and manual labelling is not feasible.
Also, any tips on doing machine learning in Python would be much appreciated!
Thank you very much for your help in advance!
Best regards,
Mike
Preparing the data is the most important part of any machine learning workflow. You need to build your dataset, then extract, construct, scale, and normalize its features.
If you want to use a supervised learning algorithm, you need labeled data. There are several ways to achieve this.
For Python, use a machine-learning toolkit such as scikit-learn. It contains many useful tools for data munging, feature extraction, and preprocessing. For example, it can vectorize your data with DictVectorizer, and it can also impute missing values and scale and normalize features.
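As a rough illustration of those preprocessing steps, here is a minimal sketch using scikit-learn's DictVectorizer, SimpleImputer, and StandardScaler. The records, the feature names ("color", "size"), and the choice of mean imputation are made up for the example, and the missing value is simulated by hand, so treat this as one possible arrangement rather than the required one.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical raw records: one categorical and one numeric feature.
records = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 3.0},
    {"color": "red", "size": 5.0},
]

# DictVectorizer one-hot encodes string values and passes numbers through.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)
size_idx = vectorizer.vocabulary_["size"]

# Simulate a missing measurement in the last record (assumption for the demo).
X[2, size_idx] = np.nan

# Replace missing values with the column mean of the observed values.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Scale every column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

print(vectorizer.feature_names_)
print(X_scaled)
```

In real use you would fit the vectorizer, imputer, and scaler on the training data only and then apply them to the test data with transform, so that no information from the test set leaks into preprocessing.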
I recommend starting with the examples here: http://scikit-learn.org/stable/
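On measuring accuracy: the usual approach is roughly what you describe. You hold back part of the labeled data, train on the rest, and compare the predictions on the held-back part with the known labels. The sketch below assumes scikit-learn's bundled iris dataset as a stand-in for your own labeled data, and the decision tree and the 75/25 split are arbitrary choices made for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in labeled dataset (assumption): features X and known labels y.
X, y = load_iris(return_X_y=True)

# Hold back 25% of the labeled examples for testing; stratify keeps the
# class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train only on the training part.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Compare predictions on the held-out part with the expected labels.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Only the subset you train and test on needs labels, so with a very large dataset you can label a manageable sample and leave the rest unlabeled.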