
Spark Multi Label classification

I am looking to implement a multi-label classification algorithm with multiple outputs in Spark, but I am surprised that there isn't any model in the Spark machine learning libraries that can do this.

How can I do this with Spark?

Alternatively, scikit-learn's Logistic Regression supports multi-label classification for both input and output, but it doesn't scale to a huge amount of training data.

To view the scikit-learn code, see the following link: https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc

Mohamed Karim Bouaziz asked Aug 26 '16 13:08




1 Answer

Spark also has Logistic Regression, which supports multilabel classification according to the API documentation. See also this.

The problem you have with scikit-learn on a huge amount of training data will disappear with Spark, provided you use an appropriate Spark configuration.

Another approach is to train one binary classifier per label in your problem, and obtain the multilabel output by running a relevant/irrelevant prediction for each label. You can easily do that in Spark using any binary classifier.
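As a rough illustration of the one-classifier-per-label idea, here is a minimal PySpark sketch. It is not taken from the original answer: the function names, the `features` column name, the `pred_<label>` / `prob_<label>` / `raw_<label>` output column names, and the assumption that each label is stored as its own 0/1 numeric column are all hypothetical choices made for this example. The PySpark import is deferred so that any other binary classifier can be swapped in through `make_classifier`.

```python
def default_classifier(label, features_col):
    """Build a binary LogisticRegression for one label column (assumed 0/1)."""
    # Deferred import: only needed when the default classifier is used.
    from pyspark.ml.classification import LogisticRegression

    # Give each label its own output columns so the per-label models
    # can all be applied to the same DataFrame without name clashes.
    return LogisticRegression(
        featuresCol=features_col,
        labelCol=label,
        predictionCol="pred_" + label,
        probabilityCol="prob_" + label,
        rawPredictionCol="raw_" + label,
    )


def fit_binary_relevance(train_df, label_cols, features_col="features",
                         make_classifier=default_classifier):
    """Fit one independent binary model per label; returns {label: model}."""
    return {
        label: make_classifier(label, features_col).fit(train_df)
        for label in label_cols
    }


def predict_binary_relevance(df, models):
    """Apply every per-label model in turn; each adds its own output columns."""
    for model in models.values():
        df = model.transform(df)
    return df
```

Assuming a DataFrame `train_df` with a `features` vector column and 0/1 columns such as `label_a` and `label_b`, `fit_binary_relevance(train_df, ["label_a", "label_b"])` would train two models, and `predict_binary_relevance(test_df, models)` would add one `pred_<label>` column per label. The set of labels predicted for a row is then every label whose prediction column equals 1.0.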

Indirectly related, multilabel categorization with nearest neighbors, which is also state-of-the-art, might help as well. There are nearest-neighbor extensions for Spark, such as Spark KNN or Spark KNN graphs.

marilena.oita answered Oct 05 '22 04:10