I am looking to implement with Spark, a multi label classification algorithm with multi output, but I am surprised that there isn’t any model in Spark Machine Learning libraries that can do this.
How can I do this with Spark ?
Otherwise Scikit Learn Logistic Regresssion support multi label classification in input/output , but doesn't support a huge data for training.
to view the code in scikit learn, please click on the following link: https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc
The classifier makes the assumption that each new crime description is assigned to one and only one category. This is multi-class text classification problem. To solve this problem, we will use a variety of feature extraction technique along with different supervised machine learning algorithms in Spark. Let’s get started!
Several approaches can be used to perform a multilabel classification, the one employed here will be MLKnn, which is an adaptation of the famous Knn algorithm, just like its predecessor MLKnn infers the classes of the target based on the distance between it and the data from the training base but assuming it may belong to none or all the classes.
In addition, Apache Spark is fast enough to perform exploratory queries without sampling. Many industry experts have provided all the reasons why you should use Spark for Machine Learning? So, here we are now, using Spark Machine Learning Library to solve a multi-class text classification problem, in particular, PySpark.
Spark Machine Learning Pipelines API is similar to Scikit-Learn. Our pipeline includes three steps: StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.
Also in Spark there is Logistic Regression that supports multilabel classification based on the api documentation. See also this.
The problem that you have on scikitlearn for the huge amount of training data will disappear with spark, using an appropriate Spark configuration.
Another approach is to use binary classifiers for each of the labels that your problem has, and get multilabel by running relevant-irrelevant predictions for that label. You can easily do that in Spark using any binary classifier.
Indirectly, what might also be of help, is to use multilabel categorization with nearest-neighbors, which is also state-of-the-art. Some nearest neighbors Spark extensions, like Spark KNN or Spark KNN graphs, for instance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With