Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Automatic labeling of LDA generated topics

I'm trying to categorize customer feedback and I ran an LDA in python and got the following output for 10 topics:

(0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" + 0.011*"thanks" + 0.010*"city" + 0.009*"way"')
(1, u'0.397*"package" + 0.073*"address" + 0.055*"time" + 0.047*"customer" + 0.045*"apartment" + 0.037*"delivery" + 0.031*"number" + 0.026*"item" + 0.021*"support" + 0.018*"door"')
(2, u'0.190*"time" + 0.127*"order" + 0.113*"minute" + 0.075*"pickup" + 0.074*"restaurant" + 0.031*"food" + 0.027*"support" + 0.027*"delivery" + 0.026*"pick" + 0.018*"min"')
(3, u'0.072*"code" + 0.067*"gps" + 0.053*"map" + 0.050*"street" + 0.047*"building" + 0.043*"address" + 0.042*"navigation" + 0.039*"access" + 0.035*"point" + 0.028*"gate"')
(4, u'0.434*"hour" + 0.068*"time" + 0.034*"min" + 0.032*"amount" + 0.024*"pay" + 0.019*"gas" + 0.018*"road" + 0.017*"today" + 0.016*"traffic" + 0.014*"load"')
(5, u'0.245*"route" + 0.154*"warehouse" + 0.043*"minute" + 0.039*"need" + 0.039*"today" + 0.026*"box" + 0.025*"facility" + 0.025*"bag" + 0.022*"end" + 0.020*"manager"')
(6, u'0.371*"location" + 0.110*"pick" + 0.097*"system" + 0.040*"im" + 0.038*"employee" + 0.022*"evening" + 0.018*"issue" + 0.015*"request" + 0.014*"while" + 0.013*"delivers"')
(7, u'0.182*"schedule" + 0.181*"please" + 0.059*"morning" + 0.050*"application" + 0.040*"payment" + 0.026*"change" + 0.025*"advance" + 0.025*"slot" + 0.020*"date" + 0.020*"tomorrow"')
(8, u'0.138*"stop" + 0.110*"work" + 0.062*"name" + 0.055*"account" + 0.046*"home" + 0.043*"guy" + 0.030*"address" + 0.026*"city" + 0.025*"everything" + 0.025*"feature"') 

Is there a way to automatically label them? I do have a csv file which has feedbacks manually labeled, but I do not want to supply these labels myself. I want the model to create labels. Is it possible?

like image 935
Arman Avatar asked May 15 '17 17:05

Arman


1 Answers

The comments here link to another SO answer that links to a paper. Let's say you wanted to do the minimum to try to make this work. Here is an MVP-style solution that has worked for me: search Google for the terms, then look for keywords in the response.

Here is some working, though hacky, code:

pip install cssselect

then

from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
from collections import Counter


def get_srp_text(search_term):
    raw = get(f"https://www.google.com/search?q={topic_terms}").text
    page = fromstring(raw)


    blob = ""

    for result in page.cssselect("a"):
        for res in result.findall("div"):
            blob += ' '
            blob += res.text if res.text else " "
            blob += ' '
    return blob


def blob_cleaner(blob):
    clean_blob = blob.replace(r'[\/,\(,\),\:,_,-,\-]', ' ')
    return ''.join(e for e in blob if e.isalnum() or e.isspace())


def get_name_from_srp_blob(clean_blob):
    blob_tokens = list(filter(bool, map(lambda x: x if len(x) > 2 else '', clean_blob.split(' '))))
    c = Counter(blob_tokens)
    most_common = c.most_common(10)

    name = f"{most_common[0][0]}-{most_common[1][0]}"
    return name

pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))

Then you can just get the topic words from your model, e.g.

topic_terms = "delivery area mile option partner traffic hub thanks city way"

name = pipeline(topic_terms)
print(name)

>>> City-Transportation

and

topic_terms = "package address time customer apartment delivery number item support door"

name = pipeline(topic_terms)
print(name)

>>> Parcel-Package

You could improve this up a lot. For example, you could use POS tags to only find the most common nouns, then use those for the name. Or find the most common adjective and noun, and make the name "Adjective Noun". Even better, you could get the text from the linked sites, then run YAKE to extract keywords.

Regardless, this demonstrates a simple way to automatically name clusters, without directly using machine learning (though, Google is most certainly using it to generate the search results, so you are benefitting from it).

like image 86
Sam H. Avatar answered Sep 19 '22 16:09

Sam H.