How to improve word assignment to different topics in LDA

I am working on a non-English language (Urdu) and have scraped data from different sources. I have done my preprocessing: punctuation removal, stop-word removal, and tokenization. Now I want to extract domain-specific lexicons. Say I have data related to sports, entertainment, etc., and I want to extract words that belong to these particular fields (like cricket) and place them in the topics they are most closely related to. I tried LDA for this, but I am not getting clean clusters: a word that belongs to one topic also appears in other topics.

How can I improve my results?

    # Assumed imports: gensim for the dictionary and LDA model;
    # UrduCorpusReader and remove_urdu_stopwords are custom helpers
    # defined elsewhere in the project.
    from gensim import corpora, models

    # URDU STOP WORDS REMOVAL
    doc_clean = []
    stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])
    stopwords = stopwords_corpus.words()

    # wordlists is assumed to be a corpus reader over the scraped
    # documents, e.g. wordlists = UrduCorpusReader('./data', fileids)
    for infile in wordlists.fileids():
        words = wordlists.words(infile)
        finalized_words = remove_urdu_stopwords(stopwords, words)
        # list.append() returns None, so don't assign its result
        doc_clean.append(finalized_words)

        print("\n==== WITHOUT STOPWORDS ===========\n")
        print(finalized_words)

    # build the dictionary and convert the tokenized documents
    # into a document-term matrix (bag-of-words corpus)
    dictionary = corpora.Dictionary(doc_clean)
    matrx = [dictionary.doc2bow(text) for text in doc_clean]

    # generate the LDA model
    lda = models.ldamodel.LdaModel(corpus=matrx, id2word=dictionary,
                                   num_topics=5, passes=10)
    for top in lda.print_topics():
        print("\n=== topics from files ===\n")
        print(top)


1 Answer

LDA and its drawbacks: The idea of LDA is to uncover latent topics in your corpus. One drawback of this unsupervised machine-learning approach is that you will end up with topics that may be hard for humans to interpret. Another is that you will most likely end up with some generic topics containing words that appear in every document (like 'introduction', 'date', 'author', etc.). Thirdly, you will not be able to uncover latent topics that are simply not present enough: if you have only one article about cricket, the algorithm will not recognise it.
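
A practical mitigation for the generic-topic problem is to prune the dictionary before training, so that words appearing in almost every document (or in almost none) never reach the model. Below is a minimal sketch using gensim's Dictionary.filter_extremes; the thresholds are assumptions you would tune for your corpus:

    from gensim import corpora

    # doc_clean: the list of tokenized documents from the question's code
    dictionary = corpora.Dictionary(doc_clean)

    # Drop tokens occurring in fewer than 5 documents (too rare to form
    # a topic) or in more than 50% of documents (too generic to be
    # informative). These cut-offs are illustrative, not prescriptive.
    dictionary.filter_extremes(no_below=5, no_above=0.5)

    matrx = [dictionary.doc2bow(text) for text in doc_clean]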

Why LDA doesn't fit your case: You are searching for explicit topics like cricket, and you want to learn something about cricket vocabulary, correct? However, LDA will output some topics, and you need to recognise cricket vocabulary yourself in order to determine that, e.g., topic 5 is concerned with cricket. Oftentimes LDA will identify topics that are mixed with other, related topics. Keeping this in mind, there are three scenarios:

  1. You don't know anything about cricket, but you are able to identify the topic that's concerned with cricket.
  2. You are a cricket expert and already know the cricket vocabulary.
  3. You don't know anything about cricket and are not able to identify the semantic topic that the LDA produced.

In the first case, you will have the problem that you are likely to associate words with cricket that are actually not related to cricket, because you count on the LDA output to provide high-quality topics that are concerned only with cricket and contain no related topics or generic terms. In the second case, you don't need the analysis in the first place, because you already know the cricket vocabulary! The third case is likely when you are relying on your computer to interpret the topics. However, with LDA you always rely on humans to give a semantic interpretation of the output.
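
You can at least quantify how interpretable the topics are likely to be with a topic-coherence score, though it does not replace human judgement. A sketch using gensim's CoherenceModel (the 'c_v' measure is one common choice among several):

    from gensim.models import CoherenceModel

    # lda, doc_clean and dictionary as defined in the question's code
    cm = CoherenceModel(model=lda, texts=doc_clean,
                        dictionary=dictionary, coherence='c_v')
    print(cm.get_coherence())             # average coherence over all topics
    print(cm.get_coherence_per_topic())   # spot weak or mixed topics

Comparing the average coherence across several num_topics values is a common way to pick a less arbitrary topic count.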

So what to do: There's a paper called Targeted Topic Modeling for Focused Analysis (Wang et al., 2016), which tries to identify which documents are concerned with a pre-defined topic (like cricket). If you have a list of topics for which you'd like to get topic-specific vocabulary (cricket, basketball, romantic comedies, ...), a starting point could be to first identify the relevant documents and then analyse the word distributions of the documents related to each topic, as in the sketch below.
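
As a rough approximation of that idea (not the paper's actual model), you could filter the corpus with a small hand-picked seed list per target topic and analyse only the matching documents. The seed words here are hypothetical placeholders; in practice they would be a handful of known Urdu terms for the domain:

    from collections import Counter

    # Hypothetical seed words for the 'cricket' topic
    cricket_seeds = {'cricket', 'wicket', 'batsman'}

    # Keep only documents that mention at least one seed word
    cricket_docs = [doc for doc in doc_clean
                    if cricket_seeds & set(doc)]

    # The most frequent words in the targeted sub-corpus are
    # candidate domain-specific vocabulary
    counts = Counter(word for doc in cricket_docs for word in doc)
    print(counts.most_common(30))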

Note that there may be completely different methods that do exactly what you're looking for. But if you want to stay within the LDA-related literature, I'm relatively confident that the paper mentioned above is your best shot.

Edit: If this answer is useful to you, you may find my paper interesting too. It takes a labelled dataset of academic economics papers (600+ possible labels) and tries various LDA flavours to get the best predictions on new academic papers. The repo contains my code, documentation, and the paper itself.
