LDA gensim implementation, distance between two different docs

EDIT: I've found an interesting issue here. This link shows that gensim uses randomness in both the training and inference steps, and what it suggests is to set a fixed seed in order to get the same results every time. Why, however, am I getting the same probability for every topic?
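For reference, a minimal sketch of pinning the seed (an assumption on my part: random_state is available in newer gensim releases; on older versions, seeding NumPy's global RNG is the usual workaround, and corpus/dictionary below stand in for your own training data):

    import numpy as np
    from gensim.models import ldamodel

    # Older gensim versions draw from NumPy's global RNG, so seed it first.
    np.random.seed(42)

    # Newer versions also accept an explicit random_state when training;
    # 'corpus' and 'dictionary' are placeholders for your own training data.
    lda = ldamodel.LdaModel(corpus, id2word=dictionary,
                            num_topics=10, random_state=42)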

What I want to do is find each Twitter user's topics and calculate the similarity between Twitter users based on the similarity of their topics. Is there any way to calculate the same topics for every user in gensim, or do I have to calculate a dictionary of topics and cluster every user's topics?

In general, what is the best way to compare two Twitter users based on topic-model extraction in gensim? My code is the following:

    def preprocess(user_id):  # Returns the user's word list
        # user_corpus (defined elsewhere) dumps the user's tweets to a text file
        user_list = user_corpus(user_id, 'user_' + str(user_id) + '.txt')
        documents = []
        for line in open('user_' + str(user_id) + '.txt'):
            documents.append(line)
        # remove stop words
        lines = [line.rstrip() for line in open('stoplist.txt')]
        stoplist = set(lines)
        texts = [[word for word in document.lower().split() if word not in stoplist]
                 for document in documents]
        # remove words that appear fewer than 3 times
        all_tokens = sum(texts, [])
        tokens_rare = set(word for word in set(all_tokens) if all_tokens.count(word) < 3)
        texts = [[word for word in text if word not in tokens_rare]
                 for text in texts]
        # flatten the per-tweet token lists into a single word list for the user
        words = [word for text in texts for word in text]
        return words


    from gensim import corpora, models
    from gensim.models import ldamodel

    words1 = preprocess(14937173)
    words2 = preprocess(15386966)
    # Load the trained model and dictionary
    lda = ldamodel.LdaModel.load('tmp/fashion1.lda')
    dictionary = corpora.Dictionary.load('tmp/fashion1.dict')

    corpus = [dictionary.doc2bow(words1)]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    corpus_lda = lda[corpus_tfidf]

    list1 = []
    for item in corpus_lda:
        list1.append(item)

    print(lda.show_topic(0))

    corpus2 = [dictionary.doc2bow(words2)]
    tfidf2 = models.TfidfModel(corpus2)
    corpus_tfidf2 = tfidf2[corpus2]
    corpus_lda2 = lda[corpus_tfidf2]

    list2 = []
    for it in corpus_lda2:
        list2.append(it)

    # corpus_lda is a transformed corpus, not a model; show_topic() belongs to lda
    print(lda.show_topic(0))

Returned topic probabilities for the user corpus (when using a single list of all the user's words as the corpus):

 [(0, 0.10000000000000002), (1, 0.10000000000000002), (2, 0.10000000000000002),
  (3, 0.10000000000000002), (4, 0.10000000000000002), (5, 0.10000000000000002),
  (6, 0.10000000000000002), (7, 0.10000000000000002), (8, 0.10000000000000002),
  (9, 0.10000000000000002)]

In the case where I use a list of user tweets instead, I get back calculated topics for every tweet separately.

Question 2: Does the following make sense: training the LDA model with several Twitter users and then calculating the topics for every user (using each user's corpus) with the previously trained LDA model?

In the provided example, list1[0] returns a topic distribution with equal probabilities of 0.1. Basically, every line of text corresponds to a different tweet. If I calculate the corpus with corpus = [dictionary.doc2bow(text) for text in texts], it will give me the probabilities for every tweet separately. On the other hand, if I use corpus = [dictionary.doc2bow(words)] as in the example, the corpus will just be all of the user's words. In the second case, gensim returns the same probability for all topics, so for both users I get the same topic distribution.
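To make the two constructions concrete (a sketch; texts holds the per-tweet token lists and words the flattened list from preprocess above):

    # One bag-of-words vector per tweet: gensim infers a separate
    # topic distribution for each tweet.
    corpus_per_tweet = [dictionary.doc2bow(text) for text in texts]

    # One bag-of-words vector for the whole user: all tweets merged,
    # so gensim infers a single topic distribution for the user.
    corpus_per_user = [dictionary.doc2bow(words)]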

Should the user text corpus be a list of words or a list of sentences (i.e. a list of tweets)?

Regarding the implementation of Qi He and Jianshu Weng's TwitterRank approach, page 264 says: "we aggregate the tweets published by individual twitterer into a big document. Thus, each document corresponds to a twitterer." OK, I am confused: if a document is all of a user's tweets, then what should the corpus contain?
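For illustration, a sketch of that aggregation under an assumed toy input (the user_tweets dict below is hypothetical and not part of the original code):

    from gensim import corpora

    # Hypothetical toy input: each user id maps to that user's tokenized tweets.
    user_tweets = {
        14937173: [["fashion", "week", "paris"], ["new", "collection"]],
        15386966: [["street", "style"], ["fashion", "blog", "post"]],
    }

    # TwitterRank-style aggregation: merge each user's tweets into one big
    # document, so each document in the corpus corresponds to one twitterer.
    user_docs = [[word for tweet in tweets for word in tweet]
                 for tweets in user_tweets.values()]

    dictionary = corpora.Dictionary(user_docs)
    corpus = [dictionary.doc2bow(doc) for doc in user_docs]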

Jose Ramon, asked Oct 31 '22

2 Answers

According to the official documentation, Latent Dirichlet Allocation (LDA) is a transformation from bag-of-words counts into a topic space of lower dimensionality.

You can use LSI on top of TF-IDF, but not LDA. If you apply TF-IDF before LDA, the inferred topics will all come out with almost the same probability; you can print them and check.

Also see https://radimrehurek.com/gensim/tut2.html.
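This likely also explains the uniform 0.1 output above: with a one-document corpus every term occurs in every document, so its idf is zero, the TF-IDF vector comes out empty, and LDA falls back to its roughly uniform prior over the 10 topics. A minimal sketch of the intended pipelines (variable names assumed from the question):

    from gensim import models

    # LDA is trained and queried on raw bag-of-words counts -- no TF-IDF step.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)
    doc_topics = lda[dictionary.doc2bow(words1)]

    # LSI, by contrast, is commonly layered on top of TF-IDF weights.
    tfidf = models.TfidfModel(corpus)
    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=10)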

Hao Fu, answered Nov 15 '22


Check the following suggestion here as well. First you have to calculate the LDA model from all users, and then use the extracted vector of the unknown doc, which is calculated here as:

vec_bow = dictionary.doc2bow(doc.lower().split()) 
vec_lda = lda[vec_bow]

If you print vec_lda with print(vec_lda), you'll get the distribution of the unseen document over the LDA model's topics.
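From there, one way to compare two users (the original question) is to infer one vector per user and measure how close the two topic distributions are, e.g. with gensim's matutils helpers; a sketch, reusing words1/words2 from the question:

    from gensim import matutils

    # One inferred topic vector per user, from their aggregated words.
    vec_lda1 = lda[dictionary.doc2bow(words1)]
    vec_lda2 = lda[dictionary.doc2bow(words2)]

    # Cosine similarity: higher means more similar users.
    print(matutils.cossim(vec_lda1, vec_lda2))

    # Hellinger distance between the distributions: lower means more similar.
    print(matutils.hellinger(vec_lda1, vec_lda2))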

christosh, answered Nov 15 '22