
Wordcloud of bigram using Python

I am generating a word cloud directly from a text file using the wordcloud package in Python. Here is the code, which I am reusing from Stack Overflow:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS


def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    h = int(360.0 * 45.0 / 255.0)
    s = int(100.0 * 255.0 / 255.0)
    l = int(100.0 * float(random_state.randint(60, 120)) / 255.0)

    return "hsl({}, {}%, {}%)".format(h, s, l)

# Read the whole file; the with-block closes it afterwards
with open("xyz.txt") as f:
    file_content = f.read()

wordcloud = WordCloud(font_path = r'C:\Windows\Fonts\Verdana.ttf',
                            stopwords = STOPWORDS,
                            background_color = 'white',
                            width = 1200,
                            height = 1000,
                            color_func = random_color_func
                            ).generate(file_content)

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis('off')
plt.show()

It gives me a word cloud of single words. Is there a parameter in WordCloud() for n-grams that works without reformatting the text file?

I want a word cloud of bigrams, or of words joined by an underscore in the display, such as machine_learning (otherwise Machine and Learning would be two different words).

DreamerP asked Dec 03 '22 11:12


1 Answer

Bigram word clouds can be generated by lowering the collocation_threshold parameter of WordCloud, so that more word pairs score high enough to be kept as bigrams.

Edit the WordCloud call:

wordcloud = WordCloud(font_path = r'C:\Windows\Fonts\Verdana.ttf',
                            stopwords = STOPWORDS,
                            background_color = 'white',
                            width = 1200,
                            height = 1000,
                            color_func = random_color_func,
                            collocation_threshold = 3  # added to the code from your question; try values between 1 and 50
                            ).generate(file_content)

For more info:

collocation_threshold: int, default=30
    Bigrams must have a Dunning likelihood collocation score greater than this parameter to be counted as bigrams. Default of 30 is arbitrary.
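To get a feel for what that score measures, here is a rough sketch of Dunning's log-likelihood ratio for a word pair, written from the usual textbook formulation. The function names and the clamping constant are my own choices, and wordcloud's internal implementation may differ in details, so treat this as an illustration rather than the library's code.

```python
from math import log


def _log_likelihood(k, n, p):
    """Binomial log-likelihood of k hits in n trials (clamped to avoid log(0))."""
    return k * log(max(p, 1e-10)) + (n - k) * log(max(1 - p, 1e-10))


def dunning_score(count_bigram, count_first, count_second, n_words):
    """Log-likelihood ratio that two words co-occur more often than by chance.

    count_bigram  -- how often "first second" appears as a pair
    count_first   -- how often the first word appears
    count_second  -- how often the second word appears
    n_words       -- total number of words in the text
    """
    if n_words <= count_first or n_words <= count_second:
        return 0.0  # degenerate text: one word makes up everything

    # Null hypothesis: the second word is equally likely after any word.
    p = count_second / n_words
    # Alternative: its probability depends on whether the first word precedes it.
    p_after_first = count_bigram / count_first
    p_elsewhere = (count_second - count_bigram) / (n_words - count_first)

    null = (_log_likelihood(count_bigram, count_first, p)
            + _log_likelihood(count_second - count_bigram, n_words - count_first, p))
    alt = (_log_likelihood(count_bigram, count_first, p_after_first)
           + _log_likelihood(count_second - count_bigram, n_words - count_first, p_elsewhere))
    return -2 * (null - alt)
```

A pair that appears together far more often than its individual word counts would predict gets a large score, while a pair that co-occurs at the chance rate scores close to zero. Lowering collocation_threshold therefore lets weaker associations through as bigrams.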

You can also find the source code for wordcloud.WordCloud here: https://amueller.github.io/word_cloud/_modules/wordcloud/wordcloud.html
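If you specifically want the underscore-joined display from the question (machine_learning), one alternative is to count the bigrams yourself and hand them to WordCloud.generate_from_frequencies, which draws each key exactly as given. This is only a minimal sketch: bigram_frequencies and bigram_cloud are hypothetical helper names, and the tokenizer is a deliberately simple regex, not the one wordcloud uses.

```python
import re
from collections import Counter


def bigram_frequencies(text, stopwords=frozenset()):
    """Count adjacent word pairs, joined with an underscore."""
    # Simplified tokenizer (an assumption): lowercase runs of letters/apostrophes.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
    # Stopwords are dropped before pairing, so a pair can bridge a removed word.
    return Counter("_".join(pair) for pair in zip(words, words[1:]))


def bigram_cloud(text, **wordcloud_kwargs):
    """Build a WordCloud whose entries are underscore-joined bigrams."""
    # Imported here so bigram_frequencies() works without the package installed.
    from wordcloud import WordCloud, STOPWORDS

    freqs = bigram_frequencies(text, stopwords=STOPWORDS)
    return WordCloud(**wordcloud_kwargs).generate_from_frequencies(freqs)
```

You would then call something like bigram_cloud(file_content, background_color='white', width=1200, height=1000) and pass the result to plt.imshow as before. Every adjacent pair is counted here, with no collocation scoring, so frequent but uninteresting pairs will show up too.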

Himal answered Dec 27 '22 01:12