
Generate N-Grams from strings with pandas

I have a DataFrame df like this:

Pattern    String                                       
101        hi, how are you?
104        what are you doing?
108        Python is good to learn.

I want to create n-grams for the String column. I've created unigrams using split() and stack():

new= df.String.str.split(expand=True).stack()

However, I also want to create higher-order n-grams (bigrams, trigrams, four-grams, etc.).

Sandy asked Mar 07 '18 08:03





1 Answer

Do a little preprocessing on your text column, and then a little shifting + concatenation:

import pandas as pd

# generate unigrams: lowercase, drop everything except letters and whitespace,
# split on whitespace, then stack into one long Series
unigrams = (
    df['String'].str.lower()
                .str.replace(r'[^a-z\s]', '', regex=True)  # regex=True is required on pandas >= 2.0
                .str.split(expand=True)
                .stack())

# generate bigrams by concatenating each unigram with the next one
bigrams = unigrams + ' ' + unigrams.shift(-1)
# generate trigrams by concatenating each bigram with the unigram two positions ahead
trigrams = bigrams + ' ' + unigrams.shift(-2)

# concatenate all series vertically, and remove NaNs
pd.concat([unigrams, bigrams, trigrams]).dropna().reset_index(drop=True)

0                   hi
1                  how
2                  are
3                  you
4                 what
5                  are
6                  you
7                doing
8               python
9                   is
10                good
11                  to
12               learn
13              hi how
14             how are
15             are you
16            you what
17            what are
18             are you
19           you doing
20        doing python
21           python is
22             is good
23             good to
24            to learn
25          hi how are
26         how are you
27        are you what
28        you what are
29        what are you
30       are you doing
31    you doing python
32     doing python is
33      python is good
34          is good to
35       good to learn
dtype: object
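
Note that shifting works on the stacked Series as a whole, so some n-grams span row boundaries (for example `you what` and `doing python` in the output above). If that is not what you want, one possible variation is to build the n-grams row by row and then flatten. This is only a minimal sketch, assuming the same sample `df` as in the question; the helper name `row_ngrams` and the `ngrams` dict are made up for illustration:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Pattern": [101, 104, 108],
    "String": [
        "hi, how are you?",
        "what are you doing?",
        "Python is good to learn.",
    ],
})


def row_ngrams(tokens, n):
    """Return the n-grams of a single token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Same preprocessing as above, but keep one token list per row.
tokens = (
    df["String"].str.lower()
                .str.replace(r"[^a-z\s]", "", regex=True)
                .str.split()
)

# Build 1-grams to 4-grams per row, then flatten each into one Series.
# Rows shorter than n yield an empty list, which explode() turns into NaN.
ngrams = {
    n: (tokens.apply(lambda t: row_ngrams(t, n))
              .explode()
              .dropna()
              .reset_index(drop=True))
    for n in range(1, 5)
}

print(ngrams[2])
```

Here `ngrams[2]` contains `you doing` but not `you what`, because each row is processed separately. Changing the `range` gives other n-gram orders.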
cs95 answered Sep 28 '22 07:09
