Scikit Learn - Extract word tokens from a string delimiter using CountVectorizer

I have a list of strings. If a string contains the '#' character, I want to extract the first part of the string and count word-token frequencies from that part only. For example, if the string is "first question # on stackoverflow", the expected tokens are "first" and "question".

If the string does not contain '#' then return tokens of the whole string.

To compute the term-document matrix I am using CountVectorizer from scikit-learn.

Find below my code:

from sklearn.feature_extraction.text import CountVectorizer

class MyTokenizer(object):
    def __call__(self, s):
        if s.find('#') == -1:
            return s
        else:
            return s.split('#')[0]

def FindKmeans():
    text = ["first ques # on stackoverflow", "please help"]
    vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
    pos_vector = vec.fit_transform(text).toarray()
    print(vec.get_feature_names())

Output: [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']

Expected output: [u'first', u'ques', u'please', u'help']
Asked by Rashmi Singh, Aug 02 '16


2 Answers

You could split on your separator ('#') at most once and take the first part of the split.

from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    return [text.split('#', 1)[0].strip()]

text = ["first ques # on stackoverflow", "please help"]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(text).toarray()
vocab = vec.get_feature_names()

required_list = []
for word in vocab:
    required_list.extend(word.split())
print(required_list)

#['first', 'ques', 'please', 'help']
Answered by Nickil Maveli, Sep 24 '22

The problem lies with your tokenizer: you've split the string into the part you want to keep and the part you want to discard, but you haven't split that part into words, so CountVectorizer iterates over the returned string character by character. Try the tokenizer below.

class MyTokenizer(object):
    def __call__(self, s):
        # split() with no argument also drops the empty strings that
        # s.split(' ') would leave behind around the '#'
        if s.find('#') == -1:
            return s.split()
        else:
            return s.split('#')[0].split()
Answered by piman314, Sep 25 '22