I have a list of strings. If a string contains the '#' character, I want to keep only the part before the '#' and get the frequency count of word tokens from that part only. For example, if the string is "first question # on stackoverflow", the expected tokens are "first" and "question".
If the string does not contain '#', I want the tokens of the whole string.
To compute the term-document matrix I am using CountVectorizer from scikit-learn. Here is my code:
class MyTokenizer(object):
    def __call__(self, s):
        if s.find('#') == -1:
            return s
        else:
            return s.split('#')[0]
def FindKmeans():
    text = ["first ques # on stackoverflow", "please help"]
    vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
    pos_vector = vec.fit_transform(text).toarray()
    print(vec.get_feature_names())
Output: [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']
Expected output: [u'first', u'ques', u'please', u'help']
You could split on your separator ('#') at most once and take the first part of the split.
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    # keep only the part before '#' as a single "token"
    return [text.split('#', 1)[0].strip()]

text = ["first ques # on stackoverflow", "please help"]
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(text).toarray()
vocab = vec.get_feature_names()

# each vocabulary entry is a phrase, so split it into words
required_list = []
for word in vocab:
    required_list.extend(word.split())

print(required_list)
# ['first', 'ques', 'please', 'help']
The problem lies with your tokenizer: you've split the string into the part you want to keep and the part you don't, but you've never split it into words, so CountVectorizer ends up counting individual characters instead of tokens. Try the tokenizer below.
class MyTokenizer(object):
    def __call__(self, s):
        if s.find('#') == -1:
            return s.split(' ')
        else:
            return s.split('#')[0].split(' ')
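A quick end-to-end check of this fix (a sketch; I've used `split()` with no argument, which both splits on whitespace and discards the empty token that the trailing space before '#' would otherwise leave behind):

```python
from sklearn.feature_extraction.text import CountVectorizer

class MyTokenizer(object):
    def __call__(self, s):
        # Keep only the part before '#' (if any), then split into words.
        # str.split() with no argument drops empty tokens.
        return s.split('#', 1)[0].split()

text = ["first ques # on stackoverflow", "please help"]
vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
vec.fit_transform(text)
print(sorted(vec.vocabulary_))  # feature names, version-agnostic
# ['first', 'help', 'please', 'ques']
```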