Scikit Learn - Extract word tokens from a string delimiter using CountVectorizer

I have a list of strings. If a string contains the '#' character, I want to extract the first part of the string and count word-token frequencies from that part only. For example, if the string is "first question # on stackoverflow", the expected tokens are "first" and "question".

If the string does not contain '#' then return tokens of the whole string.

To compute the term-document matrix I am using CountVectorizer from scikit-learn.

Find below my code:

from sklearn.feature_extraction.text import CountVectorizer

class MyTokenizer(object):
    def __call__(self, s):
        if s.find('#') == -1:
            return s
        else:
            return s.split('#')[0]

def FindKmeans():
    text = ["first ques # on stackoverflow", "please help"]
    vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
    pos_vector = vec.fit_transform(text).toarray()
    print(vec.get_feature_names())

Output: [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']

Expected output: [u'first', u'ques', u'please', u'help']
Asked by Rashmi Singh, Aug 02 '16


2 Answers

You could split on your separator ('#') at most once and take the first part of the split.

from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    return [text.split('#', 1)[0].strip()]

text = ["first ques # on stackoverflow", "please help"]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(text).toarray()
vocab = vec.get_feature_names()

required_list = []
for word in vocab:
    required_list.extend(word.split())
print(required_list)

#['first', 'ques', 'please', 'help']
Answered by Nickil Maveli, Sep 24 '22

The problem lies with your tokenizer: you've split the string into the part you want to keep and the part you want to discard, but you haven't split that part into words, so CountVectorizer iterates over the returned string character by character. Try the tokenizer below.

class MyTokenizer(object):
    def __call__(self, s):
        # split() with no argument also drops the empty strings that
        # s.split(' ') would leave behind around the '#'
        if s.find('#') == -1:
            return s.split()
        else:
            return s.split('#')[0].split()
Answered by piman314, Sep 25 '22