Pass tokens to CountVectorizer

Question

I have a text classification problem with two types of features:

features which are n-grams (extracted by CountVectorizer)
other textual features (e.g. presence of a word from a given lexicon). These features are different from n-grams since they should be a part of any n-gram extracted from the text.

Both types of features are extracted from the text's tokens. I want to run tokenization only once, and then pass these tokens both to CountVectorizer and to the other presence-feature extractor. So I want to pass a list of tokens to CountVectorizer, but it only accepts a string as the representation of a sample. Is there a way to pass an array of tokens?

vladkha · Accepted Answer

Summarizing the answers of @user126350 and @miroli and this link:

from sklearn.feature_extraction.text import CountVectorizer

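# Identity function: the documents are already tokenized, so return them unchanged.
def dummy(doc):
    return doc

cv = CountVectorizer(
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None,  # unused when a tokenizer is given; None avoids a warning
)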

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]

cv.fit(docs)
cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
# array(['.', 'again', 'hello', 'world'], dtype=object)

One thing to keep in mind: wrap a new tokenized document in a list before calling transform(), so that it is handled as a single document rather than each token being interpreted as a separate document:

new_doc = ['again', 'hello', 'world', '.']
v_1 = cv.transform(new_doc)
v_2 = cv.transform([new_doc])

v_1.shape
# (4, 4)

v_2.shape
# (1, 4)
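
To connect this back to the question, here is one possible way to tokenize once and feed the same tokens to both CountVectorizer and a lexicon-presence extractor. This is only a sketch under stated assumptions: tokenize, LEXICON, lexicon_presence and identity are hypothetical names invented for the example, the whitespace tokenizer stands in for whatever tokenizer you actually use, and the two feature blocks are simply stacked side by side with scipy.sparse.hstack.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lexicon, for illustration only.
LEXICON = {'hello', 'again'}

def identity(doc):
    # Documents are already lists of tokens; pass them through unchanged.
    return doc

def tokenize(text):
    # Placeholder tokenizer (assumption): split on whitespace.
    return text.split()

def lexicon_presence(token_docs, lexicon):
    # One binary column: 1 if the document contains any lexicon word, else 0.
    rows = [[int(any(tok in lexicon for tok in doc))] for doc in token_docs]
    return csr_matrix(rows)

texts = ['hello world .', 'goodbye world', 'again hello world']
token_docs = [tokenize(t) for t in texts]  # tokenization runs only once

cv = CountVectorizer(
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,  # not used when a tokenizer is supplied
)
ngram_features = cv.fit_transform(token_docs)
lexicon_features = lexicon_presence(token_docs, LEXICON)

# Combine both feature blocks into a single sparse matrix.
X = hstack([ngram_features, lexicon_features]).tocsr()
print(X.shape)
```

The last column of X holds the lexicon feature; the preceding columns are the token counts in the vectorizer's vocabulary order.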
