
CountVectorizer converts words to lower case

In my classification model, I need to preserve uppercase letters, but when I use sklearn's CountVectorizer to build the vocabulary, uppercase letters are converted to lowercase!

To rule out implicit tokenization, I wrote a tokenizer that just splits the text on whitespace and does nothing else.

My code:

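import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

co = dict()

def tokenizeManu(txt):
    # Split on whitespace only; no other processing.
    return txt.split()

def corpDict(x):
    print('1: ', x)
    count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu)
    countFit = count.fit_transform(x)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    vocab = count.get_feature_names_out()
    dist = np.sum(countFit.toarray(), axis=0)
    # Use a separate loop variable so the vectorizer `count` is not shadowed.
    for tag, n in zip(vocab, dist):
        co[str(tag)] = int(n)

x = ['I\'m John Dev', 'We are the only']

corpDict(x)
print(co)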

The output:

1:  ["I'm John Dev", 'We are the only'] #<- before building the vocab.
{'john': 1, 'the': 1, 'we': 1, 'only': 1, 'dev': 1, "i'm": 1, 'are': 1} #<- after
Minions, asked Mar 20 '18 09:03


1 Answer

As explained in the documentation, CountVectorizer has a parameter lowercase that defaults to True. The lowercasing is applied to the raw text before tokenizing, which is why a custom tokenizer does not prevent it. To disable this behavior, set lowercase=False as follows:

count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)
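To see why the custom tokenizer alone does not help, here is a minimal pure-Python sketch of the order of operations. It is a simplified stand-in, not scikit-learn's actual implementation: the helper names whitespace_tokenizer and count_tokens are hypothetical, and it only assumes that lowercasing runs before the tokenizer is called.

```python
from collections import Counter

def whitespace_tokenizer(text):
    # Same idea as tokenizeManu in the question: split on whitespace only.
    return text.split()

def count_tokens(docs, tokenizer, lowercase=True):
    # Hypothetical, simplified stand-in for the vectorizer's analysis order:
    # 1) preprocess (optionally lowercase), 2) tokenize, 3) count.
    counts = Counter()
    for doc in docs:
        if lowercase:
            doc = doc.lower()  # runs before the tokenizer ever sees the text
        counts.update(tokenizer(doc))
    return dict(counts)

docs = ["I'm John Dev", "We are the only"]
default_counts = count_tokens(docs, whitespace_tokenizer)
cased_counts = count_tokens(docs, whitespace_tokenizer, lowercase=False)

print(sorted(default_counts))  # ['are', 'dev', "i'm", 'john', 'only', 'the', 'we']
print(sorted(cased_counts))    # ['Dev', "I'm", 'John', 'We', 'are', 'only', 'the']
```

With the default flag, the tokenizer only ever receives already-lowercased text, so the case information is gone before it can be preserved. Turning the flag off is the only change needed.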
Mohamed Ali JAMAOUI, answered Sep 24 '22 11:09