
Empty vocabulary for single letter by CountVectorizer

I'm trying to convert a string into a numeric vector:

import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string
def names_to_words(names):
    print('a')
    words = re.sub("[^a-zA-Z]"," ",names).lower().split()
    print('b')

    return words


### Vectorization
def Vectorizer():
    Vectorizer= CountVectorizer(
                analyzer = "word",  
                tokenizer = None,  
                preprocessor = None, 
                stop_words = None,  
                max_features = 5000)
    return Vectorizer  


### Test a string
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()

But when the cleaned input is a list of single-letter strings, such as:

 ['g', 'o', 'm', 'd']

I get this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

It seems there is a problem with single-letter strings like these. What should I do? Thanks.

LookIntoEast asked Apr 25 '17 04:04


1 Answer

The default token_pattern regexp in CountVectorizer only selects words that have at least 2 characters, as stated in the documentation:

token_pattern : string

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

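In the source code of CountVectorizer, the default is r"(?u)\b\w\w+\b". Every token in your input is a single letter, so nothing matches and the vocabulary ends up empty.

Change it to r"(?u)\b\w+\b" to include 1-letter words.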
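
To see the difference between the two patterns outside of scikit-learn, here is a minimal sketch that uses only Python's re module. The variable names are illustrative and not part of the original code; the patterns are the two quoted above.

```python
import re

# The two patterns discussed above; the variable names are illustrative.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # two or more word characters
SINGLE_PATTERN = r"(?u)\b\w+\b"     # one or more word characters

docs = ["g", "o", "m", "d"]

# Tokens that each pattern extracts from every document.
default_tokens = [re.findall(DEFAULT_PATTERN, d) for d in docs]
single_tokens = [re.findall(SINGLE_PATTERN, d) for d in docs]

print(default_tokens)  # every list is empty, hence the empty vocabulary
print(single_tokens)   # each single letter is kept as a token
```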

Change your code to the following (include the token_pattern parameter with above suggestion):

Vectorizer= CountVectorizer(
                analyzer = "word",  
                tokenizer = None,  
                preprocessor = None, 
                stop_words = None,  
                max_features = 5000,
                token_pattern = r"(?u)\b\w+\b")
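
As a quick check, the following sketch fits a vectorizer on the single-letter list from the question, once with the default settings and once with the relaxed token_pattern. It assumes scikit-learn is installed; the names docs, default_failed, vectorizer and features are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["g", "o", "m", "d"]

# Default token_pattern: no token has two or more characters,
# so fitting raises ValueError (empty vocabulary).
try:
    CountVectorizer().fit_transform(docs)
    default_failed = False
except ValueError:
    default_failed = True

# Relaxed token_pattern: single-character tokens are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
features = vectorizer.fit_transform(docs).toarray()

print(default_failed)
print(sorted(vectorizer.vocabulary_))  # the learned single-letter terms
print(features.shape)                  # one row per document, one column per term
```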

Vivek Kumar answered Oct 18 '22 17:10