I am trying to convert a string into a numeric vector:
### Clean the string

```python
import re

def names_to_words(names):
    # Keep letters only, lowercase, and split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    return words
```
### Vectorization

```python
from sklearn.feature_extraction.text import CountVectorizer

def Vectorizer():
    vectorizer = CountVectorizer(
        analyzer="word",
        tokenizer=None,
        preprocessor=None,
        stop_words=None,
        max_features=5000)
    return vectorizer
```
### Test a string

```python
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()
```
But when the cleaned string comes out as:

```python
['g', 'o', 'm', 'd']
```

I get this error:

```
ValueError: empty vocabulary; perhaps the documents only contain stop words
```

It seems there is a problem with such single-letter words. What should I do? Thanks.
The default `token_pattern` regexp in CountVectorizer selects words that have at least 2 characters, as stated in the documentation:

> **token_pattern** : string
>
> Regular expression denoting what constitutes a "token", only used if `analyzer == 'word'`. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
In the source code of CountVectorizer it is `r"(?u)\b\w\w+\b"`. Change it to `r"(?u)\b\w+\b"` to include single-letter words.
Change your code to the following (pass the `token_pattern` parameter with the above suggestion):

```python
Vectorizer = CountVectorizer(
    analyzer="word",
    tokenizer=None,
    preprocessor=None,
    stop_words=None,
    max_features=5000,
    token_pattern=r"(?u)\b\w+\b")
```