Empty vocabulary for single letter by CountVectorizer

I'm trying to convert a string into a numeric vector:

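import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string
def names_to_words(names):
    # Replace non-letters with spaces, lowercase, then split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    return words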


### Vectorization
def Vectorizer():
    Vectorizer = CountVectorizer(
        analyzer="word",
        tokenizer=None,
        preprocessor=None,
        stop_words=None,
        max_features=5000)
    return Vectorizer


### Test a string
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()

But when the cleaned input is:

 ['g', 'o', 'm', 'd']

I get this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

It seems there's a problem with single-letter strings like these. What should I do? Thanks.

asked Apr 25 '17 04:04

LookIntoEast

1 Answer

The default token_pattern regexp in CountVectorizer only selects words that have at least 2 characters, as stated in the documentation:

token_pattern : string

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

From the source code of CountVectorizer, the default is r"(?u)\b\w\w+\b".

Change it to r"(?u)\b\w+\b" to include one-letter words.
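
The difference between the two patterns can be checked with the standard `re` module alone. This is a minimal sketch that only mimics the token-matching step, using the list from the question as input; the variable names are made up for illustration, and CountVectorizer itself does more (for example, lowercasing by default).

```python
import re

# Default CountVectorizer token pattern: tokens of 2 or more word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# Relaxed pattern: tokens of 1 or more word characters.
relaxed_pattern = re.compile(r"(?u)\b\w+\b")

# Each single letter is treated as its own document, as in the question.
docs = ['g', 'o', 'm', 'd']

default_tokens = [default_pattern.findall(d) for d in docs]
relaxed_tokens = [relaxed_pattern.findall(d) for d in docs]

print(default_tokens)  # [[], [], [], []]
print(relaxed_tokens)  # [['g'], ['o'], ['m'], ['d']]
```

With the default pattern every document produces no tokens at all, which is why the vocabulary ends up empty.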

Change your code to the following (adding the token_pattern parameter with the pattern suggested above):

Vectorizer = CountVectorizer(
    analyzer="word",
    tokenizer=None,
    preprocessor=None,
    stop_words=None,
    max_features=5000,
    token_pattern=r"(?u)\b\w+\b")
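
As a quick end-to-end check, the sketch below fits the input from the question with and without the custom token_pattern. It assumes scikit-learn is installed; the names `docs`, `default_failed`, and `features` are illustrative only, and the other constructor arguments are left at their defaults for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['g', 'o', 'm', 'd']  # the input from the question

# With the default token_pattern no document yields a token,
# so fitting raises the "empty vocabulary" ValueError.
try:
    CountVectorizer().fit_transform(docs)
    default_failed = False
except ValueError:
    default_failed = True

# With the relaxed pattern, single-letter tokens are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
features = vectorizer.fit_transform(docs).toarray()

print(default_failed)                  # True
print(sorted(vectorizer.vocabulary_))  # ['d', 'g', 'm', 'o']
print(features.shape)                  # (4, 4)
```

Each row of `features` corresponds to one input string and each column to one vocabulary entry, so every row here contains a single 1.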

answered Oct 18 '22 17:10

Vivek Kumar

Tags: python, vectorization, nlp, feature-extraction, countvectorizer
