I have a list of strings that look like this:
df_train = ["Hello John-Smith it is nine o'clock", "This is a completely random-sequence"]
I would like sklearn's TfidfVectorizer to treat words joined by a hyphen as a single word. When I apply the following code, words separated by a hyphen (or other punctuation) are treated as separate words:
vectorizer_train = TfidfVectorizer(analyzer = 'word',
min_df=0.0,
max_df = 1.0,
strip_accents = None,
encoding = 'utf-8',
preprocessor=None,
token_pattern=r"(?u)\b\w\w+\b")
vectorizer_train.fit_transform(df_train)
vectorizer_train.get_feature_names_out()
I have tried changing the token_pattern parameter, but with no success. Any idea how I could solve this? In addition, is it possible to treat words separated by any punctuation as a single token? (e.g. 'Hi.there How_are you:doing')
It seems you need to split on whitespace only. Try switching the pattern to r"(?u)\S\S+", which captures each run of consecutive non-whitespace characters as a single token:
df_train = ["Hello John-Smith it is nine o'clock",
"This is a completely random-sequence",
"Hi.there How_are you:doing"]
vectorizer_train = TfidfVectorizer(analyzer = 'word',
min_df=0.0,
max_df = 1.0,
strip_accents = None,
encoding = 'utf-8',
preprocessor=None,
token_pattern=r"(?u)\S\S+")
vectorizer_train.fit_transform(df_train)
vectorizer_train.get_feature_names_out()
gives:
['completely',
'hello',
'hi.there',
'how_are',
'is',
'it',
'john-smith',
'nine',
"o'clock",
'random-sequence',
'this',
'you:doing']
To keep only hyphenated compounds together (still splitting on other punctuation), you can use r"(?u)\b\w[\w-]*\w\b", which gives:
['clock',
'completely',
'doing',
'hello',
'hi',
'how_are',
'is',
'it',
'john-smith',
'nine',
'random-sequence',
'there',
'this',
'you']