
TfidfVectorizer to respect hyphenated compounds (words that are joined with a hyphen)

I have a list of strings that look like this:

df_train = ["Hello John-Smith it is nine o'clock", "This is a completely random-sequence"]

I would like sklearn's TfidfVectorizer to treat words joined with a hyphen as a single word. When I apply the following code, words joined by a hyphen (or other punctuation) are split into separate words:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\b\w\w+\b")

vectorizer_train.fit_transform(df_train)
vectorizer_train.get_feature_names()  # scikit-learn >= 1.0: get_feature_names_out()

I have tried changing the token_pattern parameter, but without success. Any idea how I could solve this? In addition, is it possible to treat words joined by any punctuation as a single token? (e.g. 'Hi.there How_are you:doing')

Alejandro asked Nov 23 '25 22:11


1 Answer

It seems you need to split on whitespace only. Try switching the pattern to (?u)\S\S+, which captures runs of two or more consecutive non-whitespace characters as a single token (single-character tokens such as 'a' are still dropped):

from sklearn.feature_extraction.text import TfidfVectorizer

df_train = ["Hello John-Smith it is nine o'clock",
            "This is a completely random-sequence",
            "Hi.there How_are you:doing"]

vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\S\S+")

vectorizer_train.fit_transform(df_train)
vectorizer_train.get_feature_names()  # scikit-learn >= 1.0: get_feature_names_out()

gives:

['completely',
 'hello',
 'hi.there',
 'how_are',
 'is',
 'it',
 'john-smith',
 'nine',
 "o'clock",
 'random-sequence',
 'this',
 'you:doing']
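
If it helps to see why the default pattern splits on the hyphen, here is a minimal sketch that uses only the standard re module. It assumes the 'word' analyzer behaves roughly like lowercasing followed by re.findall with token_pattern; the tokenize helper is a hypothetical name for illustration, not part of scikit-learn:

```python
import re

# Same sample strings as in the answer above.
docs = ["Hello John-Smith it is nine o'clock",
        "This is a completely random-sequence",
        "Hi.there How_are you:doing"]


def tokenize(text, token_pattern):
    # Hypothetical helper: approximates the 'word' analyzer by
    # lowercasing the text and collecting every regex match.
    return re.findall(token_pattern, text.lower())


# Default pattern: \b marks a word boundary, and '-' is not a word
# character, so "john-smith" is split into two tokens.
default_tokens = [tokenize(d, r"(?u)\b\w\w+\b") for d in docs]

# Whitespace-only pattern: any run of 2+ non-space characters is one token.
whitespace_tokens = [tokenize(d, r"(?u)\S\S+") for d in docs]

for before, after in zip(default_tokens, whitespace_tokens):
    print(before, "->", after)
```

Trying a pattern this way before passing it as token_pattern is a quick way to check what the vectorizer is likely to see.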

To keep only hyphenated compounds together, while still splitting on other punctuation, you can set token_pattern to (?u)\b\w[\w-]*\w\b, which gives:

['clock',
 'completely',
 'doing',
 'hello',
 'hi',
 'how_are',
 'is',
 'it',
 'john-smith',
 'nine',
 'random-sequence',
 'there',
 'this',
 'you']
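
The hyphen-only pattern can be checked the same way without scikit-learn. This is a rough sketch, again assuming the tokenizer amounts to lowercasing plus re.findall; HYPHEN_PATTERN and vocabulary are illustrative names only:

```python
import re

docs = ["Hello John-Smith it is nine o'clock",
        "This is a completely random-sequence",
        "Hi.there How_are you:doing"]

# A word character, then any mix of word characters and hyphens,
# then a closing word character, all between word boundaries.
HYPHEN_PATTERN = r"(?u)\b\w[\w-]*\w\b"

# Sorted set of unique tokens, similar in spirit to the fitted vocabulary.
vocabulary = sorted({token
                     for doc in docs
                     for token in re.findall(HYPHEN_PATTERN, doc.lower())})
print(vocabulary)
```

Because the pattern needs at least two word characters, "a" and the "o" of "o'clock" are dropped, and the underscore in "how_are" survives because \w matches underscores.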

Psidom answered Nov 25 '25 12:11