Sentence embeddings for extremely short texts (1-3 words/sentence)

I have some extremely short texts that come from bank transactions (80% of the dataset has fewer than 3 words), and I want to classify them into ~90,000 classes (suppliers). Since the text comes from bank transactions, many words are not exact but are truncated, misspelled, etc., so BoW is not optimal: words such as facebk, facebook and faceboo are treated as different words. FastText is a good candidate for creating embeddings, and word for word it produces some very good similarities.

The issue arises when I want to create a sentence embedding. Since bank transactions often contain a lot of "stop-words", e.g. facebook/zg181239 202392 or MC 102930, it is extremely difficult to remove all the "rubbish" words.

Using the mean of all the word vectors as a sentence embedding does not work very well: some transactions contain rubbish words, and since the sentences are so short those "noise" words have a big influence, e.g. faceboo 1230jaisd and facebook 1231jaikj end up not very similar in the "sentence" space.
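To make this concrete, here is a minimal sketch (using gensim's FastText; the toy corpus, hyperparameters and ID tokens are made up) of how the word-level similarities can look good while a naive mean-pooled "sentence" vector gets diluted by the random ID tokens:

```python
import numpy as np
from gensim.models import FastText

# Toy stand-in for tokenized bank-transaction texts (hypothetical data).
corpus = [
    ["facebook", "1231jaikj"],
    ["facebk", "zg181239"],
    ["faceboo", "1230jaisd"],
    ["netflix", "mc", "102930"],
]

# The character n-grams (min_n..max_n) are what make truncations/typos
# land near each other in the vector space.
model = FastText(corpus, vector_size=100, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=50)

# Word for word the similarities are good: facebk/faceboo share n-grams with facebook.
print(model.wv.similarity("facebook", "facebk"))
print(model.wv.similarity("facebook", "faceboo"))

def mean_embedding(tokens):
    """Naive sentence embedding: the mean of the FastText word vectors
    (OOV tokens are synthesized from their character n-grams)."""
    return np.mean([model.wv[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With only two tokens each, the arbitrary ID tokens carry half the weight and
# pull the two 'facebook' transactions apart in the averaged space.
emb_a = mean_embedding(["faceboo", "1230jaisd"])
emb_b = mean_embedding(["facebook", "1231jaikj"])
print(cosine(emb_a, emb_b))
```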

Is there a way to make some kind of sentence embedding that could handle this?

EDIT:

I tackled it with the SentenceTransformers library: I fine-tuned an existing model on (text, supplier) pairs using the MultipleNegativesRanking (MNR) loss and used those embeddings for downstream tasks. It works very well!
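A minimal sketch of that setup, assuming the classic sentence-transformers fit() API; the base model name, batch size and example pairs below are placeholders, not the ones I actually used:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# (transaction_text, supplier) pairs are the positives for the MNR loss;
# the other pairs in each batch act as in-batch negatives automatically.
train_examples = [
    InputExample(texts=["facebk zg181239 202392", "Facebook"]),
    InputExample(texts=["faceboo 1230jaisd", "Facebook"]),
    InputExample(texts=["netflx mc 102930", "Netflix"]),
    # ... many more (text, supplier) pairs
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example base model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)

# The fine-tuned embeddings are then used for downstream supplier matching/classification.
embeddings = model.encode(["facebook 1231jaikj", "spotify ab 99812"])
```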

asked Oct 25 '25 by CutePoison


1 Answer

With 90,000 classes and lots of thin/junk data, there may not be enough of a pattern there to learn a good classifier. (Do the classes at least have many diverse examples each?)

Still, some ideas:

  1. If certain tokens, or even token fragments, are reliable indicators that items must/must-not be from certain suppliers, a classifier based on the more-discrete features of a bag-of-words, or bag-of-character-n-grams, representation might do better. (Or, those features might help in the places where the continuous word-vectors, alone or when combined into even fuzzier sums/means, have the greatest problems. That is, even your typo examples facebk and faceboo still include the 4-character n-gram face, and the binary presence/absence of that fragment in a sparse BoW/bag-of-n-grams representation may be a clearer signal to a downstream classifier than the continuous FastText n-gram vector learned for face after it has been mixed into the OOV vector synthesis and then into a multi-token average. A character-n-gram sketch appears after this list.)

  2. If you suspect it's the particular 'noise' words causing problems – only very rarely helping a little – you could try things like:

    a. discarding all tokens below a certain frequency, with that threshold determined via experimentation;

    b. down-weighting the vectors for tokens that are below a certain frequency, or that have been OOV-synthesized via FastText's n-grams approach, compared to the vectors for more-common tokens, when averaging them together;

    c. representing each item as the concatenation of the average of the frequent/known tokens and a separate average of the sketchy/lower-frequency/OOV-synthesized tokens - so that the tokens you see as junk/noise don't necessarily dilute the better tokens, but still contribute for the tougher cases. (You might also try feature sub-segments based on all-alpha, all-numeric, or other subsets of the tokens. A sketch of this split-average idea appears after this list.)

  3. Both FastText itself & any downstream classifiers may have parameter-tuning options that could help, especially given that your data isn't true natural-language, and thus will have token/subword frequencies/patterns quite different from usual situations. (As one example: lots of arbitrary numbers are highly unlikely to have the same sort of vaguely-meaningful substrings as natural language, so maybe far higher or lower values of the bucket parameter would work better, or keeping most numeric info out of the word-vector modeling entirely.)
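As a sketch of suggestion (1), character n-gram features plus a simple linear classifier in scikit-learn; the vectorizer settings and toy data here are illustrative guesses, not tuned values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Tiny toy data standing in for (transaction text, supplier) pairs.
texts = ["facebk zg181239 202392", "faceboo 1230jaisd", "netflx mc 102930"]
suppliers = ["Facebook", "Facebook", "Netflix"]

clf = make_pipeline(
    # Character n-grams within word boundaries: 'facebk', 'faceboo' & 'facebook'
    # all contain fragments like 'face', so typos still trigger the same sparse features.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=1),
    SGDClassifier(),  # linear classifier that can scale to many classes (one-vs-rest)
)
clf.fit(texts, suppliers)

# A new, differently-mangled string still shares the discriminative fragments.
print(clf.predict(["facebook 1231jaikj"]))
```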
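And a sketch of suggestions (2b)/(2c), splitting the pooled representation so rare or OOV-synthesized tokens don't dilute the frequent ones; it assumes a gensim FastText model's wv (as in the question), and the min_count threshold and split rule are guesses to tune against your data:

```python
import numpy as np

def split_average(tokens, wv, min_count=5):
    """Concatenate the mean vector of frequent, in-vocabulary tokens with a
    separate mean of rare/OOV-synthesized tokens, so presumed noise tokens
    don't dilute the informative ones but can still contribute."""
    frequent, sketchy = [], []
    for t in tokens:
        # In gensim's FastText, only true vocabulary words appear in key_to_index;
        # anything else gets a vector synthesized from its character n-grams.
        if t in wv.key_to_index and wv.get_vecattr(t, "count") >= min_count:
            frequent.append(wv[t])
        else:
            sketchy.append(wv[t])
    zeros = np.zeros(wv.vector_size)
    freq_part = np.mean(frequent, axis=0) if frequent else zeros
    rare_part = np.mean(sketchy, axis=0) if sketchy else zeros
    return np.concatenate([freq_part, rare_part])

# e.g. features = split_average(["facebook", "1231jaikj"], fasttext_model.wv)
```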

answered Oct 28 '25 by gojomo