
Stanford GloVe's lack of punctuation?

I understand that GloVe trains vectors from how often words co-occur, but how come commas and periods are not included? For anything NLP, it seems like a vector representation of punctuation would be an important feature to have. I realize that an analogy like (king - man + woman ≈ queen) would make no sense as (word - , = ?), but is there a way to represent punctuation marks and numbers?

Is there a pre-made data set that includes such things? Would this even work?

I tried training GloVe on my own data set, but I ran into a problem separating the punctuation from the adjacent words (with a blank space) so that each mark becomes its own token.
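For the separation step, one possible approach is a regular expression that pulls out words, numbers, and individual punctuation marks, then rejoins them with single spaces so that each one is a separate whitespace-delimited token. This is only a minimal sketch: the function name `separate_punctuation` and the token pattern are assumptions made for illustration, not part of GloVe, and the pattern will also split contractions such as "don't" into three tokens.

```python
import re

# Hypothetical token pattern (an assumption, not GloVe's own tokenizer):
#   1. a number, optionally with a decimal part (kept as one token)
#   2. a run of word characters
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def separate_punctuation(text: str) -> str:
    """Return text with every word, number, and punctuation mark
    separated by exactly one space."""
    return " ".join(TOKEN_PATTERN.findall(text))


print(separate_punctuation("It costs 3.14, right?"))
```

Running every line of a corpus through a function like this before training should leave punctuation marks as standalone tokens that can receive their own vectors.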

Nate Cook3 asked Nov 27 '25 10:11



1 Answer

Pre-trained GloVe vectors do include punctuation; what makes you think they don't? At least the Wikipedia 2014 + Gigaword 5 (6B tokens) set from http://nlp.stanford.edu/projects/glove/ has embeddings for ",", ".", "-" and other punctuation marks. Just download those word vectors and verify it yourself: they are in plain text format, so it's easy to do.
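The check can be sketched in a few lines of Python. This assumes the usual plain-text layout of one token per line followed by its space-separated vector components. The helper name `lookup_tokens` is hypothetical, and the numbers in the in-memory sample are made-up placeholders, not real GloVe values.

```python
import io


def lookup_tokens(lines, wanted):
    """Scan GloVe-style text lines and return {token: vector}
    for every token in `wanted` that is present."""
    found = {}
    for line in lines:
        # The token is everything before the first space.
        token, _, rest = line.rstrip("\n").partition(" ")
        if token in wanted:
            found[token] = [float(x) for x in rest.split()]
    return found


# Toy stand-in for a downloaded vector file (placeholder values only).
sample = io.StringIO(", 0.1 0.2\n. 0.3 0.4\nthe 0.5 0.6\n")
print(sorted(lookup_tokens(sample, {",", ".", "-"})))
```

With the real download, the same function could be given an open file instead of the sample, for example `open("glove.6B.50d.txt", encoding="utf-8")`, where the file name is an assumption about what the extracted archive contains.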

Denis Tarasov answered Nov 30 '25 04:11



