I understand that GloVe trains vectors by noticing what frequently co-occurs, etc, but how come commas and periods are not included? For anything NLP, it seems like it would be an important feature to have a vector representation. I realize that something like (king - man = queen) would make no sense with (word - , = ?), but is there a way to represent punctuation marks and Numbers?
Is there a pre-made data set that includes such things? Would this even work?
I tried training GloVe with my own data set, but I ran into a problem with separating the punctuation (with a blank space) between words, etc.
pre-trained GloVe vectors do have punctuation, what makes you think they don't? At least Wikipedia 2014 + Gigaword 5 (6B tokens) set from http://nlp.stanford.edu/projects/glove/ have embeddings for "," ".", "-" and other included, just download these word vectors, and verify it yourseld, they are in plain text format, so its easy to do.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With