How to treat numbers inside text strings when vectorizing words?

If I have a text string to be vectorized, how should I handle numbers inside it? Or if I feed a Neural Network with numbers and words, how can I keep the numbers as numbers?

I am planning on making a dictionary of all my words (as suggested here). In that case every string becomes an array of numbers. How should I handle characters that are themselves numbers? How do I output a vector that does not mix up word indices with the numeric characters?

Does converting numbers to strings weaken the information I feed the network?

Rikard asked Jul 01 '17 22:07


2 Answers

Expanding on your discussion with @user1735003, let's consider both ways of representing numbers:

  1. Treat the number as a string: consider it just another word and assign it an ID when building the dictionary.
  2. Convert the number to actual words: '1' becomes 'one', '2' becomes 'two', and so on.

Does the second option change the context in any way? To check, we can compare the two representations using word2vec: the similarity score will be high if the two tokens are used in similar contexts.

For example, '1' and 'one' have a similarity score of 0.17, and '2' and 'two' have a similarity score of 0.23. These low scores suggest that the digits and the spelled-out words are used in quite different contexts.

By treating numbers as just another word, you do not change their context. If you apply any other transformation to them, you cannot guarantee that it is for the better. So it is better to leave them untouched and treat each number as another word.

Note: Both word2vec and GloVe were trained by treating numbers as strings (case 1).
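To make the two options concrete, here is a minimal sketch in plain Python. The `DIGIT_WORDS` table, the function names and the two-sentence corpus are all made up for illustration, and the table only covers single digits. It shows how the resulting dictionary differs depending on whether numbers are kept as-is or spelled out.

```python
# Hypothetical lookup table for illustration; it only covers single digits.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def tokens_as_is(text):
    """Option 1: keep numbers as ordinary string tokens."""
    return text.split()

def tokens_spelled_out(text):
    """Option 2: replace single-digit tokens with their English words."""
    return [DIGIT_WORDS.get(tok, tok) for tok in text.split()]

def build_dictionary(token_lists):
    """Assign each distinct token the next free ID, starting at 1."""
    dictionary = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in dictionary:
                dictionary[tok] = len(dictionary) + 1
    return dictionary

corpus = ["my car number 3", "car number three"]  # toy data, not from the question

dict_as_is = build_dictionary(tokens_as_is(s) for s in corpus)
dict_spelled = build_dictionary(tokens_spelled_out(s) for s in corpus)

print(dict_as_is)    # '3' and 'three' receive separate IDs
print(dict_spelled)  # '3' is merged into 'three'
```

With option 1 the network sees '3' and 'three' as unrelated tokens; with option 2 they share one ID, which is exactly the kind of transformation whose effect on context is hard to predict.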

vijay m answered Oct 26 '22 10:10


The link you provide suggests that everything resulting from a .split(' ') is indexed: words, but also numbers, possibly smileys, and so on. (I would still take care of punctuation marks.) Unless you have more prior knowledge about your data or your problem, you could start with that.

EDIT

Example literally using your string and their code:

corpus = {'my car number 3'}
dictionary = {}
i = 1  # next free index
for tweet in corpus:
  for word in tweet.split(" "):
    # Index each new token once; a number such as '3' is treated like any other word
    if word not in dictionary:
      dictionary[word] = i
      i += 1
print(dictionary)
# {'my': 1, 'car': 2, 'number': 3, '3': 4}
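As for the punctuation caveat mentioned above, one possible way to handle it is to tokenize with a regular expression instead of splitting on spaces. This is only a sketch: the pattern, the `tokenize` helper and the sample sentence are illustrative choices, not part of the linked code. Note that this simple pattern would also split a decimal such as 3.5 into three tokens.

```python
import re

# Illustrative pattern: a run of word characters (letters, digits, underscore)
# is one token, and each remaining non-space character is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Lowercase the text and split it into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("My car, number 3!")
print(tokens)

# Drop punctuation tokens if they carry no signal for the task at hand.
words_only = [tok for tok in tokens if tok.isalnum()]
print(words_only)
```

Either token list can then be fed to the same dictionary-building loop; the number '3' still ends up as an ordinary entry.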
P-Gn answered Oct 26 '22 10:10