
What is UNK Token in Vector Representation of Words

Tags:

tensorflow

import collections

# Step 2: Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000


def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)

I am working through the elementary example of vector representation of words in TensorFlow.

This Step 2 is titled "Build the dictionary and replace rare words with UNK token", but there is no prior definition of what "UNK" refers to.

To specify the question:

0) What does UNK generally refer to in NLP?

1) What does count = [['UNK', -1]] mean? I know the brackets [] denote a list in Python, but why is 'UNK' paired with -1?

Beverlie asked Aug 17 '17 12:08

People also ask

What is token to vector?

Tokens are first generated from the corpus and vocabulary is built to map the tokens to their corresponding ids. Instead of building one vector for each token, a count vector simply counts how many times each token appeared in a sentence and places that number in their corresponding position in the vector.
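The two steps described above can be sketched in plain Python (the names `corpus`, `vocab`, and `count_vector` here are illustrative, not from the post): build a vocabulary mapping tokens to ids, then place each token's count at its id's position.

```python
from collections import Counter

corpus = ["the cat sat", "the cat ate the fish"]
tokens = [t for sentence in corpus for t in sentence.split()]
# Map each distinct token to an id (sorted here for reproducibility).
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def count_vector(sentence, vocab):
    """Return a list where position vocab[token] holds that token's count."""
    counts = Counter(sentence.split())
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]

print(count_vector("the cat ate the fish", vocab))
# -> [1, 1, 1, 0, 2]   (ids: ate, cat, fish, sat, the)
```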

What is token ID in NLP?

The token ids are indices in a vocabulary, in your case indices in a sub-word vocabulary. The ids themselves are not used during the training of a network, rather the ids are transformed into vectors. Say you are inputting three words, and their ids are 12,14, and 4.
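That transformation from ids to vectors is just a row lookup in an embedding matrix, as in this NumPy sketch (the vocabulary size and embedding dimension here are chosen arbitrarily):

```python
import numpy as np

vocab_size, embedding_dim = 20, 4
rng = np.random.default_rng(0)
# One learnable row per vocabulary entry; here just random values.
embeddings = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [12, 14, 4]           # ids looked up in the vocabulary
vectors = embeddings[token_ids]   # one embedding row per token id
print(vectors.shape)              # (3, 4)
```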

What is word tokenization?

Word tokenization is the process of splitting a large sample of text into words. This is a requirement in natural language processing tasks where each word needs to be captured and subjected to further analysis, such as classifying and counting them for sentiment analysis.
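A minimal word tokenizer can be sketched with the standard library's `re` module (real NLP pipelines use much more careful rules for punctuation, contractions, and so on):

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("My house is big, isn't it?"))
# -> ['my', 'house', 'is', 'big', "isn't", 'it']
```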


1 Answer

As already mentioned in the comments, when you see the UNK token in tokenizing and NLP, it usually indicates an unknown word.

For example, if you want to predict a missing word in a sentence, how would you feed your data to the model? You definitely need a token to mark where the missing word is. So if "house" is our missing word, after tokenizing it will look like:

'my house is big' -> ['my', 'UNK', 'is', 'big']

PS: that count = [['UNK', -1]] is for initializing the count, and each entry will end up looking like [['word', number_of_occurrences]], as Ivan Aksamentov has already said.
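To see that initialization at work, here is a toy run of a condensed version of the snippet from the question (the example words and the vocabulary size of 3 are chosen arbitrarily): only the two most common words get their own ids, every other word maps to id 0 ('UNK'), and the -1 placeholder in count[0][1] is overwritten with the final UNK count.

```python
import collections

def build_dataset(words, n_words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = []
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)  # 0 is dictionary['UNK']
        if index == 0:
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count  # replace the -1 placeholder
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
data, count, dictionary, _ = build_dataset(words, 3)
print(count[0])   # ['UNK', 3] -- 'sat', 'on', 'mat' were replaced by UNK
```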

Peyman answered Sep 25 '22 00:09