```python
# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    # Reserve slot 0 for 'UNK'; -1 is a placeholder count, filled in below.
    count = [['UNK', -1]]
    # Keep only the (n_words - 1) most frequent words.
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count  # overwrite the -1 placeholder with the real count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

# `vocabulary` is the list of words produced in Step 1 of the tutorial.
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
```
I am working through the elementary example of vector representation of words in TensorFlow.
Step 2 is titled "Build the dictionary and replace rare words with UNK token", but there is no prior definition of what "UNK" refers to.
To make the question specific:

0) What does UNK generally refer to in NLP?

1) What does `count = [['UNK', -1]]` mean? I know the brackets `[]` denote a list in Python, but why pair `'UNK'` with `-1`?
Tokens are first generated from the corpus, and a vocabulary is built that maps each token to an id. Instead of building one vector per token, a count vector simply counts how many times each token appears in a sentence and places that number at the token's position in the vector.
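As a rough illustration (a minimal sketch in plain Python, not from the tutorial; the toy vocabulary here is made up):

```python
from collections import Counter

# Hypothetical toy vocabulary mapping each token to a vector position.
vocab = {'my': 0, 'house': 1, 'is': 2, 'big': 3}

def count_vector(tokens, vocab):
    """Return a vector whose i-th entry is how often vocab word i occurs."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

print(count_vector(['my', 'house', 'is', 'big', 'my'], vocab))
# [2, 1, 1, 1]
```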
The token ids are indices into the vocabulary, in your case the word-level vocabulary that `build_dataset` constructs. The ids themselves are not used during the training of a network; rather, each id is transformed into a vector. Say you are inputting three words, and their ids are 12, 14, and 4.
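Concretely, the ids just index rows of an embedding matrix. A minimal NumPy sketch (the sizes are typical of the tutorial, and the vectors here are random rather than trained):

```python
import numpy as np

vocabulary_size, embedding_dim = 50000, 128
# Randomly initialized embedding matrix; training would update these rows.
embeddings = np.random.uniform(-1.0, 1.0, (vocabulary_size, embedding_dim))

ids = [12, 14, 4]          # the three word ids from the example
vectors = embeddings[ids]  # shape (3, 128): one row per input word
```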
Word tokenization is the process of splitting a large sample of text into words. It is a prerequisite for natural language processing tasks in which each word needs to be captured and subjected to further analysis, such as being classified or counted for sentiment.
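A minimal tokenization sketch (whitespace/punctuation splitting; real pipelines use more careful rules):

```python
import re

text = "My house is big. My garden is small!"
# Lowercase, then split on runs of non-letter characters.
tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
print(tokens)
# ['my', 'house', 'is', 'big', 'my', 'garden', 'is', 'small']
```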
As already mentioned in the comments, when you see the `UNK` token in tokenizing and NLP, it usually indicates an unknown word, i.e., one that is not in the vocabulary.
For example, suppose you want to predict a missing word in a sentence. How would you feed your data to the model? You definitely need a token to mark where the out-of-vocabulary word is. So if "house" is our missing word, after tokenizing it will look like:

'my house is big' -> ['my', 'UNK', 'is', 'big']
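In code, that replacement is just a dictionary lookup with a fallback, mirroring the logic inside `build_dataset` (a sketch with a toy vocabulary):

```python
dictionary = {'UNK': 0, 'my': 1, 'is': 2, 'big': 3}  # toy vocabulary

sentence = ['my', 'house', 'is', 'big']
# 'house' is out of vocabulary, so it falls back to 'UNK' (id 0).
tokens = [w if w in dictionary else 'UNK' for w in sentence]
ids = [dictionary.get(w, dictionary['UNK']) for w in sentence]
print(tokens)  # ['my', 'UNK', 'is', 'big']
print(ids)     # [1, 0, 2, 3]
```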
PS: `count = [['UNK', -1]]` just initializes the count list, which will end up looking like `[['word', number_of_occurrences], ...]`, as Ivan Aksamentov has already said. The `-1` is a placeholder that is later overwritten by `count[0][1] = unk_count`.
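To make both points concrete, here is a small demo (not from the tutorial) of running the question's `build_dataset` on a toy corpus:

```python
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'dog']
data, count, dictionary, reverse_dictionary = build_dataset(words, n_words=4)

print(count)
# [['UNK', 3], ('the', 3), ('cat', 1), ('sat', 1)]
#  ^ the -1 placeholder has been overwritten with the real UNK count
print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2, 'sat': 3}
print(data)        # [1, 2, 3, 0, 1, 0, 1, 0]
```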