```python
# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    # Reserve slot 0 for 'UNK'; -1 is a placeholder count, filled in below.
    count = [['UNK', -1]]
    # Keep only the (n_words - 1) most frequent words.
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count  # overwrite the -1 placeholder with the real count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

# `vocabulary` is the list of words produced in Step 1 of the tutorial.
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
```
I am working through the elementary example of vector representation of words in TensorFlow.
Step 2 is titled "Build the dictionary and replace rare words with UNK token", but there is no prior definition of what "UNK" refers to.
To make the question specific:

0) What does UNK generally refer to in NLP?

1) What does `count = [['UNK', -1]]` mean? I know the brackets `[]` denote a list in Python, but why pair `'UNK'` with `-1`?
Tokens are first generated from the corpus, and a vocabulary is built that maps each token to an id. Instead of building one vector per token, a count vector simply counts how many times each token appears in a sentence and places that number at the token's position in the vector.
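As a rough illustration (a minimal sketch in plain Python, not from the tutorial; the toy vocabulary here is made up):

```python
from collections import Counter

# Hypothetical toy vocabulary mapping each token to a vector position.
vocab = {'my': 0, 'house': 1, 'is': 2, 'big': 3}

def count_vector(tokens, vocab):
    """Return a vector whose i-th entry is how often vocab word i occurs."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

print(count_vector(['my', 'house', 'is', 'big', 'my'], vocab))
# [2, 1, 1, 1]
```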
The token ids are indices into the vocabulary, in your case the word-level vocabulary that `build_dataset` constructs. The ids themselves are not used during the training of a network; rather, each id is transformed into a vector. Say you are inputting three words, and their ids are 12, 14, and 4.
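Concretely, the ids just index rows of an embedding matrix. A minimal NumPy sketch (the sizes are typical of the tutorial, and the vectors here are random rather than trained):

```python
import numpy as np

vocabulary_size, embedding_dim = 50000, 128
# Randomly initialized embedding matrix; training would update these rows.
embeddings = np.random.uniform(-1.0, 1.0, (vocabulary_size, embedding_dim))

ids = [12, 14, 4]          # the three word ids from the example
vectors = embeddings[ids]  # shape (3, 128): one row per input word
```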
Word tokenization is the process of splitting a large sample of text into words. It is a prerequisite for natural language processing tasks in which each word needs to be captured and subjected to further analysis, such as being classified or counted for sentiment.
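A minimal tokenization sketch (whitespace/punctuation splitting; real pipelines use more careful rules):

```python
import re

text = "My house is big. My garden is small!"
# Lowercase, then split on runs of non-letter characters.
tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
print(tokens)
# ['my', 'house', 'is', 'big', 'my', 'garden', 'is', 'small']
```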
As already mentioned in the comments, when you see the `UNK` token in tokenizing and NLP, it usually indicates an unknown word, i.e., one that is not in the vocabulary.
For example, suppose you want to predict a missing word in a sentence. How would you feed your data to the model? You definitely need a token to mark where the out-of-vocabulary word is. So if "house" is our missing word, after tokenizing it will look like:

'my house is big' -> ['my', 'UNK', 'is', 'big']
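In code, that replacement is just a dictionary lookup with a fallback, mirroring the logic inside `build_dataset` (a sketch with a toy vocabulary):

```python
dictionary = {'UNK': 0, 'my': 1, 'is': 2, 'big': 3}  # toy vocabulary

sentence = ['my', 'house', 'is', 'big']
# 'house' is out of vocabulary, so it falls back to 'UNK' (id 0).
tokens = [w if w in dictionary else 'UNK' for w in sentence]
ids = [dictionary.get(w, dictionary['UNK']) for w in sentence]
print(tokens)  # ['my', 'UNK', 'is', 'big']
print(ids)     # [1, 0, 2, 3]
```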
PS: `count = [['UNK', -1]]` just initializes the count list, which will end up looking like `[['word', number_of_occurrences], ...]`, as Ivan Aksamentov has already said. The `-1` is a placeholder that is later overwritten by `count[0][1] = unk_count`.
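To make both points concrete, here is a small demo (not from the tutorial) of running the question's `build_dataset` on a toy corpus:

```python
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'dog']
data, count, dictionary, reverse_dictionary = build_dataset(words, n_words=4)

print(count)
# [['UNK', 3], ('the', 3), ('cat', 1), ('sat', 1)]
#  ^ the -1 placeholder has been overwritten with the real UNK count
print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2, 'sat': 3}
print(data)        # [1, 2, 3, 0, 1, 0, 1, 0]
```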