I'm new to machine learning and TensorFlow. Since I don't know Python, I decided to use the JavaScript version (which is maybe more like a wrapper).
The problem is that I'm trying to build a model that processes natural language. The first step is to tokenize the text in order to feed the data to the model. I did a lot of research, but most of it uses the Python version of TensorFlow, with methods like tf.keras.preprocessing.text.Tokenizer,
for which I can't find an equivalent in TensorFlow.js. I'm stuck at this step and don't know how to transform text into vectors that can be fed to a model. Please help :)
Overview. Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation. The tensorflow_text package provides a number of tokenizers available for preprocessing text required by your text-based models.
A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"] : notice how it removed the whitespace, and combined multi-digit numbers into a single number.
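A tokenizer like the one described can be sketched in a few lines of JavaScript (this regex-based approach is just an illustration, not any library's API):

```javascript
// Split an arithmetic expression into number and operator tokens,
// discarding whitespace along the way.
function tokenize(input) {
  const tokens = input.match(/\d+|[+\-*/()]/g);
  return tokens || [];
}

console.log(tokenize("512 + 10")); // ["512", "+", "10"]
```

Note how the regex naturally drops the whitespace and keeps multi-digit numbers together, exactly the behavior described above.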
texts_to_sequences(texts) transforms each text in texts into a sequence of integers. Only the top num_words - 1 most frequent words are taken into account, and only words known by the tokenizer are considered.
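TensorFlow.js has no built-in equivalent, but the behavior of texts_to_sequences can be approximated in plain JavaScript. The SimpleTokenizer class below is a hand-rolled sketch, not a tf.js API:

```javascript
// Minimal stand-in for Keras' Tokenizer: builds a word index from
// training texts, then maps each text to a sequence of integers.
class SimpleTokenizer {
  constructor(numWords) {
    this.numWords = numWords; // keep only the top numWords - 1 words
    this.wordIndex = {};
  }
  fitOnTexts(texts) {
    const counts = {};
    for (const text of texts) {
      for (const word of text.toLowerCase().split(/\s+/)) {
        counts[word] = (counts[word] || 0) + 1;
      }
    }
    // The most frequent word gets index 1 (0 is reserved for padding).
    Object.keys(counts)
      .sort((a, b) => counts[b] - counts[a])
      .forEach((word, i) => { this.wordIndex[word] = i + 1; });
  }
  textsToSequences(texts) {
    return texts.map(text =>
      text.toLowerCase().split(/\s+/)
        .map(word => this.wordIndex[word])
        .filter(idx => idx !== undefined && idx < this.numWords)
    );
  }
}
```

Usage mirrors the Keras workflow: call fitOnTexts on the corpus once, then textsToSequences on anything you want to encode.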
One commonly used subword tokenization technique that can be applied to many NLP models is called WordPiece. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespace) and then tokenizes each word into subword units, called wordpieces.
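For illustration, a greedy longest-match wordpiece split could look like the sketch below. The toy vocabulary and the [UNK] handling are assumptions for the example; real WordPiece vocabularies are learned from data:

```javascript
// Greedy longest-match wordpiece split of a single word.
// Continuation pieces carry the conventional "##" prefix.
function wordpiece(word, vocab) {
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let piece = null;
    while (end > start) {
      let candidate = word.slice(start, end);
      if (start > 0) candidate = "##" + candidate;
      if (vocab.has(candidate)) { piece = candidate; break; }
      end--;
    }
    if (piece === null) return ["[UNK]"]; // no piece matches: unknown word
    pieces.push(piece);
    start = end;
  }
  return pieces;
}

const pieceVocab = new Set(["token", "##izer", "##ize", "play", "##ing"]);
console.log(wordpiece("tokenizer", pieceVocab)); // ["token", "##izer"]
```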
To transform text into vectors, there are lots of ways to do it, all depending on the use case. The most intuitive one uses term frequency: given the vocabulary of the corpus (all the possible words), each text document is represented as a vector where each entry counts the occurrences of the corresponding word in that document.
With this vocabulary :
["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"]
the following text:
["machine", "is", "a", "field", "machine", "is", "is"]
will be transformed as this vector:
[2, 0, 3, 1, 0, 1, 0, 0, 0]
One disadvantage of this technique is that the vector, which has the same size as the corpus vocabulary, may contain a lot of zeros. That is why other techniques exist. However, the bag of words is still often referred to, and there is a slightly different version of it using tf-idf.
const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"]
const text = ["machine", "is", "a", "field", "machine", "is", "is"]
// For each vocabulary word, count its occurrences in the text
const parse = (t) => vocabulary.map(w => t.reduce((count, token) => token === w ? count + 1 : count, 0))
console.log(parse(text)) // [2, 0, 3, 1, 0, 1, 0, 0, 0]
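For comparison, the tf-idf variant mentioned above can be sketched like this (using the plain logarithmic formula; real libraries differ in their smoothing conventions):

```javascript
// tf-idf: term frequency weighted by inverse document frequency,
// so words that appear in many documents are down-weighted.
const corpus = [
  ["machine", "learning", "is", "a", "new", "field"],
  ["machine", "is", "a", "field", "machine", "is", "is"],
  ["computer", "science", "is", "a", "field"],
];

const terms = [...new Set(corpus.flat())];

// Number of documents that contain each term.
const docFreq = Object.fromEntries(
  terms.map(w => [w, corpus.filter(doc => doc.includes(w)).length])
);

function tfidf(doc) {
  return terms.map(w => {
    const tf = doc.filter(t => t === w).length / doc.length;
    const idf = Math.log(corpus.length / docFreq[w]);
    return tf * idf;
  });
}

console.log(tfidf(corpus[1]));
```

Words like "is" that appear in every document get an idf of log(1) = 0, so they vanish from the vector, while rarer words keep positive weight.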
There is also the following module that might help to achieve what you want
Well, I faced this issue and handled it by following the steps below. First, fit the tokenizer and print its word index in your Python code:
tokenizer.fit_on_texts([data])
print(tokenizer.word_index)
Then export that word_index mapping (e.g. as JSON) so it can be used as word2index on the JavaScript side.
function getTokenisedWord(seedWord) {
  // Look up the integer token for the word and wrap it in a 1-D tensor
  const _token = word2index[seedWord.toLowerCase()]
  return tf.tensor1d([_token])
}

const seedWordToken = getTokenisedWord('Hello');
model.predict(seedWordToken).data().then(predictions => {
  // argMax gives the index of the highest-probability class
  const resultIdx = tf.argMax(predictions).dataSync()[0];
  console.log('Predicted Word ::', index2word[resultIdx]);
})
index2word is the reverse mapping of the word2index JSON object.
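Building index2word from word2index is a one-liner (the example mapping here is assumed):

```javascript
// Invert word2index ({word: id}) into index2word ({id: word}).
const word2index = { "hello": 1, "world": 2 };
const index2word = Object.fromEntries(
  Object.entries(word2index).map(([word, idx]) => [idx, word])
);

console.log(index2word[1]); // "hello"
```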