 

keras.preprocessing.text.Tokenizer equivalent in Pytorch?


Basically the title: is there any equivalent to keras.preprocessing.text.Tokenizer in PyTorch? I have yet to find anything that provides all of its utilities without hand-crafting things.
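For reference, the behavior being asked about is easy to sketch by hand. The following is a rough, hypothetical stand-in (not Keras' actual implementation) for the Tokenizer's fit_on_texts/texts_to_sequences workflow:

```python
from collections import Counter

def fit_on_texts(texts):
    """Build a word -> index vocabulary, most frequent words first.

    Indices start at 1; 0 is reserved for padding, matching Keras' convention.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index):
    """Map each text to a list of integer token ids, dropping unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

corpus = ["the cat sat", "the dog sat down"]
word_index = fit_on_texts(corpus)
sequences = texts_to_sequences(corpus, word_index)
```

This is only meant to show the shape of the API; the real Keras Tokenizer also handles filters, out-of-vocabulary tokens, num_words limits, and so on.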

asked Sep 03 '19 by katiex7

1 Answer

I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:

from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
# Build the vocabulary from the corpus; low indices are reserved for
# special tokens (padding, unknown, etc.), so corpus words start at 5.
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]

print(encoded_data)

[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]

encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()

BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))

It comes with other types of encoders, such as spaCy's tokenizer, subword encoder, etc.
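The padding helpers used above are conceptually simple. As an illustrative pure-Python sketch (not PyTorch-NLP's actual implementation) of what pad-and-stack does to a batch of token-id sequences:

```python
def pad_sequence(seq, length, padding_index=0):
    """Right-pad a token-id list to a fixed length (mirrors pad_tensor's effect)."""
    return seq + [padding_index] * (length - len(seq))

def stack_and_pad(sequences, padding_index=0):
    """Pad every sequence to the longest one and return (batch, original lengths)."""
    lengths = [len(s) for s in sequences]
    max_len = max(lengths)
    batch = [pad_sequence(s, max_len, padding_index) for s in sequences]
    return batch, lengths

batch, lengths = stack_and_pad([[5, 6, 7, 8], [9, 10, 11, 12, 13]])
```

The library versions do the same thing on torch tensors and also return the lengths, which is what downstream utilities like pack_padded_sequence need.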

answered Sep 29 '22 by Feng Mai