
How do I translate using HuggingFace from Chinese to English?

I want to translate from Chinese to English using Hugging Face's transformers library with the pretrained "xlm-mlm-xnli15-1024" model. This tutorial shows how to do it from English to German.

I tried following the tutorial, but it doesn't explain how to change the source and target languages manually or how to decode the result. I am lost on where to start. Sorry that this question could not be more specific.

Here is what I tried:

from transformers import AutoModelWithLMHead, AutoTokenizer
base_model = "xlm-mlm-xnli15-1024"
model = AutoModelWithLMHead.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

inputs = tokenizer.encode("translate English to Chinese: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs.tolist()[0]))
'<s>translate english to chinese : hugging face is a technology company based in new york and paris </s>china hug ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™'
asked Mar 02 '23 by wtwtwt


2 Answers

This Chinese-to-English model may be helpful: https://huggingface.co/Helsinki-NLP/opus-mt-zh-en

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load a MarianMT model trained specifically for Chinese-to-English translation
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")

text = '央视春晚,没有最烂,只有更烂'
tokenized_text = tokenizer([text], return_tensors='pt')
translation = model.generate(**tokenized_text)
# skip_special_tokens=True drops markers like </s> from the decoded output
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(translated_text)
answered Apr 07 '23 by Justin Gerard


The model you mention, xlm-mlm-xnli15-1024, can be used for translation, but not in the way shown in the tutorial you link to.

That tutorial is specific to the T5 model. With an XLM model, you feed only the source sentence, but you need to add the language embedding; this is explained in the Hugging Face tutorial on multilingual models. Note also that this XLM model is primarily meant to provide cross-lingual representations for downstream tasks, so you cannot expect very good translation quality.

answered Apr 07 '23 by Jindřich