 

Translation of Single Words, Taking into Account Context, using Computer Language Processing Tools

I would like to automatically annotate texts for learners of foreign languages with translations of difficult words.

For instance, the original text:

El gato esta en la casa de mis vecinos

becomes:

El gato esta en la casa de mis vecinos (neighbours)

The first step is to identify which words are the difficult ones. This could be done by lemmatization of the words in the original text and comparing them with a list of 'easy words' (a basic vocabulary of 1500-2000 words). Those not found in this list will be designated as 'hard words.' This process seems straightforward enough using the Natural Language Toolkit (NLTK) for Python.
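
A rough sketch of this step is below, assuming NLTK and its 'punkt' tokenizer data are available. NLTK does not ship a Spanish lemmatizer, so its Snowball stemmer is used here as a stand-in for lemmatization, and the basic vocabulary is a tiny made-up sample rather than a real 1500-2000 word list:

```python
# Sketch: flag words whose stem is not covered by a basic vocabulary.
# The Snowball stemmer is a stand-in for proper lemmatization, and
# EASY_BASE_FORMS is a made-up sample of a basic vocabulary.
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("spanish")

EASY_BASE_FORMS = ["el", "gato", "estar", "esta", "en", "la", "casa", "de", "mi", "mis"]
EASY_STEMS = {stemmer.stem(w) for w in EASY_BASE_FORMS}

def hard_words(sentence):
    """Return the words whose stem is not in the basic vocabulary."""
    tokens = word_tokenize(sentence, language="spanish")
    return [tok for tok in tokens
            if tok.isalpha() and stemmer.stem(tok.lower()) not in EASY_STEMS]

print(hard_words("El gato esta en la casa de mis vecinos"))
# expected output: ['vecinos']
```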

There is some difficulty with words that must be translated as a pair, such as 'newly weds', phrasal verbs such as 'he called me up', or the German 'er ruft mich an' ('he calls me up', from anrufen). Here the words can't be treated individually. For phrasal verbs and the like, perhaps some understanding of grammar is needed.
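
For contiguous pairs like 'newly weds', NLTK's MWETokenizer can merge the tokens before the vocabulary lookup; separated phrasal verbs such as 'he called me up' slip through it and would need a parser instead. A small sketch with a made-up expression list:

```python
# Sketch: merge known multi-word expressions into single tokens.
# The expression list is a made-up example.
from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([("newly", "weds"), ("called", "up")], separator=" ")

print(mwe.tokenize(word_tokenize("The newly weds live next door")))
# -> ['The', 'newly weds', 'live', 'next', 'door']

print(mwe.tokenize(word_tokenize("He called me up yesterday")))
# -> ['He', 'called', 'me', 'up', 'yesterday']  (not merged: the two tokens are not adjacent)
```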

The second step involves obtaining a correct translation of the difficult words according to the context in which they appear. As I understand it, this is effectively applying the first half of a statistical machine translation system like Google Translate. I believe this problem could be solved using the Google Translate Research API, which lets you send text to be translated; the response includes information about which word in the translation corresponds to which word in the original text. So you could feed in the whole sentence and then fish out the word you wanted from the response. You have to apply to use this API, however, and it has usage limits, which would likely be a problem for my application. I would rather find another solution. I expect no solution will give 100% correct translations and they will have to be checked by hand, but this should still speed things up.
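
As a sketch of how such alignment information could be used, translate_with_alignment() below is a hypothetical placeholder for whatever service is eventually chosen, and its response format (a translated sentence plus a list of source/target index pairs) is an assumption, not the actual Google API:

```python
# Hypothetical sketch: no real translation API is called here.
def translate_with_alignment(sentence, source="es", target="en"):
    """Placeholder for an MT service that returns (translation, alignment),
    where alignment is a list of (source_index, target_index) token pairs."""
    raise NotImplementedError

def annotate(sentence, hard_word):
    """Append the aligned translation of hard_word in brackets after it."""
    src_tokens = sentence.split()
    translation, alignment = translate_with_alignment(sentence)
    tgt_tokens = translation.split()
    i = src_tokens.index(hard_word)
    # Gather every target-side word aligned to the hard source word
    gloss = " ".join(tgt_tokens[t] for s, t in alignment if s == i)
    src_tokens[i] = f"{hard_word} ({gloss})"
    return " ".join(src_tokens)

# With a working service behind translate_with_alignment(),
# annotate("El gato esta en la casa de mis vecinos", "vecinos")
# should give "El gato esta en la casa de mis vecinos (neighbours)".
```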

Thanks for your comments.

David

asked Nov 14 '22 by Davidw


1 Answer

For the initial step, there is no need to rely on an a priori vocabulary: simply accumulating token counts in a training corpus and marking the tokens in your test set that do not occur before a cutoff point in the rank-ordered vocabulary should suffice.

http://vuw.academia.edu/JosephSorell/Papers/549885/Zipfs_Law_and_Vocabulary
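
A minimal sketch of that approach, where corpus_tokens is assumed to be an iterable of tokens from a training corpus and the cutoff rank of 2000 is arbitrary:

```python
# Sketch: build the 'easy' vocabulary from corpus frequencies instead of a fixed list.
from collections import Counter

def build_easy_vocab(corpus_tokens, cutoff_rank=2000):
    """The cutoff_rank most frequent tokens in the training corpus."""
    counts = Counter(tok.lower() for tok in corpus_tokens)
    return {word for word, _ in counts.most_common(cutoff_rank)}

def mark_hard(sentence, easy_vocab):
    """Words of the test sentence that fall outside the frequency cutoff."""
    return [tok for tok in sentence.split() if tok.lower() not in easy_vocab]
```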

For the second step, "obtaining a correct translation of the difficult words according to the context in which they appear", yes, you would need access to an MT API and/or human translation. Choosing the best approach depends on your goals.

You can have a correct translation, a fast translation, or a cheap translation - I know of no way you can have all three simultaneously.

answered Feb 26 '23 by Sean W