I would like to automatically annotate texts for learners of foreign languages with translations of difficult words.
For instance, if the original text is:
El gato está en la casa de mis vecinos
it becomes:
El gato está en la casa de mis vecinos (neighbours)
The first step is to identify which words are the difficult ones. This could be done by lemmatizing the words in the original text and comparing the lemmas with a list of 'easy words' (a basic vocabulary of 1500-2000 words). Any word not found in this list would be designated a 'hard word.' This seems straightforward enough using the Natural Language Toolkit (NLTK) for Python.
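To make the idea concrete, here is a minimal sketch of that first step. It is only an illustration: the tiny easy-word set and the lookup-table lemmatizer are hypothetical stand-ins I made up so that the example runs on its own, and a real setup would load the full basic vocabulary and use a proper lemmatizer for the language in question.

```python
import re

# Hypothetical stand-ins: a real setup would load a 1500-2000 word basic
# vocabulary and use a proper lemmatizer for the language in question.
EASY_LEMMAS = {"el", "gato", "estar", "en", "la", "casa", "de", "mi"}
TOY_LEMMAS = {"está": "estar", "mis": "mi", "vecinos": "vecino"}


def lemmatize(token):
    """Toy lemmatizer: look the token up, else return it unchanged."""
    return TOY_LEMMAS.get(token, token)


def hard_words(text):
    """Return the tokens whose lemma is not in the easy-word list, in order."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if lemmatize(t) not in EASY_LEMMAS]


print(hard_words("El gato está en la casa de mis vecinos"))
```

With this toy data only 'vecinos' is flagged, since its lemma 'vecino' is missing from the easy list.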
Words that must be translated as a unit are harder to handle, such as 'newly weds,' phrasal verbs like 'he called me up,' or the German 'er ruft mich an' (anrufen). Here the words can't be treated individually. For phrasal verbs and the like, perhaps some understanding of grammar is needed.
The second step involves obtaining a correct translation of the difficult words according to the context in which they appear. As I understand it, this is effectively applying the first half of a statistical machine translation system like Google Translate. I believe this problem could be solved using the Google Translate Research API, which lets you send text to be translated and returns a response that includes information about which word in the translation corresponds to which word in the original text. So you could feed in the whole sentence and then fish out the word you wanted from the response. You have to apply to use this API, however, and it has usage limits, which would likely be a problem for my application. I would rather find another solution. I expect no solution will give 100% correct translations and the results will have to be checked by hand, but this should still speed things up.
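As a rough sketch of the "fish out the word" part: suppose some translation system returns the translated tokens plus a word alignment. Below I assume the alignment comes as "i-j" index pairs (source token i corresponds to target token j), the plain-text style used by several word-alignment tools. That is my assumption for illustration and not the actual response format of the Google API; the translation and alignment are made-up example data, and the function name is hypothetical.

```python
def annotate(source_tokens, target_tokens, alignment, hard_words):
    """Append the aligned translation in parentheses after each hard word.

    `alignment` is a string of "i-j" pairs (assumed format): source token i
    is aligned to target token j.
    """
    aligned = {}
    for pair in alignment.split():
        src, tgt = (int(n) for n in pair.split("-"))
        aligned.setdefault(src, []).append(tgt)

    out = []
    for i, token in enumerate(source_tokens):
        if token.lower() in hard_words and i in aligned:
            # A source word may align to several target words; join them.
            gloss = " ".join(target_tokens[j] for j in sorted(aligned[i]))
            out.append(f"{token} ({gloss})")
        else:
            out.append(token)
    return " ".join(out)


# Made-up example data standing in for a real system's output.
source = "El gato está en la casa de mis vecinos".split()
target = "The cat is in the house of my neighbours".split()
alignment = "0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 8-8"
print(annotate(source, target, alignment, {"vecinos"}))
```

Because a source index can map to several target indices, the same approach would in principle cover a single hard word that translates to more than one word, though discontinuous cases like 'ruft ... an' would need the alignment to run in the other direction as well.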
Thanks for your comments.
David
For the initial step, there is no need to rely on an a priori vocabulary. Simply accumulate token counts over a training corpus, rank the vocabulary by frequency, and mark any token in your test set that does not appear above a chosen cutoff rank; that should suffice.
http://vuw.academia.edu/JosephSorell/Papers/549885/Zipfs_Law_and_Vocabulary
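The frequency-cutoff idea can be sketched in a few lines with the standard library. This is only a toy illustration: the training text is made up, the cutoff of 4 is arbitrary, and the function names are my own. In practice the training corpus would be large and the cutoff would be somewhere in the low thousands.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_common_vocab(training_text, cutoff):
    """Return the `cutoff` most frequent tokens in the training corpus."""
    counts = Counter(tokenize(training_text))
    return {word for word, _ in counts.most_common(cutoff)}


def rare_tokens(test_text, common_vocab):
    """Return test tokens that fall outside the common vocabulary, in order."""
    return [t for t in tokenize(test_text) if t not in common_vocab]


# Made-up toy corpus; a real one would be far larger.
training = "the cat is in the house the dog is in the garden the cat sleeps"
vocab = build_common_vocab(training, cutoff=4)
print(rare_tokens("the cat is in the neighbours house", vocab))
```

Note that this counts surface tokens rather than lemmas, so inflected forms of a common word may still be flagged unless the counts are accumulated over lemmas instead.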
For the second step, "obtaining a correct translation of the difficult words according to context in which they appear", yes, you would need access to an MT API and/or human translation. Choosing the best approach depends on your goals.
You can have a correct translation, a fast translation, or a cheap translation - I know of no way you can have all three simultaneously.