I'm suddenly getting some strange markup when translating phrases with the Google Translate API via the Java library. Examples for English → Swedish include:
Vector graphics → vektor~~POS=TRUNC grafikk~~POS=HEADCOMP
Javascript → Javascript script~~POS=HEADCOMP
It looks like it's related to compound noun handling. Is this a feature of the API that I can deactivate somehow or is this a new bug on the server side?
It relies on probability, not accuracy. Google Translate is based on what is known as Statistical Machine Translation (SMT). Basically, SMT references all available human-translated documents since 1957 and tries to match strings of text from one language to another.
Google Translate often produces translations that contain significant grammatical errors. This is because Google's translation system uses a method based on language pair frequency that does not take grammatical rules into account. Google Translate does not have a system to correct for translation errors.
This looks like a bug in the server-side translator. I also get it on the web site: https://translate.google.com/#view=home&op=translate&sl=ru&tl=no&text=%D0%9E%D0%B1%D1%89%D0%B5%D0%B6%D0%B8%D1%82%D0%B8%D0%B5 gives me vandrer~~POS=TRUNC.
In NLP, "POS" means part of speech, and "HEADCOMP" sounds like it could be the head of a noun compound. I'm guessing they TRUNCate the non-head parts of compounds (which are practically never inflected). So Google Translate is spilling some of its internals. What's surprising is that such tags are the staple of rule-based/knowledge-based systems, whereas Google typically uses only pure machine learning methods, shunning hard-coded knowledge. One possibility is that they used a noun-compound analyser to expand their training set and then ran ML on it (similar to how Systran & Koehn trained statistical MT on a parallel corpus translated with a rule-based MT system), but had a bug in the script that cleans up the tags before training.
It'd be fun to find out which system they used, in case it was an open-source one, but unfortunately the tags are practically ungoogleable, since the web is now littered with spammy machine-translated (and non-post-edited) pages full of those tags.
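Until it's fixed on the server, a client-side workaround is to strip the leaked tags from the returned text. Below is a minimal sketch, not anything from the API itself: the class name `PosTagStripper` and its `strip` method are made up for illustration, and the pattern assumes every tag has the shape `~~POS=` followed by uppercase letters, as in the examples seen so far (TRUNC, HEADCOMP). Other tag shapes may exist and would slip through.

```java
import java.util.regex.Pattern;

public class PosTagStripper {
    // Assumption: leaked tags look like "~~POS=" plus uppercase letters,
    // based only on the examples seen so far (TRUNC, HEADCOMP).
    private static final Pattern POS_TAG = Pattern.compile("~~POS=[A-Z]+");

    // Returns the translated text with the leaked tags removed.
    // The surrounding words and whitespace are left untouched.
    public static String strip(String translated) {
        if (translated == null) {
            return null;
        }
        return POS_TAG.matcher(translated).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("vektor~~POS=TRUNC grafikk~~POS=HEADCOMP"));
        System.out.println(strip("Javascript script~~POS=HEADCOMP"));
    }
}
```

Note that this only removes the markup. It doesn't repair the translation itself, so "Javascript script" still comes out with the duplicated word, and the compound parts are not rejoined.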
It seems it has to do with the way Google "translates" strings, returning what is statistically most likely correct. Common Unix commands might therefore end up in your translation.
More discussion about the topic: https://www.reddit.com/r/German/comments/47kfah/thanks_google/