
Getting weird markup from Google translate like ~~POS=TRUNC

I'm suddenly getting some strange markup when translating phrases with the Google Translate API via the Java library. Examples for English → Swedish include:

Vector graphics → vektor~~POS=TRUNC grafikk~~POS=HEADCOMP

Javascript → Javascript script~~POS=HEADCOMP

It looks like it's related to compound-noun handling. Is this a feature of the API that I can deactivate somehow, or is it a new bug on the server side?

asked Nov 09 '16 11:11 by Nic Cottrell




2 Answers

This looks like a bug in the server-side translator. I also get it on the web site: https://translate.google.com/#view=home&op=translate&sl=ru&tl=no&text=%D0%9E%D0%B1%D1%89%D0%B5%D0%B6%D0%B8%D1%82%D0%B8%D0%B5 gives me vandrer~~POS=TRUNC.

In NLP, "POS" means part of speech, and "HEADCOMP" sounds like it could be the head of a noun compound. I'm guessing they TRUNCate the non-head parts of compounds (which are practically never inflected). So Google Translate is spilling some of its internals. What's surprising is that such tags are the staple of rule-based/knowledge-based systems, whereas Google typically uses only pure machine-learning methods, shunning hard-coded knowledge. One possibility is that they used a noun-compound analyser to expand their training set, which they then ran ML on (similar to how Systran & Koehn trained statistical MT on a parallel corpus translated with a rule-based MT system), but had a bug in the script that was supposed to clean up the tags before training.

It'd be fun to find out which system they used, in case it was an open source one, but unfortunately the tags are practically ungoogleable, since the web is now littered with spammy machine translated (and non-post-edited) pages full of those tags.
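Since the tags seem to come from the server, I don't know of a way to switch them off from the client. As a stopgap you could strip them from the returned text yourself. Here is a minimal sketch in Java, assuming every leaked tag has the form `~~POS=` followed by uppercase letters, as in the examples seen so far. `PosTagStripper` and `strip` are hypothetical names, not part of the Google Translate client library:

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of the Google Translate client library.
final class PosTagStripper {
    // Assumption: leaked tags always look like "~~POS=" plus uppercase letters
    // (TRUNC and HEADCOMP are the only ones reported in this thread).
    private static final Pattern POS_TAG = Pattern.compile("~~POS=[A-Z]+");

    private PosTagStripper() {}

    // Returns the translated text with every matching tag removed.
    static String strip(String translated) {
        return POS_TAG.matcher(translated).replaceAll("");
    }
}
```

Calling `PosTagStripper.strip(...)` on the string returned by the API removes only the tags. It does not repair the translation itself, so an output such as "Javascript script" keeps its duplicated word, and the parts of a compound are not rejoined.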

answered Oct 18 '22 22:10 by unhammer


It seems to have to do with the way Google "translates" strings: it returns what is statistically most likely to be correct. Common Unix commands might therefore end up in your translation.

More discussion about the topic: https://www.reddit.com/r/German/comments/47kfah/thanks_google/

answered Oct 19 '22 00:10 by theBigBadBacon