I'm trying to extract Google Translate's pinyin transliteration of a Chinese word using Selenium but am having some trouble finding its WebElement.
For example, the word I look up is "事". My code would be as follows:
String word = "事";
WebDriver driver = new HtmlUnitDriver();
driver.get("http://translate.google.com/#zh-CN/zh-CN/" + word);
When I go to the actual page in my browser, I can see that its pinyin is "Shì" and that its id, according to Inspect Element, is src-translit. However, when I view the page source, id="src-translit" is present but there is nothing resembling "Shì" nearby. The element is simply empty.
Thinking that the page had not had time to load properly, I implemented a waiting period of 30 seconds (kind of a long wait, I know, but I just wanted to know if it would work).
int timeoutInSeconds = 30;
WebDriverWait wait = new WebDriverWait(driver, timeoutInSeconds);
wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("src-translit")));
Unfortunately, even with the wait, the text of the transliteration element still comes back empty.
WebElement transliteration = driver.findElement(By.id("src-translit"));
String pinyin = transliteration.getText();
My question, then, is: what has happened to src-translit? Why doesn't its text show up in the HTML, and how can I find it and copy it from Google Translate?
Sounds like JavaScript isn't being executed. Looking at the docs, you can enable JavaScript like this:
HtmlUnitDriver driver = new HtmlUnitDriver();
driver.setJavascriptEnabled(true);
or
HtmlUnitDriver driver = new HtmlUnitDriver(true);
See if that makes a difference.
EDIT:
I still think the problem is related to JavaScript. When I run it using FirefoxDriver, it works fine: the AJAX request is made, and the src-translit element is updated with Shì.
Workaround:
In any case, if you monitor the network traffic, you can see that when you translate 事, the page makes an AJAX call to
http://translate.google.com/translate_a/t?client=t&sl=zh-CN&tl=zh-CN&hl=en&sc=2&ie=UTF-8&oe=UTF-8&pc=1&oc=1&otf=1&rom=1&srcrom=1&ssel=0&tsel=0&q=%E4%BA%8B
This returns a JSON-like array (the empty slots between commas mean a strict JSON parser will reject it):
[[["事","事","Shì","Shì"]],,"zh-CN",,[["事",,false,false,0,0,0,0]],,,,[],10]
Maybe you could parse that instead for now.
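As a rough sketch of that parsing idea, the snippet below pulls the transliteration out of a response shaped like the one above using only the standard library. The class and method names (TranslitSketch, sourceTransliteration, encodeQuery) are made up for illustration, and it assumes that the fourth string of the first inner array is the source transliteration, which the sample response cannot confirm because both transliterations are identical. It also does not fetch anything; the endpoint is undocumented and may change or reject requests.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Selenium or any Google API.
class TranslitSketch {
    // One double-quoted string, allowing backslash escapes inside it.
    private static final String STR = "\"((?:[^\"\\\\]|\\\\.)*)\"";

    // The response is expected to start with [[["a","b","c","d" ...
    private static final Pattern FIRST_ENTRY =
            Pattern.compile("^\\[\\[\\[" + STR + "," + STR + "," + STR + "," + STR);

    // Percent-encodes the word as UTF-8 for use as the q= parameter.
    static String encodeQuery(String word) {
        try {
            return URLEncoder.encode(word, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Returns the fourth string of the first entry (assumed to be the
    // source transliteration), or null if the response has another shape.
    // Backslash escapes inside the string are returned as-is, not decoded.
    static String sourceTransliteration(String response) {
        Matcher m = FIRST_ENTRY.matcher(response);
        return m.find() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        String word = "\u4e8b"; // 事
        // Sample response copied from the answer above, written with escapes.
        String response = "[[[\"\u4e8b\",\"\u4e8b\",\"Sh\u00ec\",\"Sh\u00ec\"]],,"
                + "\"zh-CN\",,[[\"\u4e8b\",,false,false,0,0,0,0]],,,,[],10]";
        System.out.println("q=" + encodeQuery(word));
        System.out.println(sourceTransliteration(response));
    }
}
```

A regular expression is used here instead of a JSON library because the empty array slots make the response invalid as strict JSON. If the response format ever differs from the sample, sourceTransliteration returns null rather than guessing.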