
Extract text of website using Selenium and Python

I want to extract all the text in a specific webpage.

In JavaScript (a PhantomJS script), the code looks like this:

var webPage = require('webpage');
var page = webPage.create();

page.open('http://phantomjs.org', function (status) {
    console.log('Stripped down page text:\n' + page.plainText);
    phantom.exit();
});

How can I get the equivalent of page.plainText in Python?

Thanks.

kambi asked Apr 22 '26 23:04


2 Answers

If you want to do that with Selenium, select the top-level element of the page (body) and then read its text. In the Python bindings this is the element's .text property (getText() in Java).

For example, in Python:

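from selenium import webdriver

# Note: webdriver.PhantomJS was removed in Selenium 4; this runs on Selenium 3.x
driver = webdriver.PhantomJS(executable_path='pathTo/phantomjs')
driver.get('https://en.wikipedia.org/wiki/Selenium_(software)')
el = driver.find_element_by_tag_name('body')  # top-level element of the page
print(el.text)  # visible text of the whole page
driver.quit()  # quit() also shuts down the PhantomJS process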
Davide Patti answered Apr 27 '26 13:04


You can also read the innerText attribute of the body element:

text = driver.find_element_by_tag_name("body").get_attribute("innerText")
Ratmir Asanov answered Apr 27 '26 13:04
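
On current Selenium releases the snippets above need adjusting: webdriver.PhantomJS was removed in Selenium 4, and the find_element_by_* helpers were later dropped in favour of driver.find_element(By.TAG_NAME, "body"). Below is a minimal sketch of the same idea with headless Chrome. The helper names make_headless_chrome, page_text and main are hypothetical, the sketch assumes Selenium 4.6 or newer (so the driver binary is located automatically) and a local Chrome install, and reading document.body.innerText is only an approximation of PhantomJS's page.plainText, not a guaranteed match.

```python
def make_headless_chrome():
    """Start a headless Chrome session (assumes Chrome is installed locally)."""
    # Imported here so page_text() below stays usable with any WebDriver object.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    return webdriver.Chrome(options=options)


def page_text(driver, url):
    """Load url and return the rendered text of the page's <body>."""
    driver.get(url)
    # innerText reflects what is rendered, much like the
    # get_attribute("innerText") call in the second answer.
    return driver.execute_script("return document.body.innerText")


def main():
    driver = make_headless_chrome()
    try:
        print(page_text(driver, "http://phantomjs.org"))
    finally:
        driver.quit()  # always release the browser process
```

Calling main() should print the page text. page_text only relies on get() and execute_script(), so it should work the same way with a Firefox or remote driver.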


