The page I'm looking at contains:
<div id='1'> <p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p> </div>
I want to get all the text in the div, except for the text that is in the <h> element.
(I want to get "text 1", "text 3" and "text 4".)
There may be several <h> elements, or none at all.
There may also be several <p> elements, possibly nested inside one another, or none.
I thought of doing this by getting the full HTML source of the div and using a regex to remove the <h> elements. But selenium.get_text does not return the HTML, just the text (all of it!).
I know I can use selenium.get_html_source and then look for the element I need with a regex, but that seems wasteful, since Selenium already knows how to find the element.
Does anyone have a better solution? Thanks :)
You can read the innerHTML attribute to get the source of an element's content, or outerHTML to get the source including the element itself. In JavaScript: element.getAttribute('innerHTML');
With Selenium WebDriver, the getText method returns the text of a particular element: its visible inner text, excluding anything hidden on the page.
To get the HTML source of a WebElement in Selenium WebDriver, we can use the get_attribute method of the Selenium Python WebDriver.
Selenium offers a getText() method for getting the text of an element; that is, it reads the text value of an element on a web page.
The following code will give you the HTML in the div element:
sel = selenium('localhost', 4444, browser, my_url)
html = sel.get_eval("this.browserbot.getCurrentWindow().document.getElementById('1').innerHTML")
Then you can use BeautifulSoup to parse it and extract what you really want.
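As a rough illustration of that parsing step, here is a minimal sketch that drops the heading text from the HTML string. Note that it uses the standard library's html.parser instead of BeautifulSoup, only so that it runs without installing anything; the class name NonHeadingText and the function text_without_headings are made up for this example, and the hard-coded string stands in for whatever get_eval returns.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class NonHeadingText(HTMLParser):
    """Collect text chunks that are not inside an <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.heading_depth = 0  # greater than 0 while inside a heading
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that sits outside every heading.
        if text and self.heading_depth == 0:
            self.chunks.append(text)


def text_without_headings(html):
    """Return the text pieces of `html`, skipping heading content."""
    parser = NonHeadingText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Stand-in for the innerHTML string fetched from the div (assumption).
html = "<p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p>"
print(text_without_headings(html))  # ['text 1', 'text 3', 'text 4']
```

Because the parser only tracks whether it is currently inside a heading, it copes with any number of <h> or <p> elements, including nested <p> elements or none at all.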
I hope it helps
Use XPath. From selenium.py:
Without an explicit locator prefix, Selenium uses the following default strategies:
- dom, for locators starting with "document."
- xpath, for locators starting with "//"
- identifier, otherwise
In your case, you could try
selenium.get_text("//div[@id='1']/descendant::*[not(self::h1)]")
You can learn more about xpath here.
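One caveat, which I have not verified against Selenium RC: the locator above matches elements rather than text nodes, so if get_text reads the first match (the outer <p>), the heading text may still come through. What the predicate is aiming for is "every text node that has no heading ancestor". A small sketch of that selection in plain Python, assuming the markup is well-formed enough for the standard library's xml.etree.ElementTree (the sample from the question is; real pages often are not). The helper name collect_text is hypothetical.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def collect_text(element, out):
    """Append the text nodes under `element`, skipping heading subtrees."""
    if element.text and element.text.strip():
        out.append(element.text.strip())
    for child in element:
        if child.tag not in HEADINGS:
            collect_text(child, out)
        # child.tail is the text after the child's closing tag. It belongs
        # to `element`, so it is kept even when the child is a heading.
        if child.tail and child.tail.strip():
            out.append(child.tail.strip())
    return out


# The sample markup from the question.
source = "<div id='1'> <p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p> </div>"
root = ET.fromstring(source)
print(collect_text(root, []))  # ['text 1', 'text 3', 'text 4']
```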
P.S. I don't know if there's good HTML documentation available for python-selenium; I haven't found any. On the other hand, the docstrings of the selenium.py file seem to constitute comprehensive documentation, so I'd suggest looking at the source to get a better understanding of how it works.