How to get the html source of a specific element with selenium?

Tags:

python

selenium

The page I'm looking at contains:

<div id='1'>
  <p> text 1
    <h1> text 2 </h1>
    text 3
    <p> text 4 </p>
  </p>
</div>

I want to get all the text in the div except the text inside the <h> elements (that is, "text 1", "text 3" and "text 4"). There may be several <h> elements, or none at all. There may also be several <p> elements, possibly nested inside one another, or none.

I thought of doing this by getting the full HTML source of the div and using a regex to remove the <h> elements. But selenium.get_text does not return the HTML, just the text (all of it!).

I know I can use selenium.get_html_source and then find the element I need with a regex, but that seems wasteful, since Selenium already knows how to find the element.

Does anyone have a better solution? Thanks :)

asked Nov 29 '09 18:11

Rivka

2 Answers

The following code will give you the HTML inside the div element (browser and my_url stand for your own browser string and start URL):

sel = selenium('localhost', 4444, browser, my_url)
html = sel.get_eval("this.browserbot.getCurrentWindow().document.getElementById('1').innerHTML")

Then you can use BeautifulSoup to parse it and extract what you really want.

I hope it helps.
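For the parsing step, here is a minimal sketch of the idea. Note that it uses the standard library's html.parser instead of BeautifulSoup, purely so that it runs without extra installs; the class and function names are made up for illustration, and the sample string is only a stand-in for the innerHTML value returned by get_eval.

```python
from html.parser import HTMLParser


class TextOutsideHeadings(HTMLParser):
    """Collect text that is not nested inside an <h1>..<h6> element."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.heading_depth = 0  # greater than 0 while inside a heading
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text found outside any heading.
        if text and self.heading_depth == 0:
            self.chunks.append(text)


def text_outside_headings(html):
    """Return the list of text chunks in html that are not inside a heading."""
    parser = TextOutsideHeadings()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Stand-in for the innerHTML of the div from the question.
sample = "<p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p> </p>"
print(text_outside_headings(sample))  # ['text 1', 'text 3', 'text 4']
```

Tracking a depth counter rather than a flag keeps it working with several headings or with no headings at all, and nested <p> elements need no special handling because only heading tags change the state.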

answered Sep 21 '22 20:09

luc

Use xpath. From selenium.py:

Without an explicit locator prefix, Selenium uses the following default strategies:

dom, for locators starting with "document."

xpath, for locators starting with "//"

identifier, otherwise

In your case, you could try

selenium.get_text("//div[@id='1']/descendant::*[not(self::h1)]")

You can learn more about xpath here.
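One thing worth checking on your own page: descendant::*[not(self::h1)] matches elements, not text nodes, so a matched <p> that wraps an <h1> still carries the heading's text. The sketch below does not talk to Selenium at all. It only emulates the selection with the standard library's xml.etree.ElementTree on the question's markup (rewritten as well-formed XML, which is my assumption), and then shows a hypothetical helper, text_without_h1, that skips the heading subtrees.

```python
import xml.etree.ElementTree as ET

# The question's markup, written as well-formed XML so ElementTree can parse it.
doc = ET.fromstring(
    "<div id='1'><p> text 1 <h1> text 2 </h1> text 3 <p> text 4 </p></p></div>"
)

# Emulates descendant::*[not(self::h1)]: every descendant element except <h1>.
matched = [el for el in doc.iter() if el is not doc and el.tag != "h1"]
print([el.tag for el in matched])  # ['p', 'p']

# The full text of the first match still includes the heading's text.
print(" ".join("".join(matched[0].itertext()).split()))


def text_without_h1(element):
    """Yield the text under element, skipping anything inside an <h1>."""
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        if child.tag != "h1":
            yield from text_without_h1(child)
        # Text that follows a child (its tail) belongs to the parent, so keep it.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


print(list(text_without_h1(doc)))  # ['text 1', 'text 3', 'text 4']
```

If the locator turns out to return more than you want, fetching the div's HTML and filtering it yourself along these lines is one way around it.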

P.S. I haven't found any good HTML documentation for python-selenium; on the other hand, the docstrings in the selenium.py file seem to constitute comprehensive documentation. So I'd suggest reading the source to get a better understanding of how it works.

answered Sep 20 '22 20:09

int3
