How to properly escape single and double quotes

Tags:

python

lxml

I have an lxml etree HTMLParser object, and I'm trying to build XPath expressions to assert on a tag's path, its attributes, and its text. I ran into a problem when the text of the tag contains single quotes (') or double quotes ("), and I've exhausted all my options.

Here's a sample object I created:

from io import StringIO
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<html><body><p align="center">Here is my 'test' "string"</p></body></html>"""), parser)

Here is the snippet of code, followed by the variations of the variable being read in:

   def getXpath(self):
     # xpath and attrsCount are set up earlier in the method (not shown)
     xpath += 'starts-with(., \'' + self.text + '\') and '
     xpath += ('count(@*)=' + str(attrsCount) if self.exactMatch else "1=1") + ']'
     return xpath

self.text is basically the expected text of the tag, in this case: Here is my 'test' "string"

This fails when I try to use the xpath method of the parsed tree:

tree.xpath(self.getXpath())

The reason is that the XPath it receives is '/html/body/p[starts-with(.,'Here is my 'test' "string"') and 1=1]', where the single quote before test ends the string literal early.
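The failure can be reproduced in isolation. The sketch below assumes lxml is installed and uses a hypothetical build_xpath helper standing in for getXpath; it only aims to show that the concatenated expression is rejected as invalid XPath, while the same expression without quotes in the text is accepted.

```python
from io import StringIO

from lxml import etree

HTML = """<html><body><p align="center">Here is my 'test' "string"</p></body></html>"""
tree = etree.parse(StringIO(HTML), etree.HTMLParser())


def build_xpath(text):
    # Hypothetical stand-in for getXpath(): naive concatenation, as in the question.
    return "/html/body/p[starts-with(., '" + text + "') and 1=1]"


def is_valid_xpath(expression):
    """Return True if lxml can evaluate the expression, False if it rejects it."""
    try:
        tree.xpath(expression)
        return True
    except etree.XPathError:
        return False


plain_ok = is_valid_xpath(build_xpath("Here is my"))
quoted_ok = is_valid_xpath(build_xpath("Here is my 'test' \"string\""))
print(plain_ok, quoted_ok)
```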

How can I properly escape the single and double quotes in the self.text variable? I've tried triple quoting, wrapping self.text in repr(), and using re.sub or string.replace to escape ' and " with \' and \".


asked Oct 18 '11 04:10

Bob Evans

1 Answer

According to Wikipedia and W3Schools, you should not have ' and " in node content, even though only < and & are said to be strictly illegal. They should be replaced by the corresponding "predefined entity references", which are &apos; and &quot;.

By the way, the Python parsers I use will take care of this transparently: when writing, they are replaced; when reading, they are converted.
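As a small illustration of that round trip, here is a sketch that uses the standard-library xml.etree.ElementTree (an assumption made for portability; it is not the lxml parser from the question). The sample markup and the title attribute are made up for the example.

```python
import xml.etree.ElementTree as ET

# Reading: the predefined entities are converted back to plain characters.
parsed = ET.fromstring("<p>Here is my &apos;test&apos; &quot;string&quot;</p>")
text_read = parsed.text

# Writing: a double quote inside an attribute value is replaced by &quot;.
element = ET.Element("p")
element.set("title", 'my "string"')
element.text = "Here is my 'test' \"string\""
serialized = ET.tostring(element)

print(text_read)
print(serialized)
```

Note that with this library the replacement on writing applies to the attribute value; quotes in the text content are written out unchanged, since they are legal there.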

After a second reading of your question, I tested a few things with ' and " in the Python interpreter. And it will escape everything for you!

>>> 'text {0}'.format('blabla "some" bla')
'text blabla "some" bla'
>>> 'ntsnts {0}'.format("ontsi'tns")
"ntsnts ontsi'tns"
>>> 'ntsnts {0}'.format("ontsi'tn' \"ntsis")
'ntsnts ontsi\'tn\' "ntsis'

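So we can see that Python handles quotes inside its own string literals correctly (the backslashes in the last result are only how the interpreter displays the string; they are not part of its content). Could you then copy-paste the error message you get (if any)?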
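For what it's worth, XPath 1.0 string literals have no escape character, so adding backslashes to the expression is unlikely to help. Below is a minimal sketch of two ways around that, assuming lxml is installed: passing the text as an XPath variable through keyword arguments to xpath(), or building the literal with concat(). The xpath_literal helper and the variable name text are hypothetical, introduced only for this example.

```python
from io import StringIO

from lxml import etree

HTML = """<html><body><p align="center">Here is my 'test' "string"</p></body></html>"""
tree = etree.parse(StringIO(HTML), etree.HTMLParser())
expected = "Here is my 'test' \"string\""

# Option 1: pass the text as an XPath variable ($text), so no quoting is needed.
matches = tree.xpath("/html/body/p[starts-with(., $text)]", text=expected)


def xpath_literal(value):
    """Hypothetical helper: render value as an XPath 1.0 string expression."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # Both quote types present: join single-quoted pieces with a quoted "'".
    pieces = value.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in pieces) + ")"


# Option 2: build the literal with concat() and keep plain string concatenation.
expression = "/html/body/p[starts-with(., " + xpath_literal(expected) + ")]"
matches_concat = tree.xpath(expression)

print(len(matches), len(matches_concat))
```

The variable form is probably the simpler fit for getXpath, since the expression stays fixed and only the keyword argument changes; the concat() form is useful when the expression has to be a single self-contained string.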

answered Nov 04 '22 20:11

Joël
