
Python XPath parsing tag with apostrophe

I'm new to XPath and am trying to parse a page with it. I need to get information from a tag, but an escaped apostrophe in the title attribute breaks everything.

For parsing I use Grab.

Tag from source:

<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'>

Actual XPath:

g.xpath('.//tr/td/a[3]/img').get('title')

Returns

commission:Alfred\\

Is there any way to fix this?

Thanks

Stanislav Golovanov asked Dec 10 '11 20:12


2 Answers

Garbage in, garbage out. Your input is not well-formed, because it improperly escapes the single quote character. Many programming languages (including Python) use the backslash character to escape quotes in string literals. XML does not. You should either 1) surround the attribute's value with double-quotes; or 2) use &apos; to include a single quote.

From the XML spec:

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
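Both options can be checked quickly in Python. This is only a minimal sketch using the standard library's xml.etree.ElementTree rather than Grab, and the shortened self-closing img tags are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Option 1: double-quoted attribute value, apostrophe written literally.
double_quoted = """<img title="commission:Alfred's misadventures"/>"""

# Option 2: single-quoted attribute value, apostrophe written as &apos;.
entity_escaped = """<img title='commission:Alfred&apos;s misadventures'/>"""

# Parse each snippet and read the title attribute back.
titles = [
    ET.fromstring(markup).get("title")
    for markup in (double_quoted, entity_escaped)
]
print(titles)
```

Either form should give back the full title with a plain apostrophe in it.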

Wayne answered Oct 23 '22 01:10


As the provided "XML" isn't a well-formed document due to the nested apostrophes, no XPath expression can be evaluated on it.

The provided non-well-formed text can be corrected to:

<img src="somelink"
 border="0"
 alt="commission:Alfred's misadventures"
 title="commission:Alfred's misadventures"/>

In case there is a weird requirement not to use double quotes, then one correct conversion is:

<img src='somelink'
 border='0'
 alt='commission:Alfred&apos;s misadventures'
 title='commission:Alfred&apos;s misadventures'/>

If you are given the incorrect input, in a language such as C# you can try to convert it to its correct counterpart using:

string correctXml = input.Replace("\\'s", "&apos;s");

Probably there is a similar way to do the same in Python.
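In Python the same idea can be written with str.replace. The following is a rough sketch, not a tested recipe for Grab: the helper name repair_backslash_apostrophes is hypothetical, the tag is self-closed here only so that the standard library's XML parser accepts it, and it replaces every \' in the text rather than just \'s.

```python
import xml.etree.ElementTree as ET


def repair_backslash_apostrophes(markup: str) -> str:
    # Hypothetical helper: swap each backslash-escaped apostrophe
    # for the &apos; entity that XML parsers understand.
    return markup.replace("\\'", "&apos;")


# The tag from the question; "/>" is added (an assumption) so that
# ElementTree can parse the snippet on its own.
broken = (
    r"<img src='somelink' border='0' "
    r"alt='commission:Alfred\'s misadventures' "
    r"title='commission:Alfred\'s misadventures'/>"
)

fixed = repair_backslash_apostrophes(broken)
title = ET.fromstring(fixed).get("title")
print(title)
```

This is a blunt text substitution applied to the raw page source before parsing, so it would also rewrite any \' that appears outside attribute values (for example inside inline scripts). If that matters for your page, narrow the replacement.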

Dimitre Novatchev answered Oct 23 '22 01:10