I'm new to XPath and am trying to parse a page with it. I need to get information from a tag, but an escaped apostrophe in the title attribute breaks everything.
For parsing I use Grab.
Tag from the source:
<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'>
Actual XPath:
g.xpath('.//tr/td/a[3]/img').get('title')
Returns
commission:Alfred\\
Is there any way to fix this?
Thanks
Garbage in, garbage out. Your input is not well-formed, because it improperly escapes the single quote character. Many programming languages (including Python) use the backslash character to escape quotes in string literals. XML does not. You should either 1) surround the attribute's value with double quotes; or 2) use &apos; to include a single quote.
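If the markup is generated from Python, the quoting can be left to the standard library instead of being done by hand. The sketch below uses `xml.sax.saxutils.quoteattr`; the sample strings are only illustrations. Note that `quoteattr` picks the surrounding quote character itself and, as far as I know, never emits `&apos;`, so it corresponds to option 1 rather than option 2.

```python
from xml.sax.saxutils import quoteattr

# A value with an apostrophe only: quoteattr wraps it in double quotes.
quoted_title = quoteattr("commission:Alfred's misadventures")
print(quoted_title)   # "commission:Alfred's misadventures"

# A value with both quote kinds: double quotes are used as delimiters
# and the inner double quotes become &quot;.
quoted_both = quoteattr('He said "it\'s fine"')
print(quoted_both)    # "He said &quot;it's fine&quot;"

# Build an element string from the safely quoted value.
print("<img title=%s/>" % quoted_title)
```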
From the XML spec:
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
Because the provided "XML" is not a well-formed document (the apostrophe inside the attribute value ends the value early), no XPath expression can be evaluated on it.
The provided non-well-formed text can be corrected to:
<img src="somelink"
border="0"
alt="commission:Alfred's misadventures"
title="commission:Alfred's misadventures"/>
If for some reason the attribute values must stay single-quoted, one correct conversion is:
<img src='somelink'
border='0'
alt='commission:Alfred&apos;s misadventures'
title='commission:Alfred&apos;s misadventures'/>
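To check these claims with a strict XML parser, here is a small sketch using Python's `xml.etree.ElementTree`. The helper name `try_parse` is made up for the example, and the tags are shortened to two attributes. A lenient HTML parser may behave differently from what is shown here.

```python
import xml.etree.ElementTree as ET

# Backslash-escaped apostrophe, as in the question (raw string keeps the backslash).
malformed = r"<img src='somelink' alt='commission:Alfred\'s misadventures'/>"
# Correction 1: double-quoted attribute values.
double_quoted = '<img src="somelink" alt="commission:Alfred\'s misadventures"/>'
# Correction 2: single-quoted values with the apostrophe written as &apos;.
entity_escaped = "<img src='somelink' alt='commission:Alfred&apos;s misadventures'/>"

def try_parse(markup):
    """Return the alt attribute, or None if the markup is not well-formed."""
    try:
        return ET.fromstring(markup).get("alt")
    except ET.ParseError:
        return None

print(try_parse(malformed))       # None
print(try_parse(double_quoted))   # commission:Alfred's misadventures
print(try_parse(entity_escaped))  # commission:Alfred's misadventures
```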
If you are provided the incorrect input, in a language such as C# one can try to convert it to its correct counterpart using:
string correctXml = input.Replace("\\'s", "&apos;s");
Probably there is a similar way to do the same in Python.
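In Python, `str.replace` does the same job. The following is a minimal sketch, not Grab-specific: the function name `fix_escaped_apostrophes` is hypothetical, the tag is self-closed here so that a strict XML parser accepts it, and the replacement is a blind textual one that assumes every backslash-apostrophe pair in the input is a wrongly escaped quote.

```python
import xml.etree.ElementTree as ET

def fix_escaped_apostrophes(markup: str) -> str:
    """Replace backslash-escaped apostrophes with the XML entity &apos;."""
    return markup.replace("\\'", "&apos;")

# The tag from the question, written as a raw string so the backslashes survive.
raw = (
    r"<img src='somelink' border='0' "
    r"alt='commission:Alfred\'s misadventures' "
    r"title='commission:Alfred\'s misadventures'/>"
)

fixed = fix_escaped_apostrophes(raw)
img = ET.fromstring(fixed)
print(img.get("title"))  # commission:Alfred's misadventures
```

A value that legitimately ends in a backslash right before its closing quote would be damaged by this replacement, so it is only suitable when the input is known to follow the broken escaping convention.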