I am trying to use YQL to extract a portion of HTML from a series of web pages. The pages themselves have slightly different structures (so a Yahoo Pipes "Fetch Page" with its "Cut content" feature does not work well), but the fragment I am interested in always has the same class attribute.
If I have an HTML page like this:
<html>
<body>
<div class="foo">
<p>Wolf</p>
<ul>
<li>Dog</li>
<li>Cat</li>
</ul>
</div>
</body>
</html>
and use a YQL expression like this:
SELECT * FROM html
WHERE url="http://example.com/containing-the-fragment-above"
AND xpath="//div[@class='foo']"
what I get back are the (apparently unordered?) DOM elements, whereas what I want is the HTML content itself. I've tried SELECT content as well, but that only selects textual content. I want HTML. Is this possible?
You could write a little Open Data Table that sends out a normal YQL html table query and stringifies the result. Something like the following:
<?xml version="1.0" encoding="UTF-8" ?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
<meta>
<sampleQuery>select * from {table} where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'</sampleQuery>
<description>Retrieve HTML document fragments</description>
<author>Peter Cowburn</author>
</meta>
<bindings>
<select itemPath="result.html" produces="JSON">
<inputs>
<key id="url" type="xs:string" paramType="variable" required="true"/>
<key id="xpath" type="xs:string" paramType="variable" required="true"/>
</inputs>
<execute><![CDATA[
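// Run the stock html table query with the caller's url and xpath; .results.* is the E4X list of matched nodes.
var results = y.query("select * from html where url=@url and xpath=@xpath", {url:url, xpath:xpath}).results.*;
// Serialize each matched node back into its markup.
var html_strings = [];
for each (var item in results) html_strings.push(item.toXMLString());
// Return the serialized fragments as the table's response.
response.object = {html: html_strings};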
]]></execute>
</select>
</bindings>
</table>
You could then query against that custom table with a YQL query like:
use "http://url.to/your/datatable.xml" as html.tostring;
select * from html.tostring where
url="http://finance.yahoo.com/q?s=yhoo"
and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li'
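If you want to run that query from a script rather than the YQL console, the statement has to be URL-encoded into a request. Here is a minimal sketch of building such a URL. The endpoint constant is an assumption based on YQL's historical public REST endpoint (it is not part of the answer above and may no longer be reachable), and buildYqlUrl is a hypothetical helper name.

```javascript
// Assumption: YQL's historical public REST endpoint; not taken from the answer.
const YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: builds the request URL for the custom html.tostring table.
// tableUrl is where the Open Data Table XML is hosted, pageUrl is the page to
// scrape, and xpath selects the fragment to return as an HTML string.
function buildYqlUrl(tableUrl, pageUrl, xpath) {
  const statement =
    'use "' + tableUrl + '" as html.tostring; ' +
    'select * from html.tostring where url="' + pageUrl + '" ' +
    "and xpath='" + xpath + "'";
  // encodeURIComponent escapes quotes, spaces, "?" and "&" inside the statement.
  return YQL_ENDPOINT + "?q=" + encodeURIComponent(statement) + "&format=json";
}

const requestUrl = buildYqlUrl(
  "http://url.to/your/datatable.xml",
  "http://finance.yahoo.com/q?s=yhoo",
  '//div[@id="yfi_headlines"]/div[2]/ul/li'
);
console.log(requestUrl);
```

The xpath is wrapped in single quotes here because it contains double quotes itself; an xpath containing single quotes would need the opposite quoting.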
Edit: Just realised this is a pretty old question that was bumped; at least an answer is here, eventually, for anyone stumbling on the question. :)
I had this exact same problem. The only way I have gotten around it is to avoid YQL and just use regular expressions to match the start and end tags. It is not the best solution, but if the HTML is relatively unchanging and the fragment always runs from, say, <div class='name'> to <div class='just_after'>, then you can get away with that and take the HTML in between.
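A rough sketch of that regex approach, assuming you already have the page source as a string (fetching it is left out). The function names and the sample markup are made up for illustration.

```javascript
// Escape characters that have special meaning in a regular expression,
// so the start and end tags are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the HTML between the first occurrence of startTag and the next
// occurrence of endTag, or null when the pair is not found.
function htmlBetween(html, startTag, endTag) {
  const pattern = new RegExp(
    // [\s\S]*? matches any characters, newlines included, non-greedily.
    escapeRegExp(startTag) + "([\\s\\S]*?)" + escapeRegExp(endTag)
  );
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

// Made-up sample page: the wanted div is followed by a known sibling div.
const page =
  "<div class='name'><p>Wolf</p><ul><li>Dog</li></ul></div>" +
  "<div class='just_after'>other</div>";

const fragment = htmlBetween(page, "<div class='name'>", "<div class='just_after'>");
console.log(fragment);
```

Note that the captured text includes the closing </div> of the first element, since the match simply runs up to the next start tag. It also breaks as soon as the quoting or attribute order in the markup changes, which is why this only works when the HTML is stable.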