Getting back URLs while loading multiple URLs with YQL

Tags: html, url, load, yql

I'm using YQL to fetch a bunch of pages, some of which could be offline (obviously I don't know which ones). I'm using this query:

SELECT * FROM html WHERE url IN ("http://www.whooma.net", "http://www.dfdsfsdgsfagdffgd.com", "http://www.cnn.com")

The first and the last are real sites, while the second one obviously doesn't exist. Two results are returned, but the URL each one was loaded from doesn't appear anywhere in the response. How can I find out which HTML page belongs to which URL when not every page in the query loads?

Matteo Monti asked Oct 02 '13 20:10


2 Answers

Unfortunately, I don't know of a way to get key => value pairs in the response, with the URL as the key and the HTML response as the value. But you can try the following query and see if it meets your use case:

select * from yql.query.multi where queries="select * from html where url='http://www.whooma.net';select * from feed where url='http://www.dfdsfsdgsfagdffgd.com';select * from html where url='http://www.cnn.com'"

Before firing the query, keep the URLs in an array, in the same order as they appear in the queries, like so: ['http://www.whooma.net','http://www.dfdsfsdgsfagdffgd.com','http://www.cnn.com']. Call this array A. When you iterate over the response from the YQL query, the URL that does not exist will return null. A sample response from the above query:

<results>
  <results>
    // Response from select * from html where url='http://www.whooma.net'. This should be some html
  </results>
  <results>
    // Response from select * from feed where url='http://www.dfdsfsdgsfagdffgd.com'. This should be null.
  </results>
  <results>
    // select * from html where url='http://www.cnn.com'. This should also be some html
  </results>
</results>

In conclusion, you can iterate over array A and the YQL response in parallel. The first element of array A corresponds to the first inner results element of the YQL response, the second to the second, and so on. In effect, you are building a hashmap from two arrays. I know the answer is long, but I think it was needed. Let me know if anything is unclear.
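The pairing step described above could be sketched in Python roughly as follows. This is only an illustration, not part of YQL: the function name `pair_urls_with_results` is made up, and the `results` list is a hand-written placeholder standing in for the inner results elements after you have parsed the response (with the failed lookup represented as `None`).

```python
def pair_urls_with_results(urls, results):
    """Map each URL to the result at the same position (None if it failed)."""
    if len(urls) != len(results):
        raise ValueError("expected one result slot per URL")
    return dict(zip(urls, results))


# Array A: the URLs in the same order as the sub-queries.
urls = [
    "http://www.whooma.net",
    "http://www.dfdsfsdgsfagdffgd.com",
    "http://www.cnn.com",
]

# Placeholder for the parsed inner results (assumed shape, not real output):
# the middle query came back null.
results = ["<html>whooma</html>", None, "<html>cnn</html>"]

pages = pair_urls_with_results(urls, results)
failed = [url for url, page in pages.items() if page is None]
print(failed)  # ['http://www.dfdsfsdgsfagdffgd.com']
```

The length check is there because the whole approach relies on every sub-query producing exactly one results element, even when the page fails to load. Note that a dict collapses duplicate URLs, so keep a list of pairs instead if the same URL can appear twice.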

Karan Ashar answered Nov 03 '22 07:11


You can figure out which URLs aren't loading by using the YQL diagnostics flag. It causes the response to include a diagnostics property with a url array that indicates whether the corresponding servers were found. Presumably, once you eliminate the URLs that didn't load, the result pages will match up, in order, with the remaining URLs.
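A rough Python sketch of that elimination step might look like this. It assumes, without confirmation from the answer, that each entry of the diagnostics url array has been parsed into a dict holding the URL under a `content` key, plus an `error` key when the fetch failed; check an actual response before relying on those names. The function name and the sample data are hypothetical.

```python
def match_loaded_urls(diagnostic_urls, results):
    """Drop URLs flagged as failed, then pair the rest with results in order."""
    # Assumed entry shape: {"content": <url>, "error": <message, only on failure>}
    loaded = [entry["content"] for entry in diagnostic_urls if "error" not in entry]
    failed = [entry["content"] for entry in diagnostic_urls if "error" in entry]
    if len(loaded) != len(results):
        raise ValueError("loaded URLs and results do not line up")
    return dict(zip(loaded, results)), failed


# Hypothetical diagnostics entries for the three URLs in the question.
diagnostic_urls = [
    {"content": "http://www.whooma.net"},
    {"content": "http://www.dfdsfsdgsfagdffgd.com", "error": "hypothetical error text"},
    {"content": "http://www.cnn.com"},
]

# Placeholder for the two result pages that were actually returned.
results = ["<html>whooma</html>", "<html>cnn</html>"]

pages, failed = match_loaded_urls(diagnostic_urls, results)
print(sorted(pages))
print(failed)
```

This only works if the diagnostics entries and the result pages come back in the same order as the URLs in the query, which is the same presumption the answer makes.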

Nathan Breit answered Nov 03 '22 09:11