I have a bunch of URLs stored in a table waiting to be scraped by a script. However, many of those URLs are from the same site. I would like to return those URLs in a "site-friendly" order (that is, try to avoid two URLs from the same site in a row) so I won't accidentally get blocked for making too many HTTP requests in a short time.
The database layout is something like this:
create table urls (
    site varchar,        -- holds e.g. www.example.com or stackoverflow.com
    url  varchar unique
);
Example result:
SELECT url FROM urls ORDER BY mysterious_round_robin_function(site);

http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags
I thought of something like "ORDER BY site <> @last_site DESC", but I have no idea how to go about writing something like that.
See this article in my blog for more detailed explanations on how it works:
With PostgreSQL 8.4 or later, which supports window functions:
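SELECT  *
FROM    (
        -- rn numbers each site's URLs 1, 2, 3, ... in url order
        SELECT  site, url, ROW_NUMBER() OVER (PARTITION BY site ORDER BY url) AS rn
        FROM    urls
        ) q     -- PostgreSQL requires an alias on a subquery in FROM
ORDER BY
        rn, site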
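To make the ordering concrete, here is a minimal Python sketch of the same idea done on the client side. It is an illustration only, not part of the query above: the function name `round_robin_order` and the sample rows are assumptions, and Python compares strings by code point, which may differ from the collation your database uses.

```python
from collections import defaultdict


def round_robin_order(rows):
    """Order (site, url) pairs the way ORDER BY rn, site would.

    `rows` is an iterable of (site, url) tuples. The name and the
    input shape are hypothetical, chosen for this sketch.
    """
    by_site = defaultdict(list)
    for site, url in rows:
        by_site[site].append(url)

    ranked = []
    for site, urls in by_site.items():
        # rn mirrors ROW_NUMBER() OVER (PARTITION BY site ORDER BY url)
        for rn, url in enumerate(sorted(urls), start=1):
            ranked.append((rn, site, url))

    # Sorting the tuples orders by rn first, then site.
    ranked.sort()
    return [url for _, _, url in ranked]


# Sample data taken from the question.
rows = [
    ("www.example.com", "http://www.example.com/some/file"),
    ("www.example.com", "http://www.example.com/some/other/file"),
    ("stackoverflow.com", "http://stackoverflow.com/questions/ask"),
    ("stackoverflow.com", "http://stackoverflow.com/tags"),
    ("use.perl.org", "http://use.perl.org/"),
]

for url in round_robin_order(rows):
    print(url)
```

Each "round" contains at most one URL per site. Two URLs from the same site can still end up adjacent at the boundary between rounds, or at the tail when one site has many more URLs than the others.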
With older versions:
SELECT  site,
        (
        SELECT  url
        FROM    urls ui
        WHERE   ui.site = sites.site
        ORDER BY
                url
        OFFSET  total
        LIMIT   1
        ) AS url
FROM    (
        SELECT  site, generate_series(0, cnt - 1) AS total
        FROM    (
                SELECT  site, COUNT(*) AS cnt
                FROM    urls
                GROUP BY
                        site
                ) s
        ) sites
ORDER BY
        total, site
This variant returns the same ordering, though it can be less efficient.
I think you're overcomplicating this. Why not just use
ORDER BY NewID()
You are asking for round-robin, but I think a simple
SELECT site, url FROM urls ORDER BY RANDOM()
will do the trick. It should work even if urls from the same site are clustered in db.