I need to programmatically determine whether an RSS feed exposes the full content of its articles or just excerpts of them. How would you do it?
Look for a link at the end of the item that says "More", "Continued", "Full article", "..." or similar. The alternative is to follow every link in the item and check whether the target page contains the text from the feed plus extra content.
I don't think there is a very clean way of doing this, but here are two "hacky" ones:
I'd parse the text of each RSS item and look for any links in it. Granted, there could be multiple links there (some to other blog posts), but if you focus on the last one and come up with a few heuristic phrases for the link text (e.g. "more", "read full", etc.), you should be able to catch a lot of them. For more confidence, you can look only at the links that point back to the original blog.
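Here is a minimal sketch of that first heuristic in Python, using only the standard library. It assumes you have already pulled the item's description HTML out of the feed. The names `LinkCollector`, `looks_truncated`, `MORE_PATTERN` and the `site_host` parameter are made up for illustration, and the phrase list is just a starting guess that you would need to tune.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical phrase list; extend it for the feeds you care about.
MORE_PATTERN = re.compile(
    r"\b(?:more|continued?|full (?:article|story|post))\b|\.\.\.|…",
    re.IGNORECASE,
)


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element, in order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def looks_truncated(description_html, site_host=None):
    """Guess whether an item is an excerpt by inspecting its last link.

    If site_host is given, only links pointing at that host are considered.
    """
    parser = LinkCollector()
    parser.feed(description_html)
    parser.close()
    links = parser.links
    if site_host is not None:
        links = [link for link in links if urlparse(link[0]).netloc == site_host]
    if not links:
        return False
    _href, text = links[-1]
    return bool(MORE_PATTERN.search(text))
```

For example, `looks_truncated('<p>Intro <a href="http://example.com/post">Read more</a></p>')` returns `True`, while an item whose only link reads "a related post" returns `False`. Expect false positives and negatives: this only checks link text, not what the link actually leads to.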
A more rigorous method would be to follow all the links and check whether the RSS fragment is a subset of the page that comes back, or whether there is substantial overlap. This may not help when the site uses a true summary as opposed to a fragment of the full post, though.
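The comparison could be sketched roughly like this, again with only the Python standard library. It assumes you have already downloaded the linked page yourself (no fetching is shown), and it measures overlap with word n-grams, which is my choice of technique rather than the only option. `TextExtractor`, `words`, `shingles` and `overlap` are illustrative names.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def words(html):
    """Lowercased word tokens of the visible text in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.chunks).lower())


def shingles(tokens, n):
    """Set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(feed_html, page_html, n=3):
    """Return (containment, coverage).

    containment: share of the feed's n-grams that also occur on the page.
    coverage: share of the page's n-grams that also occur in the feed.
    """
    feed = shingles(words(feed_html), n)
    page = shingles(words(page_html), n)
    if not feed or not page:
        return 0.0, 0.0
    common = len(feed & page)
    return common / len(feed), common / len(page)
```

A high containment with a low coverage suggests the feed text is a fragment of a longer page. A containment near zero suggests a hand-written summary (or the wrong link), which is the case this method cannot resolve. Navigation, sidebars and comments on the page will drag coverage down even for full-content feeds, so any threshold you pick would need testing against real feeds.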