Currently I'm using cURL to scrape a website. I want to reliably get the title, description and keywords.
    // Parse for the title, description and keywords
    if (strlen($link_html) > 0)
    {
        // get_meta_tags() only reads <meta name="..."> tags and lowercases the keys
        $tags = get_meta_tags($link);
        $link_keywords = $tags['keywords'] ?? null;       // null when the tag is missing
        $link_description = $tags['description'] ?? null;
    }
The only problem is that people are now using all kinds of meta tags, such as Open Graph: <meta property="og:title" content="The Rock" />. They also vary the casing of the tags a lot: <title>, <Title>, <TITLE>, <tiTle>. It's very difficult to get these reliably.
I really need some code that will extract these variables consistently: if a title, keywords and description are provided in any form, it should find them. Right now it seems very hit and miss.
Perhaps a way to extract all titles into a titles array? Then the developer doing the scraping can choose the best one to record in their database. The same would apply to keywords and description.
This is not a duplicate. I have searched through Stack Overflow and nowhere is there a solution that places all "title", "keywords" and "description" type tags into arrays.
Generally, get_meta_tags() should get you most of what you need; you just need to set up a series of cascading checks that sample the required field from each metadata system until one is found. For example, something like this:
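    function get_title($url) {
        $tags = get_meta_tags($url);
        $props = get_meta_props($url); // hypothetical helper, not implemented here
        // ?? returns the first value that is set; || would only yield true/false
        return $tags['title'] ?? $props['og:title'] ?? null; // ... and so on for other metadata systems
    }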
The above implementation is obviously not efficient (if we implement all the getters like this, you'd reload the URL for each getter), and I didn't implement get_meta_props(), which is problematic to implement correctly using the PCRE functions (preg_*) and tedious to implement using DOMDocument.
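To give an idea of what the DOMDocument route could look like, here is a minimal sketch that parses an HTML string once and collects every title, description and keywords candidate into arrays, as the question asks. The function name collect_meta_candidates() is made up for this example, and the rule of stripping a prefix such as "og:" or "twitter:" to find the field name is my own assumption, not a standard. It also ignores character-encoding issues, so treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper (not part of PHP or of any library): gathers every
// title/description/keywords candidate found in an HTML string.
function collect_meta_candidates(string $html): array
{
    $found = ['title' => [], 'description' => [], 'keywords' => []];
    if (trim($html) === '') {
        return $found; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world markup is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The HTML parser lowercases element names, so <TITLE> and <tiTle> match too.
    foreach ($doc->getElementsByTagName('title') as $node) {
        $found['title'][] = trim($node->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Classic tags use name="...", Open Graph uses property="...".
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
        $key = strtolower(trim($key));
        $content = trim($meta->getAttribute('content'));
        if ($key === '' || $content === '') {
            continue;
        }
        // Assumption: "og:title" and "twitter:title" both count as a title.
        $field = preg_replace('/^[a-z]+:/', '', $key);
        if (isset($found[$field])) {
            $found[$field][] = $content;
        }
    }

    // Drop empty strings and duplicates, keeping document order.
    foreach ($found as $field => $values) {
        $found[$field] = array_values(array_unique(array_filter($values, 'strlen')));
    }
    return $found;
}

// Example input; in practice $html would be the body you fetched with cURL.
$html = '<html><head><TITLE>Page title</TITLE>'
      . '<meta property="og:title" content="The Rock" />'
      . '<meta name="Description" content="A film" />'
      . '<meta name="keywords" content="rock, film" />'
      . '</head><body></body></html>';

$meta = collect_meta_candidates($html);
$best_title = $meta['title'][0] ?? null; // first candidate wins; pick your own rule
```

Because the page is parsed only once, this avoids the reload-per-getter problem mentioned above, and the caller can decide which candidate in each array to store.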
Still, a correct implementation is simple in principle but a lot of work, which is a classic scenario for an external library to solve! Fortunately, there is one for just that, called simply "Embed". You can find it on GitHub, or install it with Composer by running:
composer require embed/embed