A reliable way to scrape title, description and keywords

Tags: php, curl, title

Currently I'm using cURL to scrape a website. I want to reliably get the title, description and keywords.

// Parse for the title, description and keywords
if (strlen($link_html) > 0)
{
    $tags = get_meta_tags($link);     // reads <meta name="..."> tags
    $link_keywords = $tags['keywords'] ?? '';     // fall back to '' when the tag is missing
    $link_description = $tags['description'] ?? '';
}

The only problem is that people now use all kinds of meta tags, such as Open Graph: <meta property="og:title" content="The Rock" />. The casing of tags also varies a lot: <title>, <Title>, <TITLE>, <tiTle>. It's very difficult to extract these reliably.

I really need some code that will extract these values consistently: if a title, keywords or description is provided in any form, the code should find it. Right now it is very hit and miss.

Perhaps there is a way to extract all titles into a titles array? Then the developer doing the scraping can choose the best one to record in their database. The same would apply to keywords and description.

This is not a duplicate. I have searched Stack Overflow and found no solution that collects all "title", "keywords" and "description" type tags into arrays.

asked Dec 21 '15 06:12

Amy Neville

1 Answer

Generally, get_meta_tags() should get you most of what you need. You just need to set up a set of cascading checks that sample the required field from each metadata system until one is found. For example, something like this:

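function get_title($url) {
  $tags = get_meta_tags($url);    // <meta name="..."> tags
  $props = get_meta_props($url);  // hypothetical helper, not implemented here
  // ?? falls through to the next source when a key is missing;
  // || would only produce a boolean in PHP
  return $tags["title"] ?? $props["og:title"] ?? ...
}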

The above implementation is obviously not efficient (if all the getters were implemented like this, the URL would be reloaded for each getter), and I didn't implement get_meta_props(), which is problematic to implement correctly using the preg_* (PCRE) functions and tedious to implement using DOMDocument.
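To make the idea concrete, here is a minimal sketch of the DOMDocument route. It assumes the HTML has already been fetched once (for example with cURL) and is passed in as a string. The function name collect_page_metadata() and the sample page are made up for illustration and are not part of PHP or of any library. It gathers every title, description and keywords candidate into arrays, so the caller can apply whatever cascade it prefers.

```php
<?php
// Hypothetical helper (not part of PHP): parse the HTML once and collect every
// title, description and keywords candidate into arrays.
function collect_page_metadata(string $html): array
{
    $found = ['title' => [], 'description' => [], 'keywords' => []];
    if (trim($html) === '') {
        return $found; // nothing to parse
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The HTML parser lowercases element names, so <TITLE> and <tiTle> match too.
    foreach ($doc->getElementsByTagName('title') as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $found['title'][] = $text;
        }
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Classic tags use name="..."; Open Graph uses property="...".
        $key = strtolower($meta->getAttribute('name') ?: $meta->getAttribute('property'));
        $content = trim($meta->getAttribute('content'));
        // Drop a prefix such as "og:" or "twitter:" to get the bare field name.
        $field = preg_replace('/^[a-z]+:/', '', $key);
        if ($content !== '' && isset($found[$field])) {
            $found[$field][] = $content;
        }
    }

    return $found;
}

// Made-up sample page, standing in for HTML that was already downloaded.
$html = '<html><head><TITLE>Plain title</TITLE>'
      . '<meta name="Description" content="Plain description">'
      . '<meta property="og:title" content="The Rock" />'
      . '<meta name="keywords" content="rock, film"></head><body></body></html>';

$found = collect_page_metadata($html);

// Simple cascade: take the first candidate of each kind, or null if none exists.
$title = $found['title'][0] ?? null;
$description = $found['description'][0] ?? null;
```

With the sample page, $found['title'] holds both the <title> text and the og:title value, in document order. This sketch ignores character-encoding detection and treats every <title> element in the document as a candidate, so it is a starting point and not a complete solution.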

Still, a correct implementation is conceptually simple though a lot of work, which is a classic scenario for an external library to solve the problem! Fortunately, there is one for just that, called simply "Embed". You can find it on GitHub, or install it with Composer by running:

composer require embed/embed

answered Sep 30 '22 11:09

Guss
