
Simple web scraping in PHP

Tags:

php

To make it clear from the beginning: I have full permission from the website administrator to do this until they build an API.

What I want to do is get, say, a number or any other piece of data found in a specific part of the site, although its position within the line can change.

For example, suppose I stored the page's HTML in a variable using file_get_contents, and somewhere in the source it said "<p>User status: Online.</p>". I would need to store the text between "status: " and ".</p>" in a variable. I only know those two strings, but I also know there is only one place where both appear on the same line.

EDIT: I seem to have forgotten the most important part. The question is how to do what I just described: given a lot of text, how can I find what's between one piece of text and another, and store it in a variable?

Markski asked Sep 30 '17 01:09


1 Answer

There are a couple of ways to scrape websites: one is to use CSS selectors and another is to use XPath. Both select elements from the DOM.

Since I can't see the full HTML of the page, it's hard for me to say which method would suit you better. There is another option that is often frowned upon, but in this case it might work.

You could use a regex (regular expression) to find the characters. I'm not the best at regular expressions, but here is some sample code showing how that might work:
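To give an idea of the XPath route, here is a minimal sketch using PHP's bundled DOM extension. The sample markup and the assumption that the status lives in a <p> element starting with "User status: " are mine, since I haven't seen the real page, so treat it as a starting point rather than a drop-in solution.

```php
<?php

// Hypothetical sample markup standing in for the real page.
$html = '<html><body><p>Some User</p><p>User status: Online.</p></body></html>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select every <p> whose text starts with the known label (an assumption
// about the page structure).
$nodes = $xpath->query('//p[starts-with(normalize-space(.), "User status: ")]');

$status = null;
if ($nodes !== false && $nodes->length > 0) {
    $text = trim($nodes->item(0)->textContent);
    // Drop the label, then the trailing period if there is one.
    $status = rtrim(substr($text, strlen('User status: ')), '.');
}

echo $status, PHP_EOL;
```

The upside of going through the DOM is that it keeps working if attributes or whitespace around the tag change, which a plain string match may not.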

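<?php

// Sample HTML; in practice this would come from file_get_contents().
$subject = "<html><body><p>Some User</p><p>User status: Online.</p></body></html>";
// Lazy (.*?) stops at the first </p>; only the slash needs escaping.
$pattern = '/User status: (.*?)<\/p>/';
if (preg_match($pattern, $subject, $matches)) {
    print_r($matches);
}

?>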

Sample output:

Array
(
    [0] => User status: Online.</p>
    [1] => Online.
)

The regex above looks for the literal string "User status: ", then captures every character that follows, up to the closing paragraph tag (its slash is escaped because / is also the pattern delimiter). preg_match stores the full match in $matches[0] and the captured text in $matches[1].

Here is a pattern that returns just "Online" without the period. I wasn't sure whether all statuses end in a period, but this is what it would look like:

'/User status: (.*?)\.<\/p>/'
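If you need this for arbitrary start and end strings, the same idea can be wrapped in a small function. This is only a sketch: textBetween is a name I made up, and it assumes both markers sit on the same line, as described in the question. preg_quote takes care of escaping any regex metacharacters in the two markers.

```php
<?php

/**
 * Hypothetical helper: returns the text between the first occurrence of
 * $start and the next occurrence of $end, or null when nothing matches.
 */
function textBetween(string $html, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters in both markers,
    // including the "/" delimiter passed as the second argument.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/';

    // Without the "s" modifier, "." does not match newlines,
    // so both markers must be on the same line.
    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[1];
    }

    return null;
}

// Hypothetical sample input; real input would come from file_get_contents().
$html = "<p>Some User</p>\n<p>User status: Online.</p>\n<p>Done.</p>";
$status = textBetween($html, 'status: ', '.</p>');
echo $status ?? 'not found', PHP_EOL;
```

Returning null when nothing matches lets the caller tell "no status found" apart from an empty status.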
Asleepace answered Oct 11 '22 14:10