I often do PHP projects designed to scrape hierarchical data from web pages and save it to the DB (essentially, structure the data - think scraping government websites that do have the data, but do not provide it in a structured way). Each time, I try to come up with an OOP design that would allow me to achieve the following:
So far I have yet to find the solution, but the closest I got is something like this:
I define an abstract class for data containers that would implement common tree-traversing functions:
abstract class DataContainer {
    protected $parent = NULL;
    protected $children = NULL;

    public function getParent() {
        return $this->parent;
    }

    public function getChildren() {
        return $this->children;
    }
}
And then I have the actual data containers. Imagine I am scraping data on participation in parliamentary sessions down to a "specific question in a sitting" level. I would have SessionContainer, SittingContainer and QuestionContainer, which would all extend DataContainer.
The session, sitting and question data are each scraped from a different URL. Leaving the mechanism of getting the URL content aside, let's just say I need scraper classes, which would take a container and a DOMDocument for the actual parsing. So I would define a generic interface like this:
interface Scraper {
    public function scrapeData(DOMDocument $Dom, DataContainer $DataContainer);
}
Then, each of the session, sitting and question would have their own scrapers, which implement the interface. But I'd also like to ensure that they can only accept the containers they are meant for. So it would look like:
class SessionScraper implements Scraper {
    // Desired signature: narrows DataContainer to SessionContainer
    public function scrapeData(DOMDocument $DOM, SessionContainer $DataContainer) {
    }
}
Finally, I would have a generic Factory class that also implements the Scraper interface and just distributes the scraping to the relevant scrapers. Like this:
public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
    // get the scraper class name from the configuration array
    $class = $this->config[get_class($DataContainer)];
    // instantiate it and delegate the actual scraping
    $scraper = new $class();
    $scraper->scrapeData($DOM, $DataContainer);
}
This is the class that would actually be called in the code. Very similarly, I could deal with saving to the DB - each data container could have its own DBSaver class, which would implement a DBSaver interface. Again, all the calls could be done via the Factory class, which would also implement the DBSaver interface.
Everything would be perfect, but the problem is that classes that implement an interface must implement the exact signature of the interface. E.g. the method SessionScraper::scrapeData cannot accept only SessionContainer objects, it must accept all DataContainer objects. But it is not meant to!
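To make the constraint concrete, here is a minimal sketch of a variant PHP does accept: the method keeps the interface signature and narrows the container inside the body with instanceof. The class names follow the ones above; the choice of InvalidArgumentException and the empty containers are assumptions made only for illustration.

```php
<?php
// Empty stand-ins for the containers described above.
abstract class DataContainer {}
class SessionContainer extends DataContainer {}
class SittingContainer extends DataContainer {}

interface Scraper {
    public function scrapeData(DOMDocument $dom, DataContainer $container);
}

class SessionScraper implements Scraper {
    // Same signature as the interface; the narrowing to
    // SessionContainer happens at runtime inside the method.
    public function scrapeData(DOMDocument $dom, DataContainer $container) {
        if (!$container instanceof SessionContainer) {
            throw new InvalidArgumentException(
                'SessionScraper expects a SessionContainer, got ' . get_class($container)
            );
        }
        // ... parse $dom and fill $container here ...
    }
}
```

Passing a SittingContainer to SessionScraper::scrapeData() then fails with an exception when the method is called, rather than being rejected by the type hint.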
Finally, the question:
instanceof and similar checks instead of enforcing it via type hinting?
Thanks in advance for all the suggestions / criticisms. I am completely happy with somebody turning this code on its head, if necessary!
In the same way that the child class can have its own properties and methods, it can override the properties and methods of the parent class. When we override the class's properties and methods, we rewrite a method or property that exists in the parent again in the child, but assign to it a different value or code.
A single class can implement multiple interfaces. An interface is declared with the "interface" keyword. Interfaces cannot contain method bodies: every method in an interface is a signature only.
$this is a reserved pseudo-variable in PHP that refers to the calling object, normally the object to which the method belongs. It is only available inside non-static methods of a class.
Inheritance in OOP = When a class derives from another class. The child class will inherit all the public and protected properties and methods from the parent class. In addition, it can have its own properties and methods. An inherited class is defined by using the extends keyword.
Container springs to the eye. This name is very generic, you might need something more dynamic. I think you have Data and you classify it, so it has a type.
So instead of hardcoding the exact interface into the type hint, you should resolve this dynamically.
If each Container now had a type, the Scraper could signal/tell whether or not it is applicable for the type of Container.
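A rough sketch of what that could look like, assuming two hypothetical methods that are not in your code: getType() on the container and supports() on the scraper. The string type labels are also just an illustration.

```php
<?php
abstract class DataContainer {
    // Hypothetical: every container reports its type label.
    abstract public function getType();
}

class SessionContainer extends DataContainer {
    public function getType() { return 'session'; }
}

class QuestionContainer extends DataContainer {
    public function getType() { return 'question'; }
}

interface Scraper {
    // Hypothetical: the scraper tells whether it applies to a type.
    public function supports($type);
    public function scrapeData(DOMDocument $dom, DataContainer $container);
}

class SessionScraper implements Scraper {
    public function supports($type) {
        return $type === 'session';
    }

    public function scrapeData(DOMDocument $dom, DataContainer $container) {
        // The type hint stays generic; applicability is checked at runtime.
        if (!$this->supports($container->getType())) {
            throw new InvalidArgumentException('Unsupported type: ' . $container->getType());
        }
        // ... parse $dom and fill $container here ...
    }
}
```

The interface signature is the same for every scraper, so the inheritance problem from the question does not come up; the pairing of scraper and container is checked by asking the objects.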
The concrete form of scraping is actually the strategy you use for specific data to parse it. Your container encapsulates this strategy providing an interface to the normalized data.
You only need to add some logic/contract between Container and Scraper so that they can talk to each other. You can put this contract inside the interface of both.
This would also allow you to have a Scraper that can deal with multiple types if you want to stretch it.
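For example, a scraper covering several types could simply keep a list of them. This is only a sketch: the AgendaScraper class, its supports() method and the type labels are made up for illustration.

```php
<?php
// Hypothetical scraper that is applicable to more than one container type.
class AgendaScraper {
    private $types = ['sitting', 'question'];

    public function supports($type) {
        // Strict comparison so only exact type labels match.
        return in_array($type, $this->types, true);
    }
}

$scraper = new AgendaScraper();
$forQuestion = $scraper->supports('question'); // true
$forSession  = $scraper->supports('session');  // false
```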
For your Container, take a look at the SPL as well: if you implement some of its interfaces, you get iterators (and recursive iterators) for free. This might be the generic structure you're referring to, and the SPL could boost the usability of your Container classes.
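As a sketch of the SPL idea, the container below implements IteratorAggregate and Countable, and a small RecursiveIterator lets RecursiveIteratorIterator walk the whole tree. It assumes PHP 7.1 or later for the return types; addChild(), getName() and ContainerIterator are hypothetical names, and DataContainer is made concrete here only to keep the demo short.

```php
<?php
// Iterator over a list of containers that knows how to descend into them.
class ContainerIterator extends ArrayIterator implements RecursiveIterator {
    public function hasChildren(): bool {
        // DataContainer is Countable, so count() gives its number of children.
        return count($this->current()) > 0;
    }

    public function getChildren(): ?RecursiveIterator {
        return $this->current()->getIterator();
    }
}

class DataContainer implements IteratorAggregate, Countable {
    protected $name;
    protected $parent = null;
    protected $children = [];

    public function __construct($name) { $this->name = $name; }
    public function getName() { return $this->name; }
    public function getParent() { return $this->parent; }

    public function addChild(DataContainer $child) {
        $child->parent = $this;
        $this->children[] = $child;
        return $child;
    }

    public function getIterator(): Traversable {
        return new ContainerIterator($this->children);
    }

    public function count(): int {
        return count($this->children);
    }
}

$session = new DataContainer('session 1');
$sitting = $session->addChild(new DataContainer('sitting 1'));
$sitting->addChild(new DataContainer('question 1'));
$sitting->addChild(new DataContainer('question 2'));

// SELF_FIRST visits each node before its descendants.
$walker = new RecursiveIteratorIterator(
    $session->getIterator(),
    RecursiveIteratorIterator::SELF_FIRST
);
$lines = [];
foreach ($walker as $node) {
    $lines[] = str_repeat('  ', $walker->getDepth()) . $node->getName();
}
```

With this in place a plain foreach over a container yields its direct children, and the recursive walk yields every descendant with its depth, without any hand-written traversal code.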
You do not need to hardcode everything in OOP, you can keep things dynamic and especially in PHP you normally resolve things at runtime.
This will also make it easier to replace Scrapers with a new version. As Scrapers would now have a type by definition (as suggested above), you can resolve at runtime which concrete class should do the scraping, e.g. by dynamically loading them from a .php file in a nice file-system structure.
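One way to resolve the scraper at runtime, as a sketch only: a resolver that is itself a Scraper, keeps a list of registered scrapers and delegates to the first one that says it supports the container's type. ScraperResolver, register(), find(), supports(), getType() and the public $data array are all assumptions for illustration; loading scraper classes from files is left out and replaced by explicit registration.

```php
<?php
abstract class DataContainer {
    public $data = [];
    abstract public function getType();
}

class SessionContainer extends DataContainer {
    public function getType() { return 'session'; }
}

class SittingContainer extends DataContainer {
    public function getType() { return 'sitting'; }
}

interface Scraper {
    public function supports($type);
    public function scrapeData(DOMDocument $dom, DataContainer $container);
}

class SessionScraper implements Scraper {
    public function supports($type) { return $type === 'session'; }

    public function scrapeData(DOMDocument $dom, DataContainer $container) {
        // Toy parsing step: store the page title in the container.
        $title = $dom->getElementsByTagName('title')->item(0);
        $container->data['title'] = $title ? $title->textContent : null;
    }
}

class ScraperResolver implements Scraper {
    private $scrapers = [];

    public function register(Scraper $scraper) { $this->scrapers[] = $scraper; }

    // Returns the first registered scraper that supports $type, or null.
    public function find($type) {
        foreach ($this->scrapers as $scraper) {
            if ($scraper->supports($type)) {
                return $scraper;
            }
        }
        return null;
    }

    public function supports($type) { return $this->find($type) !== null; }

    public function scrapeData(DOMDocument $dom, DataContainer $container) {
        $scraper = $this->find($container->getType());
        if ($scraper === null) {
            throw new RuntimeException('No scraper for type: ' . $container->getType());
        }
        $scraper->scrapeData($dom, $container);
    }
}

$resolver = new ScraperResolver();
$resolver->register(new SessionScraper());

$dom = new DOMDocument();
$dom->loadHTML('<html><head><title>Session 12</title></head><body></body></html>');

$session = new SessionContainer();
$resolver->scrapeData($dom, $session);
```

Swapping in a new version of a scraper then means registering a different object; neither the resolver nor the calling code changes.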
Just my 2 cents.