Crawl specific pages and data and make it searchable [closed]

Question

Important note: the questions below aren't meant to break ANY data copyrights. All crawled and saved data is being linked directly to the source.

For a client I'm gathering information for building a search engine/web spider combination. I do have experience with indexing webpages' inner links to a specific depth. I also have experience in scraping data from webpages. However, in this case, the volume is larger than I have experience with, so I was hoping to gain some knowledge and insight into best practices for doing so.

First of all, what I need to make clear is that the client is going to deliver a list of websites that are going to be indexed. So it is, in fact, a vertical search engine. The results only need to have a link, title and description (like the way Google displays results). The main purpose of this search engine is to make it easier for visitors to search large numbers of sites and results to find what they need. So: Website A contains a bunch of links -> save all links together with meta data.

Secondly, there's a more specific search engine, one that also indexes all the links to (let's call them) articles. These articles are spread over many smaller sites, each with fewer articles than the sites that end up in the vertical search engine. The reason is simple: the articles found on these pages have to be scraped in as much detail as possible. This is where the first problem lies: it would take a huge amount of time to write a scraper for each website. Data that needs to be collected is, for example: city name, article date, article title. So: Website B contains more detailed articles than website A; we are going to index these articles and scrape useful data.

I do have a method in my mind which might work, but that involves writing a scraper for each individual website, in fact it's the only solution I can think of right now. Since the DOM of each page is completely different I see no option to build a fool-proof algorithm that searches the DOM and 'knows' what part of the page is a location (however... it's a possibility if you can match the text against a full list of cities).

A few things that crossed my mind:

Vertical Search Engine

For the vertical search engine it's pretty straightforward: we have a list of webpages that need to be indexed, so it should be fairly simple to crawl all pages that match a regular expression and store the full list of these URLs in a database.
I might want to split up saving page data (meta description, title, etc.) into a separate process to speed up the indexing.
There is a possibility that there will be duplicate data in this search engine due to websites that have matching results/articles. I haven't made up my mind on how to filter these duplicates; perhaps on article title, but in the business segment where the data comes from there's a huge chance of duplicate titles belonging to different articles.
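The first point above can be sketched in a few lines. This is a minimal Python illustration, not a full crawler: it assumes the HTML has already been downloaded, and the `/articles/<id>-` URL pattern, the sample markup and the names `LinkCollector` and `matching_links` are all made up for the example. The same idea ports directly to PHP.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Make the link absolute and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def matching_links(html, base_url, pattern):
    """Return the unique links on a page whose URL matches `pattern`, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    seen = set()
    result = []
    for link in collector.links:
        if link not in seen and re.search(pattern, link):
            seen.add(link)
            result.append(link)
    return result


# Hypothetical page and URL pattern, for illustration only.
sample_html = """
<a href="/articles/123-first">First</a>
<a href="/about">About</a>
<a href="/articles/456-second#comments">Second</a>
<a href="/articles/123-first">First again</a>
"""
article_pattern = r"/articles/\d+-"
found = matching_links(sample_html, "http://example.com/", article_pattern)
```

Here `found` holds the two article URLs, each once; these are the rows that would be written to the URL table for a later process to pick up.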

Page scraping

Indexing the 'to-be-scraped' pages can be done in a similar way, as long as we know what regex to match the URLs with. We can save the list of URLs in a database.
Use a separate process that runs through all individual pages. Based on the URL, the scraper should know what regex to use to match the needed details on the page and write these to the database.
There are enough sites that index results already, so my guess is there should be a way to create a scraping algorithm that knows how to read the pages without having to match the regex completely. As I said before: if I have a full list of city names, there must be an option to use a search algorithm to get the city name without having to say the city name lies in "#content .about .city".
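The city-list idea from the last point could look roughly like the Python sketch below. It ignores the DOM structure entirely: it reduces the page to its visible text and searches that text for any name from a known list. The city list, the sample markup and the names `TextExtractor` and `find_cities` are assumptions for illustration, and the approach will produce false positives whenever a city name appears in unrelated text.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def find_cities(html, cities):
    """Return the known city names mentioned in the page text, in order of appearance."""
    if not cities:
        return []
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.parts)
    # Longest names first, so a longer name is preferred over a name that is its prefix.
    ordered = sorted(cities, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    canonical = {name.lower(): name for name in cities}
    found = []
    for match in pattern.finditer(text):
        name = canonical[match.group(0).lower()]
        if name not in found:
            found.append(name)
    return found


# Hypothetical city list and markup, for illustration only.
cities = ["Amsterdam", "Rotterdam", "Utrecht", "York", "New York"]
html = (
    '<div class="x"><span>Location:</span> <b>Rotterdam</b></div>'
    "<p>Also near new york.</p>"
    '<script>var city = "Utrecht";</script>'
)
matches = find_cities(html, cities)
```

In this sample `matches` contains Rotterdam and New York; Utrecht is skipped because it only occurs inside a script. A page that mentions several cities still needs some tie-breaking rule, such as taking the first match or the most frequent one.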

Data redundancy

An important part of the spider/crawler is to prevent it from indexing duplicate data. What I was hoping to do is to keep track of the time a crawler starts indexing a website and when it ends. I'd also keep track of the 'last update time' of each article (based on the URL of the article) and remove all articles whose last update is older than the starting time of the crawl, because as far as I can see these articles no longer exist.
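A minimal Python sketch of that pruning rule, using an in-memory SQLite database in place of MySQL. The table layout, the column names and the integer timestamps are assumptions for illustration. One caveat worth building in: only prune after a crawl that finished successfully, otherwise an aborted crawl would delete every article it did not get around to visiting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    " url TEXT PRIMARY KEY,"
    " site TEXT NOT NULL,"
    " title TEXT,"
    " last_seen INTEGER NOT NULL)"
)


def record_article(conn, site, url, title, seen_at):
    """Insert the article, or refresh its title and last_seen if the URL is known."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, site, title, last_seen)"
        " VALUES (?, ?, ?, ?)",
        (url, site, title, seen_at),
    )


def remove_stale(conn, site, crawl_started_at):
    """Delete articles of `site` that were not seen since the crawl started."""
    cursor = conn.execute(
        "DELETE FROM articles WHERE site = ? AND last_seen < ?",
        (site, crawl_started_at),
    )
    return cursor.rowcount


# Previous crawl (timestamp 100) saw articles a and b.
record_article(conn, "example.com", "http://example.com/a", "A", 100)
record_article(conn, "example.com", "http://example.com/b", "B", 100)

# New crawl starts at 200 and sees a and c, but not b.
crawl_started_at = 200
record_article(conn, "example.com", "http://example.com/a", "A", 210)
record_article(conn, "example.com", "http://example.com/c", "C", 220)

removed = remove_stale(conn, "example.com", crawl_started_at)
remaining = [row[0] for row in conn.execute("SELECT url FROM articles ORDER BY url")]
```

After the second crawl, article b is the only row removed, because it is the only one whose `last_seen` predates the crawl start.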

Handling data redundancy is easier with the page scraper, since my client made a list of "good sources" (read: pages with unique articles). It is harder for the vertical search engine, because the sites that are being indexed already make their own selection of articles from "good sources". So there's a chance that multiple sites have a selection from the same sources.
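For the "mostly the same text" case, one possible approach is a fuzzy similarity score with a cut-off. The sketch below uses Python's standard `difflib.SequenceMatcher`; the 0.8 threshold, the sample records and the helper names are assumptions for illustration, not a recommendation of this particular measure. Because titles alone can coincide for different articles, the compared text should probably include more than the title.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(a, b):
    """Similarity between 0.0 and 1.0 as computed by difflib."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def find_duplicates(records, threshold=0.8):
    """Return (i, j, score) for every pair of records at least `threshold` similar."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Hypothetical records, for illustration only.
records = [
    "Spacious two-bedroom apartment in Rotterdam city centre",
    "Spacious two bedroom apartment in Rotterdam city center",
    "Detached family house with garden in Utrecht",
]
duplicates = find_duplicates(records)
```

Only the first two records are reported as a pair. The pairwise loop grows quadratically with the number of records, so on a larger data set it would need to be restricted to candidates that already share something cheap to compare, for example the same city or date.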

How to make the results searchable

This is a question apart from how to crawl and scrape pages, because once all data is stored in the database, it needs to be searchable at high speed. The amount of data that is going to be saved is still unknown; based on some of the competition, my client had an indication of about 10,000 smaller records (vertical search) and maybe 4,000 larger records with more detailed information.

I understand that this is still a small amount compared to some databases you've possibly been working on. But in the end there might be up to 10-20 search fields that a user can use to find what they are looking for. With a high traffic volume and a lot of these searches, I can imagine that using regular MySQL queries for search isn't a clever idea.

So far I've found SphinxSearch and ElasticSearch. I haven't worked with either of them and haven't really looked into the possibilities of both; the only thing I know is that both should perform well with high volume and larger search queries within data.

To sum things up

To sum all things up, here's a shortlist of questions I have:

Is there an easy way to create a search algorithm that is able to match DOM data without having to specify the exact div the content lies within?
What is the best practice for crawling pages (links, title & description)?
Should I split crawling URLs and saving page title/description for speed?
Are there out-of-the-box solutions for PHP to find (possible) duplicate data in a database (even if there are minor differences, like: if 80% matches -> mark as duplicate)?
What is the best way to create a future-proof search engine for the data (keep in mind that the amount of data can increase, as well as the site traffic and search requests)?

I hope I made all things clear and I'm sorry for the huge amount of text. I guess it does show that I have already spent some time trying to figure things out myself.

Jonathan Crowe · Accepted Answer

I have experience building large-scale web scrapers and can testify that there will always be big challenges to overcome when undertaking this task. Web scrapers run into problems ranging from CPU issues to storage to network problems, and any custom scraper needs to be built modular enough to prevent changes in one part from breaking the application as a whole. In my projects I have taken the following approach:

Figure out where your application can be logically split up

For me this meant building 3 distinct sections:

Web Scraper Manager
Web Scraper
HTML Processor

The work could then be divided up like so:

1) The Web Scraper Manager

The Web Scraper Manager pulls URLs to be scraped and spawns Web Scrapers. It needs to flag every URL that has been sent to a scraper as "actively scraped" and know not to pull it again while it is in that state. Upon receiving a message from a scraper, the manager will either delete the row or leave it in the "actively scraped" state if no errors occurred; otherwise it will reset the row to "inactive".
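As a rough illustration of that state handling, here is a minimal Python sketch using an in-memory SQLite table. The table layout, the state constants and the function names are assumptions made for the example, and it takes the "delete on success" option. With several manager processes running against a shared database, the select-and-mark step would additionally need row locking so two managers cannot claim the same URL.

```python
import sqlite3

INACTIVE, ACTIVE = 0, 1  # hypothetical state values


def create_queue(conn):
    conn.execute(
        "CREATE TABLE urls ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT UNIQUE NOT NULL,"
        " state INTEGER NOT NULL DEFAULT 0)"
    )


def add_url(conn, url):
    """Queue a URL; a URL that is already queued is silently ignored."""
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))


def claim_urls(conn, limit):
    """Mark up to `limit` inactive URLs as actively scraped and return them."""
    with conn:  # select and update are committed together
        rows = conn.execute(
            "SELECT id, url FROM urls WHERE state = ? ORDER BY id LIMIT ?",
            (INACTIVE, limit),
        ).fetchall()
        conn.executemany(
            "UPDATE urls SET state = ? WHERE id = ?",
            [(ACTIVE, row_id) for row_id, _url in rows],
        )
    return rows


def report_result(conn, row_id, ok):
    """On success delete the row; on error reset it so it is pulled again."""
    with conn:
        if ok:
            conn.execute("DELETE FROM urls WHERE id = ?", (row_id,))
        else:
            conn.execute(
                "UPDATE urls SET state = ? WHERE id = ?", (INACTIVE, row_id)
            )


conn = sqlite3.connect(":memory:")
create_queue(conn)
for path in ("a", "b", "c"):
    add_url(conn, "http://example.com/" + path)

first_batch = claim_urls(conn, 2)   # a and b become "actively scraped"
second_batch = claim_urls(conn, 2)  # only c is still inactive
report_result(conn, first_batch[0][0], ok=True)   # a succeeded
report_result(conn, first_batch[1][0], ok=False)  # b failed
retry_batch = claim_urls(conn, 2)   # b is handed out again
```

In the run above, the failed URL is the only one handed out again, while the successful one has left the queue.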

2) The Web Scraper

The Web Scraper receives a URL to scrape, fetches it with cURL and downloads the HTML. All of this HTML can then be stored in a relational database with the following structure:

ID | URL | HTML (BLOB) | PROCESSING

Processing is an integer flag which indicates whether or not the data is currently being processed. This lets other parsers know not to pull the data if it is already being looked at.

3) The HTML Processor

The HTML Processor will continually read from the HTML table, marking rows as active every time it pulls a new entry. The HTML processor has the freedom to operate on the HTML for as long as needed to parse out any data. This can be links to other pages in the site which could be placed back in the URL table to start the process again, any relevant data (meta tags, etc.), images etc.
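The parsing step could be sketched as follows. This is a minimal Python example built on the standard library's `html.parser`, not the parser used in the projects described here; the class and function names and the sample page are assumptions for illustration. It pulls out the title, the named meta tags, the links (to feed back into the URL table) and the images.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Extracts title, named <meta> tags, links and images from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.meta = {}
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.page_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.page_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def process_html(page_url, html):
    """Return (document, links): the data to index and the URLs to queue."""
    parser = PageParser(page_url)
    parser.feed(html)
    parser.close()
    document = {
        "url": page_url,
        "meta": {
            "title": parser.title.strip(),
            "description": parser.meta.get("description", ""),
            "keywords": parser.meta.get("keywords", ""),
        },
        "images": parser.images,
    }
    return document, parser.links


# Hypothetical page, for illustration only.
sample = (
    "<html><head><title>The meta title from the page</title>"
    '<meta name="description" content="The meta description from the page">'
    '<meta name="keywords" content="the,keywords,for,this,page">'
    '</head><body><a href="/page2">Next</a><img src="image1.png"></body></html>'
)
document, links = process_html("http://example.com/", sample)
```

Relative link and image paths are resolved against the page URL here, so the queued links can be fetched without knowing which page they came from.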

Once all relevant data has been parsed out, the HTML Processor would send it to an ElasticSearch cluster. ElasticSearch provides lightning-fast full-text searches, which could be made even faster by splitting the data into separate fields:

{ 
   "url" : "http://example.com",
   "meta" : {
       "title" : "The meta title from the page",
       "description" : "The meta description from the page",
       "keywords" : "the,keywords,for,this,page"
   },
   "body" : "The body content in its entirety",
   "images" : [
       "image1.png",
       "image2.png"
   ]
}
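With documents in that shape, a search across several fields could be expressed as an ElasticSearch bool query. The Python sketch below only builds the request body; sending it to the cluster's search endpoint is left out. The `city` filter is a hypothetical extra field (assumed to be mapped as an exact-match keyword), and the boost on the title is an arbitrary choice for illustration.

```python
import json


def build_search_body(text, filters=None, size=10):
    """Build a bool query: full-text match on the page fields plus exact filters."""
    must = [
        {
            "multi_match": {
                "query": text,
                # "^2" weights title matches higher than the other fields.
                "fields": ["meta.title^2", "meta.description", "body"],
            }
        }
    ]
    filter_clauses = [
        {"term": {field: value}}
        for field, value in sorted((filters or {}).items())
    ]
    return {
        "size": size,
        "query": {"bool": {"must": must, "filter": filter_clauses}},
    }


# "city" is a hypothetical field that is not part of the document shown above.
body = build_search_body("two bedroom apartment", {"city": "Rotterdam"})
payload = json.dumps(body)
```

Each additional search field a user fills in becomes one more entry in `filters`, so a form with 10-20 fields maps onto a single query rather than a growing chain of SQL conditions.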

Now your website/service can have access to the latest data in real time. The parser would need to be robust enough to handle any errors, so that it can set the processing flag back to false if it cannot pull data out, or at least log the failure somewhere so it can be reviewed.

What are the advantages?

The advantage of this approach is that at any time, if you want to change the way you are pulling data, processing data or storing data, you can change just that piece without having to re-architect the entire application. Further, if one part of the scraper/application breaks, the rest can continue to run without any data loss and without stopping other processes.

What are the disadvantages?

It's a big, complex system. Any time you have a big, complex system you are asking for big, complex bugs. Unfortunately, web scraping and data processing are complex undertakings, and in my experience there is no way around having a complex solution to this particularly complex problem.

Tags: php, search, mysql, web-scraping, web-crawler

Joshua - Pendo
