What database for crawler/scraper?

I am currently researching what database to use for a project I am working on. Hopefully you guys can give me some hints.

The project is an automated web crawler that checks websites at a user's request, scrapes data under certain conditions, and creates log files of what was done.

Requirements:

  • Only a few tables with a few columns each; predefining columns is not a problem
  • No overly complex associations between models
  • A huge number of date- and time-based queries
  • Due to logging, the database will grow rapidly and use up a lot of space
  • Should be able to scale across multiple servers
  • Fields contain mostly ids (int), strings (around 200-500 characters max), and unix timestamps (see the sketch after this list)
  • Two different types of servers will simultaneously read/write data directly to/from it:
    • One (later more) Rails app that takes user input and displays results upon request
    • One (later more) Node.js server that functions as the executing crawler/scraper. It will be under enough load to run continuously, making dozens of database queries every second.
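
For concreteness, here is a minimal sketch of the record shape and the dominant query pattern; all names (LogEntry, crawledAt, inWindow) are hypothetical, chosen only to illustrate the requirements above, not taken from any actual schema.

    // A minimal sketch of the log-record shape described above.
    // All names are illustrative, not a real schema.
    interface LogEntry {
      id: number;        // integer id
      url: string;       // string field, ~200-500 characters max
      status: string;    // short result string, e.g. "scraped" or "skipped"
      crawledAt: number; // unix timestamp (seconds)
    }

    // The dominant query pattern: everything logged in a given time window.
    function inWindow(entries: LogEntry[], from: number, to: number): LogEntry[] {
      return entries.filter((e) => e.crawledAt >= from && e.crawledAt < to);
    }

    // Usage: all entries from the last hour.
    const now = Math.floor(Date.now() / 1000);
    const lastHour = inWindow([], now - 3600, now);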

I assume it will neither be a graph database (no complex associations) nor a memory-based key/value store (too much data to hold in cache). I'm still on the fence about every other type of database I could find; each seems to have its merits.

So, any advice from the pros on how I should decide?

Thanks.

asked Aug 12 '12 by KonstantinK

1 Answer

Google built a database called "BigTable" for crawling, indexing, and its search-related business. They released a paper about it (search for "BigTable" if you're interested). There are several open-source implementations of BigTable-like designs; one of them is Hypertable. We have a blog post describing a crawler/indexer implementation (http://hypertable.com/blog/sehrchcom_a_structured_search_engine_powered_by_hypertable/) written by the guys from sehrch.com. Looking at your requirements: all of them are supported and are common use cases (see the sketch below for how your time-based queries map onto this kind of store).
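
To give a feel for why time-based queries fit a BigTable-like store: rows are kept sorted by row key, so encoding the timestamp (and, following the BigTable paper's trick, a reversed hostname) into the key turns a time-window query into one contiguous range scan. The sketch below only illustrates the key layout; the helper names are made up for illustration and are not part of Hypertable's actual API.

    // A hedged sketch of a BigTable-style row-key layout for crawl logs.
    // Helper names are illustrative, not part of any real API.

    // Zero-pad the unix timestamp so string-sorted keys sort chronologically.
    function padTs(unixTs: number): string {
      return unixTs.toString().padStart(10, "0");
    }

    // Reverse the hostname ("www.example.com" -> "com.example.www") so that,
    // within one timestamp, pages of the same site sort together -- the same
    // trick the BigTable paper uses for its web table.
    function reverseHost(host: string): string {
      return host.split(".").reverse().join(".");
    }

    // Timestamp-first keys make a time-window query a contiguous range scan.
    function makeRowKey(unixTs: number, host: string): string {
      return `${padTs(unixTs)}:${reverseHost(host)}`;
    }

    // "Everything logged on Aug 12 '12 between 07:00 and 08:00 UTC" becomes
    // one sequential scan over the key range [start, end):
    const start = padTs(1344754800); // 2012-08-12 07:00:00 UTC
    const end = padTs(1344758400);   // 2012-08-12 08:00:00 UTC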

(Disclaimer: I work for Hypertable.)

answered Sep 25 '22 by cruppstahl