What database for crawler/scraper?

I am currently researching what database to use for a project I am working on. Hopefully you guys can give me some hints.

The project is an automated web crawler that checks websites at a user's request, scrapes data under certain conditions, and creates log files of what was done.

Requirements:

  • Only a few tables with a few columns each; predefining columns is not a problem
  • No overly complex associations between models
  • A huge number of date- and time-based queries
  • Due to logging, the database will grow rapidly and use up a lot of space
  • Should be able to scale across multiple servers
  • Fields contain mostly ids (int), strings (around 200-500 characters max), and unix timestamps (see the sketch after this list)
  • Two different types of servers will simultaneously read/write data directly to/from it:
    • One (later more) Rails app that takes user input and displays results upon request
    • One (later more) Node.js server that functions as the executing crawler/scraper. It will be under enough load to run continuously, making dozens of database queries every second.
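
For concreteness, here is a minimal sketch of the record shape and the dominant query pattern; all names (LogEntry, crawledAt, inWindow) are hypothetical, chosen only to illustrate the requirements above, not taken from any actual schema.

    // A minimal sketch of the log-record shape described above.
    // All names are illustrative, not a real schema.
    interface LogEntry {
      id: number;        // integer id
      url: string;       // string field, ~200-500 characters max
      status: string;    // short result string, e.g. "scraped" or "skipped"
      crawledAt: number; // unix timestamp (seconds)
    }

    // The dominant query pattern: everything logged in a given time window.
    function inWindow(entries: LogEntry[], from: number, to: number): LogEntry[] {
      return entries.filter((e) => e.crawledAt >= from && e.crawledAt < to);
    }

    // Usage: all entries from the last hour.
    const now = Math.floor(Date.now() / 1000);
    const lastHour = inWindow([], now - 3600, now);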

I assume it will neither be a graph database (no complex associations) nor a memory-based key/value store (too much data to hold in cache). I'm still on the fence about every other type of database I could find; each seems to have its merits.

So, any advice from the pros on how I should decide?

Thanks.

asked Aug 12 '12 by KonstantinK

1 Answer

Google built a database called "BigTable" for crawling, indexing, and its search-related business. They released a paper about it (search for "BigTable" if you're interested). There are several open-source implementations of BigTable-like designs; one of them is Hypertable. We have a blog post describing a crawler/indexer implementation (http://hypertable.com/blog/sehrchcom_a_structured_search_engine_powered_by_hypertable/) written by the guys from sehrch.com. Looking at your requirements: all of them are supported and are common use cases (see the sketch below for how your time-based queries map onto this kind of store).
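
To give a feel for why time-based queries fit a BigTable-like store: rows are kept sorted by row key, so encoding the timestamp (and, following the BigTable paper's trick, a reversed hostname) into the key turns a time-window query into one contiguous range scan. The sketch below only illustrates the key layout; the helper names are made up for illustration and are not part of Hypertable's actual API.

    // A hedged sketch of a BigTable-style row-key layout for crawl logs.
    // Helper names are illustrative, not part of any real API.

    // Zero-pad the unix timestamp so string-sorted keys sort chronologically.
    function padTs(unixTs: number): string {
      return unixTs.toString().padStart(10, "0");
    }

    // Reverse the hostname ("www.example.com" -> "com.example.www") so that,
    // within one timestamp, pages of the same site sort together -- the same
    // trick the BigTable paper uses for its web table.
    function reverseHost(host: string): string {
      return host.split(".").reverse().join(".");
    }

    // Timestamp-first keys make a time-window query a contiguous range scan.
    function makeRowKey(unixTs: number, host: string): string {
      return `${padTs(unixTs)}:${reverseHost(host)}`;
    }

    // "Everything logged on Aug 12 '12 between 07:00 and 08:00 UTC" becomes
    // one sequential scan over the key range [start, end):
    const start = padTs(1344754800); // 2012-08-12 07:00:00 UTC
    const end = padTs(1344758400);   // 2012-08-12 08:00:00 UTC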

(Disclaimer: I work for Hypertable.)

answered Sep 25 '22 by cruppstahl