
Creating an online search engine for a large XML database (10 GB – 1 TB of data)

I have been using Node.js to create a website that will eventually be able to search the Google Patent Grant database, which provides its data in XML format. I've been using MongoDB for the user database, but someone told me that they had a lot of difficulty creating a fast search engine with MongoDB, and that their database also grew very large. What database technology/software should I use in conjunction with Node.js to create an efficient search engine? Would it be a bad idea to have two different database technologies running for one website, e.g. MongoDB and PostgreSQL? I also found a technology called Norch on GitHub: https://github.com/fergiemcdowall/norch . Would this technology be helpful?

asked Nov 18 '25 by Daniel Kobe

2 Answers

You are going to have a hard time matching or beating Lucene at text search with either Postgres or MongoDB. Solr or Elasticsearch are therefore better options (they both use Lucene).
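Since the site runs on Node.js, querying Elasticsearch can be done over plain HTTP. Below is a minimal sketch assuming a local Elasticsearch node on the default port 9200; the index name `patents` and the field names are assumptions for illustration, not part of the original answer.

```javascript
// Sketch of full-text search against Elasticsearch from Node.js (18+,
// which ships a global fetch). Index and field names are hypothetical.

// Pure helper: build a query in the Elasticsearch query DSL that
// searches several text fields at once.
function buildSearchQuery(term) {
  return {
    query: {
      multi_match: {
        query: term,
        fields: ['title', 'abstract', 'claims'],
      },
    },
    size: 10, // return the top 10 hits
  };
}

// Send the query to a local Elasticsearch node (assumed at :9200).
async function searchPatents(term) {
  const res = await fetch('http://localhost:9200/patents/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSearchQuery(term)),
  });
  const data = await res.json();
  return data.hits.hits; // matching documents, ranked by relevance
}
```

Keeping the query builder separate from the HTTP call makes the search logic easy to test without a running cluster.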

That being said, most people still store their data somewhere other than the search index, and therefore implement some kind of synchronization between the search index and the data repository.
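A common synchronization pattern is a periodic one-way push: re-index only the rows that changed since the last run. A minimal sketch, in which the row shape (`updatedAt`) and the `fetchAllRows`/`indexDocument` callbacks are assumptions for illustration:

```javascript
// One-way sync from a data repository to a search index.

// Pure helper: pick the rows that changed after the last successful sync.
function rowsToReindex(rows, lastSyncedAt) {
  return rows.filter((row) => row.updatedAt > lastSyncedAt);
}

// Driver: push each changed row into the search index, then return a
// new high-water mark to use on the next run.
async function syncOnce(fetchAllRows, indexDocument, lastSyncedAt) {
  const changed = rowsToReindex(await fetchAllRows(), lastSyncedAt);
  for (const row of changed) {
    await indexDocument(row); // e.g. an HTTP request to Solr/Elasticsearch
  }
  return new Date();
}
```

In production this is usually driven by a scheduler or by change events from the database rather than a full table scan, but the high-water-mark idea is the same.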

Edit based on comment:

An example combination would be Solr and Postgres: Solr would be your search engine and Postgres would be your data repository. You could then use Solr's DataImportHandler to pull the data from Postgres into the index.
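For illustration, a DataImportHandler setup is configured in a `data-config.xml` file. This is a hedged sketch: the database name, table, and column names below are assumptions, not from the original answer.

```xml
<!-- Hypothetical data-config.xml for Solr's DataImportHandler.
     Connection details, table, and columns are illustrative only. -->
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/patents"
              user="solr" password="..."/>
  <document>
    <entity name="patent"
            query="SELECT id, title, abstract FROM patents"
            deltaQuery="SELECT id FROM patents WHERE updated_at &gt; '${dataimporter.last_index_time}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="abstract" name="abstract"/>
    </entity>
  </document>
</dataConfig>
```

The `query` attribute does a full import, while `deltaQuery` lets Solr pull only rows changed since the last import, which matters at the 10 GB – 1 TB scale mentioned in the question.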

answered Nov 19 '25 by Adam Gent

Author of Norch here.

At the moment, Solr and Elasticsearch are probably the most widely used search technologies, and with good reason: they are now very mature, powerful, and user-friendly.

Norch is a good fit for the following scenarios:

  1. If you have a requirement that your technology stack be JavaScript, then Java (Solr, Elasticsearch) is out. Norch lets you run everything in JavaScript.

  2. If you want to run a search engine on really low-end hardware: Norch has ridiculously low system requirements, especially for smaller datasets.

  3. "Offline first" web pages. Norch allows you to replicate a search index into a user's browser. People are still figuring out the best ways and times to do this, but this ability to replicate itself easily onto client machines is what sets Norch apart from competing projects.

  4. If you have a corpus that you want to share. Rather than sharing, say, 1 million files, you could index them into Norch, replicate the index, and share the replication file. You can email it, torrent it, or put it on the web. Norch is quite good at replicating indexes.

There are also some other corner cases where Norch is good or even best, but those mentioned above are the main ones.

answered Nov 19 '25 by Fergie