Best way to index database table data in Solr?

I have a table with around 100,000 rows at the moment. I want to index the data from this table in a Solr index.

So the naive method would be to:

  • Get all the rows
  • For each row: convert it to a SolrDocument and add the document to a single request
  • Once all rows are converted then post the request

Some problems with this approach that I can think of are:

  • Loading too much data (the content of the whole table) into memory
  • POSTing a big request

However, some advantages:

  • Only one request to the Database
  • Only one POST request to Solr

I can see that this approach is not scalable: as the table grows, so will the memory requirements and the size of the POST request. Should I instead take n rows, process them, and then take the next n?
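
For reference, something along these lines is what I mean (a rough sketch using JDBC and the SolrJ client; the connection details, table, and field names are just placeholders):

```java
// Fetch n rows at a time, convert each batch to Solr documents, POST, repeat.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    private static final int BATCH_SIZE = 1000; // n rows per POST

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             HttpSolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/mycore").build();
             Statement st = db.createStatement()) {

            // Hint the driver to stream rows instead of buffering the whole table.
            st.setFetchSize(BATCH_SIZE);
            try (ResultSet rs = st.executeQuery("SELECT id, title, body FROM items")) {
                List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("title", rs.getString("title"));
                    doc.addField("body", rs.getString("body"));
                    batch.add(doc);
                    if (batch.size() == BATCH_SIZE) {
                        solr.add(batch); // one POST per n documents
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    solr.add(batch); // flush the remainder
                }
                solr.commit(); // single commit at the end
            }
        }
    }
}
```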

I'm wondering if anyone has any advice about how best to implement this?

(P.S. I did search the site but didn't find any similar questions.)

Thanks.

asked by C0deAttack

2 Answers

If you want to balance between POSTing all documents at once and doing one POST per document, you could use a queue to collect documents and run a separate thread that sends them once enough have accumulated. This way you can manage the memory vs. request-time trade-off.
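
A rough sketch of that idea, assuming the SolrJ client (the class name, batch size, and queue bound are all illustrative):

```java
// Producers enqueue documents; one consumer thread drains the queue and
// POSTs a batch to Solr once enough documents have accumulated.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class QueuedIndexer implements Runnable {
    private static final int BATCH_SIZE = 500;

    // Bounded queue: producers block when it is full, which caps memory use.
    private final BlockingQueue<SolrInputDocument> queue = new LinkedBlockingQueue<>(10_000);
    private final HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
    private volatile boolean done = false;

    /** Called by the producer for each document. */
    public void submit(SolrInputDocument doc) throws InterruptedException {
        queue.put(doc);
    }

    /** Called once by the producer when there is nothing left to submit. */
    public void finish() {
        done = true;
    }

    @Override
    public void run() {
        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        try {
            while (!done || !queue.isEmpty()) {
                SolrInputDocument doc = queue.poll(100, TimeUnit.MILLISECONDS);
                if (doc != null) {
                    batch.add(doc);
                }
                // Send a batch when it is full, or flush the tail while draining.
                if (batch.size() >= BATCH_SIZE
                        || (done && queue.isEmpty() && !batch.isEmpty())) {
                    solr.add(batch); // one POST per batch
                    batch.clear();
                }
            }
            solr.commit(); // single commit once everything has been sent
            solr.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The producer (e.g. the code reading rows from the database) would start this on its own thread, call submit() per document, then call finish() and join the thread once the table has been read.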

answered by nfechner


I went with the suggestion from nikhil500's comment, which recommended Solr's DataImportHandler (DIH):

DIH does support many transformers. You can also write custom transformers. I will recommend using DIH if possible - I think it will need the least amount of coding and will be faster than POSTing the documents. – nikhil500 Feb 6 at 17:42
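
For anyone else taking this route, here is roughly what a minimal DIH configuration looks like (a data-config.xml sketch; the JDBC details and the table, column, and field names are placeholders, not my actual setup):

```xml
<!-- data-config.xml: DIH pulls rows straight from the database into the index. -->
<dataConfig>
  <!-- JDBC connection details are placeholders; use your own driver/url/credentials. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="user"
              password="pass"/>
  <document>
    <!-- Each row returned by the query becomes one Solr document. -->
    <entity name="item" query="SELECT id, title, body FROM items">
      <field column="id"    name="id"/>
      <field column="title" name="title"/>
      <field column="body"  name="body"/>
    </entity>
  </document>
</dataConfig>
```

After registering the DataImportHandler in solrconfig.xml and pointing it at this file, a full import can be triggered by calling the /dataimport handler with command=full-import. Custom transformers can be attached to the entity via its transformer attribute.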

answered by C0deAttack