Preferred method of indexing bulk data into ElasticSearch?

I've been looking at ElasticSearch as a solution to get better search and analytics functionality at my company. All of our data is in SQL Server at the moment, and I've successfully installed the JDBC River and gotten some test data into ES.

Rivers seem like they may be deprecated in future releases, and the JDBC river is maintained by a third party. Logstash doesn't seem to support indexing from SQL Server yet (I don't know if it's a planned feature).

So for my situation, where I want to move data from SQL Server to ElasticSearch, what's the preferred method of indexing the data and maintaining the index as SQL Server gets updated with new data?

From the linked thread:

We recommend that you own your indexing process out-of-band from ES and make sure it scales with your needs.

I'm not quite sure where to start with this. Is it on me to use one of the APIs ES provides?

— Cuthbert, asked Mar 06 '14



2 Answers

We use RabbitMQ to pipe data from SQL Server to ES. That way Rabbit takes care of the queuing and processing.

As a note, we can push over 4,000 records per second from SQL Server into Rabbit. We do a bit more processing before putting the data into ES, but we still insert into ES at over 1,000 records per second. Pretty damn impressive on both ends. Rabbit and ES are both awesome!
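For illustration, here's a minimal sketch of what the consuming side of such a pipe could look like in Python, assuming the pika and elasticsearch client libraries; the queue name, index name, and message format are all made up:

    import json

    import pika
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="sql-to-es", durable=True)

    BATCH_SIZE = 500
    batch = []

    def on_message(ch, method, properties, body):
        # Each message is assumed to be one row from SQL Server, serialized as JSON.
        doc = json.loads(body)
        batch.append({"_index": "products", "_id": doc["Id"], "_source": doc})
        if len(batch) >= BATCH_SIZE:
            # One bulk request instead of BATCH_SIZE single-document index calls.
            helpers.bulk(es, batch)
            batch.clear()
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="sql-to-es", on_message_callback=on_message)
    channel.start_consuming()

A production consumer would also flush partial batches on a timer and ack only after a successful bulk call, but the shape is the same: Rabbit absorbs the bursts, and ES receives batched writes.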

— jhilden, answered Oct 19 '22


There are a lot of things you can do. You could put your data in RabbitMQ or Redis, but your main problem is staying up to date. I suggest you look into an event-based application. But if you really only have SQL Server as a data source, you could work with timestamps and a query that checks for updates (a sketch follows below). Depending on the size of your database, you could also just reindex the complete dataset.
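As a rough sketch of the timestamp approach, assuming pyodbc and a hypothetical Products table with a LastModified column:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=Shop;Trusted_Connection=yes;"
    )

    def fetch_changes(since):
        # Pull every row touched after the last checkpoint. In practice you
        # would persist `since` (e.g. in a small state table) between runs.
        cursor = conn.cursor()
        cursor.execute(
            "SELECT Id, Name, Price, LastModified "
            "FROM Products "
            "WHERE LastModified > ? "
            "ORDER BY LastModified",
            since,
        )
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]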

Using events or the query-based solution, you can push these updates to Elasticsearch, probably using the bulk API.
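Continuing the hypothetical example above, the changed rows could then go to ES in a single request via the bulk helper of the elasticsearch Python client:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def push_changes(rows):
        # One bulk action per changed row, keyed on the SQL primary key so a
        # re-indexed row overwrites the old document instead of duplicating it.
        actions = (
            {"_index": "products", "_id": row["Id"], "_source": row}
            for row in rows
        )
        # Returns (number of successes, list of errors).
        return helpers.bulk(es, actions)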

The good part about a custom solution like this is that you can think about your mapping. This is important if you really want to do something smart with your data.
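For instance, you could create the index with an explicit mapping up front instead of relying on dynamic mapping. A sketch using the keyword-argument style of the 8.x Python client, with the same hypothetical field names:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.create(
        index="products",
        mappings={
            "properties": {
                "Name": {"type": "text"},  # analyzed, full-text searchable
                "Price": {"type": "scaled_float", "scaling_factor": 100},
                "LastModified": {"type": "date"},
            }
        },
    )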

— Jettro Coenradie, answered Oct 19 '22