Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

logstach: jdbc_page_size doesn't dump all my data to elastic search

I'd like to export tom_test2 postgresql table to elastic search. The table has 176805 rows:

=> select count(*) from tom_test2;
 count  
--------
 176805
(1 row)

The following logstach conf file import correctly my data to elastic search:

input {
    jdbc {
        # Postgres jdbc connection string to our database, mydb
        jdbc_connection_string => "xxx"
        # The user we wish to execute our statement as
        jdbc_user => "xxx"
        jdbc_password => "xxx"
        # The path to our downloaded jdbc driver
        jdbc_driver_library => "xxx"
        # The name of the driver class for Postgresql
        jdbc_driver_class => "org.postgresql.Driver"
        # our query
        statement => "select * from tom_test2"
    }
}


output {
    elasticsearch {
        hosts => ["xxx"]
        index => "tom"
        document_type => "tom_test"
    }
}

In elastic search:

GET tom/tom_test/_search

  "hits": {
    "total": 176805,
    "max_score": 1,
}

I'm deleting my index in elastic search:

delete tom

And I now would like to do the same operation using jdbc_page_size in case my data becomes bigger, my logstach conf file is now:

input {
    jdbc {
        # Postgres jdbc connection string to our database, mydb
        jdbc_connection_string => "xxx"
        # The user we wish to execute our statement as
        jdbc_user => "xxx"
        jdbc_password => "xxx"
        # The path to our downloaded jdbc driver
        jdbc_driver_library => "xxx"
        # The name of the driver class for Postgresql
        jdbc_driver_class => "org.postgresql.Driver"
        # our query
        statement => "select * from tom_test2"

        jdbc_page_size => 1000
        jdbc_paging_enabled => true
    }
}


output {
    elasticsearch {
        hosts => ["xxx"]
        index => "tom"
        document_type => "tom_test"
    }
}

My count is now wrong:

GET tom/tom_test/_search

  "hits": {
    "total": 106174,
    "max_score": 1,
}

as 176805-106174=70631 rows are missing

like image 298
woshitom Avatar asked Sep 23 '17 18:09

woshitom


2 Answers

The reason you are facing this - you have ordering problems: your query doesn't controlls the order in which the data is received, and in general postgresql should not guarantie that in unordered consequent paging calls you don't fetch the same data: this produces situation when some data will be not fetched at all, and some data will be fetched multiple times :( even when the data is not modified during these calls, the background vacuum worker may change the order of the data in the physical file, and thus reproduce described situation.

Either add order to your statement SELECT * FROM tom_test2 ORDER BY id and page your data. But be aware: in this case your upload to elasticsearch will not assure the exact replica of the table at moment of time. The cause of that will be, that during logstash processing of consequent paging request the update of data in upcoming page introduced, i.e. you are uploading at the moment page 1 to 10000 and update happened at data on page 10001 and 20000, and then later otherwise... so you have problem in consistency of your data.

Or if you want to fetch all the data and generously use memory on logstash... , then you need to control the jdbc_fetch_size parameter: i.e. you are performing the same SELECT * FROM tom_test2. With this approach you will create a single query resultset, but will "pump" it in pieces, and data modification during your "pumping" will not cause you: you will be fetching the state at the moment of query start.

like image 200
Ilya Dyoshin Avatar answered Oct 08 '22 03:10

Ilya Dyoshin


Because ordering is not guaranteed between queries in jdbc_page_size as WARNED in the documentation of jdbc_paging_enabled.

I recommend using jdbc_fetch_size instead of using jdbc_page_size as the documentation also says that for large result-sets.

P.S: sometimes ;) asking your questions at http://discuss.elastic.co is better answered by elastic maintainers

like image 33
mjalil Avatar answered Oct 08 '22 04:10

mjalil