I'd like to export the tom_test2 PostgreSQL table to Elasticsearch. The table has 176805 rows:
=> select count(*) from tom_test2;
count
--------
176805
(1 row)
The following Logstash conf file imports my data into Elasticsearch correctly:
input {
    jdbc {
        # Postgres jdbc connection string to our database, mydb
        jdbc_connection_string => "xxx"
        # The user we wish to execute our statement as
        jdbc_user => "xxx"
        jdbc_password => "xxx"
        # The path to our downloaded jdbc driver
        jdbc_driver_library => "xxx"
        # The name of the driver class for Postgresql
        jdbc_driver_class => "org.postgresql.Driver"
        # our query
        statement => "select * from tom_test2"
    }
}
output {
    elasticsearch {
        hosts => ["xxx"]
        index => "tom"
        document_type => "tom_test"
    }
}
In Elasticsearch:
GET tom/tom_test/_search
"hits": {
"total": 176805,
"max_score": 1,
}
I then delete my index in Elasticsearch:
DELETE tom
I would now like to do the same operation using jdbc_page_size, in case my data becomes bigger. My Logstash conf file is now:
input {
    jdbc {
        # Postgres jdbc connection string to our database, mydb
        jdbc_connection_string => "xxx"
        # The user we wish to execute our statement as
        jdbc_user => "xxx"
        jdbc_password => "xxx"
        # The path to our downloaded jdbc driver
        jdbc_driver_library => "xxx"
        # The name of the driver class for Postgresql
        jdbc_driver_class => "org.postgresql.Driver"
        # our query
        statement => "select * from tom_test2"
        jdbc_page_size => 1000
        jdbc_paging_enabled => true
    }
}
output {
    elasticsearch {
        hosts => ["xxx"]
        index => "tom"
        document_type => "tom_test"
    }
}
My count is now wrong:
GET tom/tom_test/_search
"hits": {
"total": 106174,
"max_score": 1,
}
as 176805 - 106174 = 70631 rows are missing.
The reason you are facing this is an ordering problem: your query does not control the order in which the rows are returned, and in general PostgreSQL does not guarantee that consecutive, unordered paging calls won't fetch overlapping data. This produces a situation where some rows are never fetched at all and others are fetched multiple times :( Even when the data is not modified between these calls, the background vacuum worker may change the physical order of the rows in the table file and thus reproduce the described situation.
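To illustrate: with paging enabled, the plugin breaks your statement into multiple queries using limits and offsets (this is what the jdbc_paging_enabled documentation describes; the SQL below is only a rough approximation of what is actually generated):

select * from tom_test2 LIMIT 1000 OFFSET 0;
select * from tom_test2 LIMIT 1000 OFFSET 1000;
...

Without an ORDER BY, PostgreSQL may return the rows of each of these queries in a different order, so the page boundaries do not line up: some rows fall into no page at all and others into several.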
Either add an ORDER BY to your statement, SELECT * FROM tom_test2 ORDER BY id, and page your data. But be aware: in this case your upload to Elasticsearch will not be an exact replica of the table at a single moment in time. The cause is that, while Logstash processes consecutive paging requests, the data on an upcoming page may be updated, i.e. while you are uploading rows 1 to 10000, an update happens to rows on pages 10001 to 20000, and later the other way around, so the consistency of your data is not guaranteed.
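A minimal sketch of the input block for this first option (it assumes the table has a unique id column to sort on; the remaining jdbc settings stay as in your config):

input {
    jdbc {
        # ... same connection and driver settings as above ...
        # ordering keeps consecutive pages stable
        statement => "select * from tom_test2 order by id"
        jdbc_paging_enabled => true
        jdbc_page_size => 1000
    }
}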
Or, if you want to fetch all the data and can afford the memory on the Logstash side, control the jdbc_fetch_size parameter instead, while keeping the same statement, SELECT * FROM tom_test2. With this approach you create a single query result set but "pump" it in pieces, and data modifications during the "pumping" will not affect you: you will be fetching the state as of the moment the query started.
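A minimal sketch of that second variant (the fetch size of 1000 is just an assumption, tune it to your memory budget; the other jdbc settings stay unchanged):

input {
    jdbc {
        # ... same connection and driver settings as above ...
        # single query, streamed from the server in chunks of 1000 rows
        statement => "select * from tom_test2"
        jdbc_fetch_size => 1000
    }
}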
Because ordering is not guaranteed between the queries issued when paging with jdbc_page_size, as warned in the documentation of jdbc_paging_enabled, I recommend using jdbc_fetch_size instead of jdbc_page_size, which is also what the documentation suggests for large result sets.
P.S.: sometimes ;) questions asked at http://discuss.elastic.co get better answers from the Elastic maintainers.