We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?
CSV data file (pipe delimited) looks like this:
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My logstash config looks like this:
input {
  stdin {
    type => "stdin-type"
  }
  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}
filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}
output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}
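To run the pipeline, point Logstash at the config file. A minimal sketch, assuming the config above is saved as sample.conf (the filename is only illustrative):

# Run Logstash with the config file (older 1.x releases use the "agent" subcommand)
bin/logstash agent -f sample.conf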
A document in Elasticsearch then looks like this:
{
  "_index": "samples4",
  "_type": "sample",
  "_id": "64Dc0_1eQ3uSln_k-4X26A",
  "_score": 1.4054651,
  "_source": {
    "message": [
      "369|90045|123 ABC ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-11T22:58:38.365Z",
    "host": "[host]",
    "path": "C:/Data/sample/sample.csv",
    "property_id": "369",
    "postal_code": "90045",
    "address_1": "123 ABC ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
I would like the unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files would contain updates. I don't need to keep previous versions, and there wouldn't be a case where we added or removed keys from a document.
The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just inserted the literal string "property_id", so only one document was stored/updated). I know I'm missing something here. Am I just taking the wrong approach?
EDIT: WORKING!
Using @rutter's suggestion, I've updated the output config to this:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}
Now documents are updated as expected by dropping new files into the data folder, and _id and property_id hold the same value.
{
  "_index": "samples6",
  "_type": "sample",
  "_id": "351",
  "_score": 1,
  "_source": {
    "message": [
      "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
    ],
    "@version": "1",
    "@timestamp": "2014-02-12T16:12:52.102Z",
    "host": "TXDFWL3474",
    "path": "C:/Data/sample/sample_update_3.csv",
    "property_id": "351",
    "postal_code": "90045",
    "address_1": "Easy as 123 ST",
    "city": "LOS ANGELES",
    "state_code": "CA"
  }
}
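As a quick check, a single document can now be fetched directly by its property_id. A sketch assuming Elasticsearch is running locally on the default port:

# Fetch the document whose _id is 351 (index/type names taken from the config above)
curl -XGET 'http://localhost:9200/samples6/sample/351'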
For small changes to an index or its settings, you can use the update APIs: index settings such as the number of replicas or the refresh interval can be changed in place, and documents can be updated or have fields added via the Update API in Elasticsearch.
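For example, a settings change like the replica count can be applied without reindexing. A sketch assuming the samples6 index from above and a local node:

# Set the replica count of the samples6 index to 0
curl -XPUT 'http://localhost:9200/samples6/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'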
Elasticsearch also has an Update API that can be used to process updates and deletes. It reduces the number of network round trips and the potential for version conflicts: it retrieves the existing document from the index, applies the change, and then indexes the data again.
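A partial-update sketch against the samples6/sample/351 document shown earlier; the field and value here are only illustrative:

# Merge a partial document into the existing document with _id 351
curl -XPOST 'http://localhost:9200/samples6/sample/351/_update' -d '{
  "doc": { "address_1": "123 EASY ST" }
}'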
Converting from comment:
You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.
You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you can use an sprintf format string such as %{property_id}.
Something like this, for example:
output {
  elasticsearch {
    ... other settings...
    document_id => "%{property_id}"
  }
}
Disclaimer: I'm the author of ESL.
You can use elasticsearch_loader to load pipe-delimited (PSV) files into Elasticsearch.
To set the _id field, you can use --id-field=property_id. For instance:
elasticsearch_loader --index=myindex --type=mytype --id-field=property_id csv --delimiter='|' filename.csv