Importing and updating data in Elasticsearch

Tags: csv, elasticsearch, logstash

We have an existing search function that queries data across multiple tables in SQL Server. This puts a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week, using an import containing 1.2 million records. My question is essentially: how do I update existing documents using my "primary key"?

CSV data file (pipe delimited) looks like this:

369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA

My logstash config looks like this:

input {
  stdin {
    type => "stdin-type"
  }

  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}

filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}

output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}
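For readers who want to check the column mapping outside Logstash, here is a minimal Python sketch of roughly what the csv filter above does to one line. The names `parse_line` and `COLUMNS` are made up for illustration, and stripping the trailing carriage return is an assumption about how the Windows line endings should be handled, not a description of Logstash internals.

```python
import csv

# Same column order as the csv filter in the config above.
COLUMNS = ["property_id", "postal_code", "address_1", "city", "state_code"]

def parse_line(line, columns=COLUMNS, separator="|"):
    """Split one pipe-delimited line into a dict keyed by column name."""
    # Drop the trailing CR/LF that Windows-style files leave on each line.
    row = next(csv.reader([line.rstrip("\r\n")], delimiter=separator))
    return dict(zip(columns, row))

event = parse_line("369|90045|123 ABC ST|LOS ANGELES|CA\r")
print(event["property_id"], event["city"])  # 369 LOS ANGELES
```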

A document in Elasticsearch then looks like this:

{
   "_index": "samples4",
   "_type": "sample",
   "_id": "64Dc0_1eQ3uSln_k-4X26A",
   "_score": 1.4054651,
   "_source": {
      "message": [
         "369|90045|123 ABC ST|LOS ANGELES|CA\r"
      ],
      "@version": "1",
      "@timestamp": "2014-02-11T22:58:38.365Z",
      "host": "[host]",
      "path": "C:/Data/sample/sample.csv",
      "property_id": "369",
      "postal_code": "90045",
      "address_1": "123 ABC ST",
      "city": "LOS ANGELES",
      "state_code": "CA"
   }
}

I think I would like the unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files would contain updates. I don't need to keep previous versions, and there wouldn't be a case where we added or removed keys from a document.

The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just used the literal string "property_id" as the ID, so only one document was stored and updated). I know I'm missing something here. Am I just taking the wrong approach?

EDIT: WORKING!

Using @rutter's suggestion, I've updated the output config to this:

output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}

Now documents update as expected when I drop new files into the data folder, and _id and property_id hold the same value.

{
   "_index": "samples6",
   "_type": "sample",
   "_id": "351",
   "_score": 1,
   "_source": {
      "message": [
         "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
      ],
      "@version": "1",
      "@timestamp": "2014-02-12T16:12:52.102Z",
      "host": "TXDFWL3474",
      "path": "C:/Data/sample/sample_update_3.csv",
      "property_id": "351",
      "postal_code": "90045",
      "address_1": "Easy as 123 ST",
      "city": "LOS ANGELES",
      "state_code": "CA"
   }
}

asked Feb 12 '14 00:02

Adrian J. Moreno

2 Answers

Converting from comment:

You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.

You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use an sprintf format string, such as %{property_id}.

Something like this, for example:

output {
  elasticsearch {
    ... other settings...
    document_id => "%{property_id}"
  }
}
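To see why the literal string collapsed everything into one document while the sprintf form did not, here is a small Python simulation. It is only a sketch: `render` and `index_all` are hypothetical helpers that mimic the idea of `%{field}` substitution and overwrite-by-ID, the third sample event is an invented update, and leaving unknown references unchanged is an assumption of this sketch rather than a statement about Logstash.

```python
import re

def render(template, event):
    """Replace each %{name} in template with the event's value for that field."""
    # Assumption for this sketch: references to missing fields are left unchanged.
    return re.sub(
        r"%\{(\w+)\}",
        lambda m: str(event.get(m.group(1), m.group(0))),
        template,
    )

def index_all(events, document_id):
    """Simulate indexing: a later document with the same _id replaces the earlier one."""
    index = {}
    for event in events:
        index[render(document_id, event)] = event
    return index

events = [
    {"property_id": "369", "address_1": "123 ABC ST"},
    {"property_id": "368", "address_1": "PVKA0010"},
    {"property_id": "369", "address_1": "ABC ST 123"},  # invented update to 369
]

literal = index_all(events, "property_id")       # every event gets the same _id
by_field = index_all(events, "%{property_id}")   # _id comes from the field value

print(sorted(literal))               # ['property_id']
print(sorted(by_field))              # ['368', '369']
print(by_field["369"]["address_1"])  # ABC ST 123
```

With the literal string, each event overwrites the previous one under the single ID "property_id", which matches the "only stored/updated one document" behaviour described in the question.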

answered Sep 30 '22 22:09

rutter

Disclaimer: I'm the author of elasticsearch_loader (ESL).
You can use elasticsearch_loader to load PSV (pipe-separated) files into Elasticsearch.
To set the _id field, use --id-field=property_id. For instance:
elasticsearch_loader --index=myindex --type=mytype --id-field=property_id csv --delimiter='|' filename.csv
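If neither tool is available, the same idea can be sketched by hand: build a request body for Elasticsearch's bulk API in which each action line carries an explicit _id taken from property_id. This is a hedged illustration, not how elasticsearch_loader works internally. `bulk_body` and `COLUMNS` are hypothetical names, the index name is taken from the question, and the mapping type is omitted on the assumption that a recent Elasticsearch version (which no longer uses types) is the target.

```python
import csv
import io
import json

COLUMNS = ["property_id", "postal_code", "address_1", "city", "state_code"]

def bulk_body(text, index, id_field, columns=COLUMNS, delimiter="|"):
    """Turn pipe-delimited text into a newline-delimited JSON bulk body."""
    reader = csv.DictReader(
        io.StringIO(text, newline=""), fieldnames=columns, delimiter=delimiter
    )
    lines = []
    for row in reader:
        # Action line: "index" creates the document or replaces one with the same _id.
        lines.append(json.dumps({"index": {"_index": index, "_id": row[id_field]}}))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # The bulk format expects the body to end with a newline.
    return "\n".join(lines) + "\n"

sample = "369|90045|123 ABC ST|LOS ANGELES|CA\n368|90045|PVKA0010|LA|CA\n"
print(bulk_body(sample, index="samples6", id_field="property_id"))
```

The resulting text would be sent to the cluster's _bulk endpoint with a newline-delimited JSON content type. Re-sending a row with the same property_id replaces the earlier document.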

answered Sep 30 '22 20:09

MosheZada
