How to clear all data from AWS CloudSearch?

I have an AWS CloudSearch instance that I am still developing.

At times, such as when I change the format of a field, I find myself wanting to wipe out all of the data and regenerate it.

Is there any way to clear out all of the data using the console, or do I have to do it programmatically?

If I do have to do it programmatically (i.e. generate and POST a bunch of "delete" SDF files), is there any good way to query for all documents in a CloudSearch instance?

I guess I could just delete and re-create the instance, but that takes a while and loses all of the indexes, rank expressions, text options, etc.

asked Jul 09 '13 20:07 by biggusjimmus



3 Answers

Using aws and jq from the command line (tested with bash on mac):

export CS_DOMAIN=https://yoursearchdomain.yourregion.cloudsearch.amazonaws.com

# Get ids of all existing documents, reformat as
# [{ type: "delete", id: "ID" }, ...] using jq
aws cloudsearchdomain search \
  --endpoint-url=$CS_DOMAIN \
  --size=10000 \
  --query-parser=structured \
  --search-query="matchall" \
  | jq '[.hits.hit[] | {type: "delete", id: .id}]' \
  > delete-all.json

# Delete the documents
aws cloudsearchdomain upload-documents \
  --endpoint-url=$CS_DOMAIN \
  --content-type='application/json' \
  --documents=delete-all.json

For more info on jq see Reshaping JSON with jq

Update Feb 22, 2017

Added --size to get the maximum number of documents (10,000) at a time. If the domain holds more documents than that, rerun the script until the search returns no hits. Also, --search-query can take something more specific if you want to be selective about which documents get deleted.
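For anyone who would rather script the "repeat until empty" part, here is a minimal Node.js sketch of the same idea. It assumes you supply two async functions of your own, `search` and `upload` (both names are hypothetical, not part of any AWS library), that wrap the two commands above. `toDeleteBatch` does what the jq filter does; `deleteAll` and `maxRounds` are likewise made up for illustration.

```javascript
// Sketch only: `search` and `upload` are hypothetical async wrappers that you
// write around your own CloudSearch calls (CLI or SDK).

// Turn a search response into [{ type: "delete", id: "ID" }, ...],
// the same reshaping the jq filter performs.
function toDeleteBatch(searchResponse) {
  return searchResponse.hits.hit.map(function (hit) {
    return { type: 'delete', id: hit.id };
  });
}

// Search and delete repeatedly until a search returns no hits,
// or until maxRounds is reached (a safety stop against looping forever).
// Returns the number of delete operations submitted, not confirmed removals.
async function deleteAll(search, upload, maxRounds) {
  var submitted = 0;
  for (var round = 0; round < maxRounds; round++) {
    var batch = toDeleteBatch(await search());
    if (batch.length === 0) {
      break;
    }
    await upload(batch);
    submitted += batch.length;
  }
  return submitted;
}
```

Deletes may not show up in search results right away, so a real `search` wrapper would probably want a pause between rounds; that part is left out here.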

answered Sep 30 '22 01:09 by Kevin Tonon


Best answer I've been able to find was somewhat buried in the AWS docs. To wit:

Amazon CloudSearch currently does not provide a mechanism for deleting all of the documents in a domain. However, you can clone the domain configuration to start over with an empty domain. For more information, see Cloning an Existing Domain's Indexing Options.

Source: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/Troubleshooting.html#ts.cleardomain

answered Sep 29 '22 23:09 by biggusjimmus


On my side, I used a local Node.js script like this:

var AWS = require('aws-sdk');
var fs = require('fs');

// Credentials and region for the SDK
AWS.config.update({
    accessKeyId: '<your AccessKey>',
    secretAccessKey: '<your secretAccessKey>',
    region: '<your region>'
});

// Structured query that matches every known value of one facet field.
// Without size, a search returns only the first 10 hits.
var params = {
    query: "(or <FIELD>:'<one facet value>' <FIELD>:'<another facet value>')",
    queryParser: 'structured',
    size: 10000
};

// The client is configured with the domain endpoint, not the search parameters
var cloudsearchdomain = new AWS.CloudSearchDomain({
    endpoint: '<your CloudSearch endpoint>'
});

cloudsearchdomain.search(params, function(err, data) {
    if (err) {
        console.log("Failed");
        console.log(err);
        return;
    }

    // Build one delete operation per returned document
    var result = [];
    for (var i = 0; i < data.hits.hit.length; i++) {
        result.push({"type": "delete", "id": data.hits.hit[i].id});
    }

    fs.writeFile("delete.json", JSON.stringify(result), function(err) {
        if (err) { return console.log(err); }
        console.log("The file was saved!");
    });
});

You have to know all the values of at least one facet field to be able to request all IDs. In my code, I just put 2 values (in the (or ...) section), but you can have more.

Once it is done, you have one delete.json file to use with the AWS CLI via this command:

aws cloudsearchdomain upload-documents --documents delete.json --content-type application/json --endpoint-url <your CloudSearch endpoint>

That did the job for me!
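If the facet has many values, the (or ...) query in the script can be assembled from a list instead of typed by hand. A rough sketch, assuming the field name and values are ones you already know; `buildOrQuery` and `escapeValue` are hypothetical helper names, not SDK functions. Backslashes and single quotes inside a value are escaped with a backslash, which is what the structured query syntax expects for quoted strings.

```javascript
// Escape backslashes and single quotes inside a quoted string value.
function escapeValue(value) {
  return String(value).replace(/\\/g, '\\\\').replace(/'/g, "\\'");
}

// Build "(or field:'v1' field:'v2' ...)" for one field and its known values.
// With a single value, the bare term is returned without the (or ...) wrapper.
function buildOrQuery(field, values) {
  if (values.length === 0) {
    throw new Error('at least one value is required');
  }
  var terms = values.map(function (v) {
    return field + ":'" + escapeValue(v) + "'";
  });
  return terms.length === 1 ? terms[0] : '(or ' + terms.join(' ') + ')';
}
```

The result can then be used as the `query` value in `params`, for example `buildOrQuery('category', ['books', 'music'])`, where `category` stands in for your own field name.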

answered Sep 29 '22 23:09 by Arnaduga