I have an AWS CloudSearch instance that I am still developing.
At times, such as when I modify the format of a field, I find myself wanting to wipe out all of the data and regenerate it.
Is there any way to clear out all of the data using the console, or do I have to do it programmatically?
If I do have to do it programmatically (i.e. generate and POST a bunch of "delete" SDF files), is there any good way to query for all documents in a CloudSearch instance?
I guess I could just delete and re-create the instance, but that takes a while and loses all of the indexes, rank expressions, text options, etc.
Using aws and jq from the command line (tested with bash on mac):
export CS_DOMAIN=https://yoursearchdomain.yourregion.cloudsearch.amazonaws.com
# Get ids of all existing documents, reformat as
# [{ type: "delete", id: "ID" }, ...] using jq
aws cloudsearchdomain search \
  --endpoint-url="$CS_DOMAIN" \
  --size=10000 \
  --query-parser=structured \
  --search-query="matchall" \
  | jq '[.hits.hit[] | {type: "delete", id: .id}]' \
  > delete-all.json
# Delete the documents
aws cloudsearchdomain upload-documents \
  --endpoint-url="$CS_DOMAIN" \
  --content-type='application/json' \
  --documents=delete-all.json
For more info on jq, see "Reshaping JSON with jq".
Update Feb 22, 2017
Added --size to get the maximum number of documents (10,000) at a time. You may need to run this script multiple times. Also, --search-query can take something more specific if you want to be selective about which documents get deleted.
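The "run it multiple times" step can also be automated. Below is a minimal Node.js sketch, assuming a client shaped like the AWS SDK for JavaScript v2 `AWS.CloudSearchDomain` object (its `search` and `uploadDocuments` calls). The names `deleteAllDocuments` and `maxRounds` are made up for this example. Deletes can take a moment to show up in search results, so a round may return IDs that were already deleted; that is why the loop is capped instead of running open-ended.

```javascript
// Repeatedly fetch up to 10,000 document ids and upload a delete batch for them,
// stopping when a search comes back empty or maxRounds is reached.
// `client` is assumed to behave like an AWS.CloudSearchDomain instance (SDK v2).
async function deleteAllDocuments(client, maxRounds = 20) {
  let sent = 0;
  for (let round = 0; round < maxRounds; round++) {
    const data = await client.search({
      query: 'matchall',
      queryParser: 'structured',
      size: 10000,
    }).promise();

    // One "delete" operation per hit, in the same shape the jq filter produces.
    const batch = data.hits.hit.map((hit) => ({ type: 'delete', id: hit.id }));
    if (batch.length === 0) break;

    await client.uploadDocuments({
      contentType: 'application/json',
      documents: JSON.stringify(batch),
    }).promise();
    sent += batch.length;
  }
  return sent; // number of delete operations sent, not a confirmed deletion count
}
```

With the SDK installed, this would be called as `deleteAllDocuments(new AWS.CloudSearchDomain({ endpoint: '<your CloudSearch endpoint>' }))`.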
Best answer I've been able to find was somewhat buried in the AWS docs. To wit:
Amazon CloudSearch currently does not provide a mechanism for deleting all of the documents in a domain. However, you can clone the domain configuration to start over with an empty domain. For more information, see Cloning an Existing Domain's Indexing Options.
Source: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/Troubleshooting.html#ts.cleardomain
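For the cloning route, here is a rough Node.js sketch of copying index fields into a fresh domain. It assumes a client shaped like the AWS SDK for JavaScript v2 `AWS.CloudSearch` configuration client (`createDomain`, `describeIndexFields`, `defineIndexField`, `indexDocuments`). The function name `cloneIndexFields` is invented here, and the sketch copies index fields only: expressions, suggesters and analysis schemes would need the same describe/define treatment, and a text field that references a custom analysis scheme will likely fail until that scheme exists on the new domain.

```javascript
// Copy index field definitions from an existing domain to a new, empty one.
// `cs` is assumed to behave like an AWS.CloudSearch instance (SDK v2).
async function cloneIndexFields(cs, sourceDomain, targetDomain) {
  await cs.createDomain({ DomainName: targetDomain }).promise();

  const { IndexFields } = await cs
    .describeIndexFields({ DomainName: sourceDomain })
    .promise();

  for (const field of IndexFields) {
    // Each entry is { Options, Status }; Options holds the field definition.
    await cs
      .defineIndexField({ DomainName: targetDomain, IndexField: field.Options })
      .promise();
  }

  // Configuration changes only take effect once the domain has been indexed.
  await cs.indexDocuments({ DomainName: targetDomain }).promise();

  return IndexFields.map((field) => field.Options.IndexFieldName);
}
```

The returned list of field names is only there to make it easy to check what was copied.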
On my side, I used a local Node.js script like this:
var AWS = require('aws-sdk');
var fs = require('fs');

AWS.config.update({
  accessKeyId: '<your AccessKey>',
  secretAccessKey: '<your secretAccessKey>',
  region: '<your region>'
});

// The domain client takes the domain's endpoint, not the search parameters.
var cloudsearchdomain = new AWS.CloudSearchDomain({
  endpoint: '<your CloudSearch endpoint>'
});

var params = {
  // Add more terms inside (or ...) to cover every value of the facet.
  query: "(or <your facet field>:'<one facet value>' <your facet field>:'<another facet value>')",
  queryParser: 'structured',
  size: 10000 // the default is only 10 hits; 10,000 is the maximum
};

cloudsearchdomain.search(params, function (err, data) {
  if (err) {
    console.log("Failed");
    console.log(err);
    return;
  }
  // Build one "delete" operation per matching document id.
  var result = data.hits.hit.map(function (hit) {
    return { type: "delete", id: hit.id };
  });
  fs.writeFile("delete.json", JSON.stringify(result), function (err) {
    if (err) { return console.log(err); }
    console.log("The file was saved!");
  });
});
You need to know at least all the values of one facet to be able to request all IDs. In my code I only put two (in the (or ...) section), but you can have more.
Once that is done, you have a delete.json file to use with the AWS CLI:
aws cloudsearchdomain upload-documents --documents delete.json --content-type application/json --endpoint-url <your CloudSearch endpoint>
... that did the job for me!
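If the facet values are not known up front, the structured parser's `matchall` operator (the one used in the CLI answer above) can stand in for the `(or ...)` query. Here is a minimal sketch of the two pieces that would change in the script; `matchAllParams` and `toDeleteBatch` are hypothetical helper names, not part of the AWS SDK.

```javascript
// Search parameters that match every document, up to the 10,000-hit maximum per request.
function matchAllParams(size = 10000) {
  return { query: 'matchall', queryParser: 'structured', size };
}

// Turn a CloudSearch search response into a batch of "delete" operations.
function toDeleteBatch(data) {
  return data.hits.hit.map((hit) => ({ type: 'delete', id: hit.id }));
}
```

In the script, `matchAllParams()` would replace `params`, and `JSON.stringify(toDeleteBatch(data))` would be what gets written to delete.json.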