
How can I retrieve all searchable (not deleted) documents in Amazon CloudSearch?

I want to retrieve all of my searchable documents from CloudSearch.

I tried a negative search like this:

search-[mySearchEndPoint].cloudsearch.amazonaws.com/2011-02-01/search?bq=(not keywords: '!!!testtest!!!')

It works, but it also returns all the deleted documents.

So how can I get only the active documents?

Thermech asked Jan 28 '13 16:01



2 Answers

The key thing to know is that CloudSearch doesn't really delete documents. Instead, the "delete" operation keeps the document IDs in the index but clears all fields in the deleted docs, which includes setting uint fields to 0. This works fine for positive queries, which match no text in the cleared, "deleted" docs, but a negative query matches those cleared docs too.

A workaround is to add a uint field to your docs, called 'updated' below, to use as a filter for queries that might return deleted IDs, such as negative queries.

(The samples below use the Boto interface library for CloudSearch, with many steps omitted for brevity.)

When you add docs, set the field to the current timestamp:

import time

now_utc = int(time.time())  # unix time in seconds; also used as the doc version
doc['updated'] = now_utc
doc_service.add(id, now_utc, doc)
conn.commit()

When you delete, CloudSearch sets uint fields to 0:

doc_service.delete(id, now_utc)
conn.commit()
# CloudSearch sets doc's 'updated' field = 0
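To make the add/delete behaviour described above concrete, here is a minimal in-memory sketch. It is a toy model only, not the CloudSearch or Boto API: the `ToyIndex` class, its method names, and the sample IDs and timestamps are all hypothetical, and the matching logic is a plain substring-free word check rather than real text search.

```python
class ToyIndex:
    """In-memory stand-in for the delete behaviour described above (not the CloudSearch API)."""

    def __init__(self):
        self.docs = {}  # document id -> fields

    def add(self, doc_id, fields, now):
        # 'updated' carries the add timestamp, so it is always >= 1 for live docs
        self.docs[doc_id] = dict(fields, updated=now)

    def delete(self, doc_id):
        # The ID stays in the index; text fields are cleared and uint fields become 0
        self.docs[doc_id] = {"title": "", "updated": 0}

    def not_title(self, term):
        # Negative query: every ID whose title lacks the term, deleted ones included
        return [i for i, d in self.docs.items() if term not in d["title"].split()]

    def active_not_title(self, term):
        # Same negative query, filtered on updated >= 1 (the 'updated:1..' range)
        return [i for i in self.not_title(term) if self.docs[i]["updated"] >= 1]


index = ToyIndex()
index.add("a", {"title": "first doc"}, now=1359388860)
index.add("b", {"title": "second doc"}, now=1359388861)
index.delete("b")

all_hits = index.not_title("foobar")            # includes the deleted ID "b"
active_hits = index.active_not_title("foobar")  # only the live ID "a"
print(all_hits, active_hits)
```

The point of the model is that a cleared document trivially satisfies any "does not contain X" condition, so only an extra condition on a field that deletion resets can exclude it.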

Now you can distinguish between deleted and active docs in a negative query. The samples below are searching a test index with 86 docs, about half of them deleted.

# negative query that shows both active and deleted IDs
neg_query = "title:'-foobar'"
results = search_service.search(bq=neg_query)
results.hits  # 86 docs in a test index

# deleted items
deleted_query = "updated:0"
results = search_service.search(bq=deleted_query)
results.hits  # 46 of them have been deleted

# negative, filtered query that lists only active
filtered_query = "(and updated:1.. title:'-foobar')"
results = search_service.search(bq=filtered_query)
results.hits  # 40 active docs
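The same filter can be applied to the raw search URL from the question. The sketch below wraps any boolean query in the `updated:1..` range and URL-encodes it; the helper names `active_only` and `search_url`, the `http://` scheme, and the placeholder host are assumptions for illustration, and the `updated` field must already exist in your index as described above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def active_only(bq):
    # Wrap an existing boolean query so that cleared (deleted) docs, whose
    # 'updated' field is 0, fall outside the 1.. range
    return "(and updated:1.. {})".format(bq)


def search_url(endpoint, bq):
    # endpoint is the full search host for your domain (placeholder below);
    # the path follows the 2011-02-01 URL used in the question
    return "http://{}/2011-02-01/search?{}".format(endpoint, urlencode({"bq": bq}))


query = active_only("(not keywords:'!!!testtest!!!')")
url = search_url("search-example.cloudsearch.amazonaws.com", query)
print(url)

# Round-trip check: decoding the URL gives back the exact query string
decoded = parse_qs(urlsplit(url).query)["bq"][0]
```

Encoding the query matters here because the boolean syntax contains spaces, quotes, and parentheses that are not safe to paste into a URL unescaped.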
larham1 answered Sep 27 '22 18:09


I think you can do it like this:

search-[mySearchEndPoint].cloudsearch.amazonaws.com/2011-02-01/search?bq=-impossibleTermToSearch

Note the '-' at the beginning of the term.

Everton Yoshitani answered Sep 27 '22 18:09