The process of search can be broken down into steps such as: query autocompletion (suggesting a query based on the first characters typed), query filtering (token removal, stemming, and lowercasing), and query augmentation (adding synonyms and acronym contraction/expansion).
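As a minimal sketch, the query-filtering step might look like the following. The stop-word list and suffix-stripping rules here are illustrative assumptions, not a production tokenizer or stemmer:

```python
# Sketch of "query filtering": lowercasing, stop-token removal,
# and naive suffix stemming. STOP_WORDS and stem() are toy
# placeholders for a real analyzer.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "is", "was"}

def stem(token: str) -> str:
    """Very naive stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def filter_query(query: str) -> list[str]:
    tokens = query.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(filter_query("The Buying Experience of a new car"))
# → ['buy', 'experience', 'new', 'car']
```

In practice this stage is usually handled by the search engine's own analyzer (e.g. Elasticsearch analyzers) rather than hand-rolled code.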
Semantic search is a data searching technique in which a search query aims not only to find keywords, but to determine the intent and contextual meaning of the words a person is using to search. Semantics refers to the philosophical study of meaning.
In machine learning, semantic search captures the meaning of word inputs such as sentences, paragraphs, and longer text. It uses NLP techniques to understand and process large amounts of text and speech data. This pre-processing stage is called text processing.
There is a problem we are trying to solve: we want to do semantic search over a set of domain-specific data (for example, sentences talking about automobiles).
Our data is just a collection of sentences, and what we want is to supply a phrase and get back the sentences that are contextually similar to it.
For example, suppose I search for the phrase "Buying Experience"; I should get back sentences like:
I found a car that I liked and the purchase process was straightforward and easy
I absolutely hated going car shopping, but today I'm glad I did
I want to emphasize that we are looking for contextual similarity, not just a brute-force word search: even if a sentence uses different words, the search should still be able to find it.
Things that we have already tried:
1. Open Semantic Search. The problem we faced here was generating an ontology from the data we have, or, failing that, finding available ontologies for the domains we are interested in.
2. Elasticsearch (BM25 + tf-idf vectors). We tried this; it returned a few sentences, but precision was not great and overall accuracy was bad as well. Evaluated against a human-curated dataset, it was able to retrieve only around 10% of the sentences.
3. Different embeddings, like the ones mentioned in sentence-transformers. We also went through the examples and evaluated against our human-curated set, and that too had very low accuracy.
4. ELMo. This was better, but still lower accuracy than we expected, and there is a cognitive load in deciding the cosine-similarity value below which we shouldn't consider the sentences. This also applies to point 3.
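One way to reduce the cognitive load of picking that cosine cutoff is to treat it as a small optimization over the human-curated set: sweep candidate thresholds and keep the one that maximizes F1. The scores below are made-up illustrative values, not real model output:

```python
# Choose a cosine-similarity threshold by maximizing F1 over a small
# labeled set of (cosine_score, is_relevant) pairs.
def best_threshold(pairs):
    """pairs: list of (cosine_score, is_relevant) tuples."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted({s for s, _ in pairs}):
        tp = sum(1 for s, rel in pairs if s >= t and rel)
        fp = sum(1 for s, rel in pairs if s >= t and not rel)
        fn = sum(1 for s, rel in pairs if s < t and rel)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical scores from a human-curated evaluation set.
labeled = [(0.82, True), (0.74, True), (0.55, False),
           (0.61, True), (0.40, False), (0.35, False)]
print(best_threshold(labeled))  # → (0.61, 1.0)
```

The same sweep works for the ELMo and sentence-transformers scores mentioned above, since it only needs (score, label) pairs.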
Any help will be appreciated. Thanks a lot in advance.