I am experimenting with elasticsearch as a search server and my task is to build a "semantic" search functionality. From a short text phrase like "I have a burst pipe" the system should infer that the user is searching for a plumber and return all plumbers indexed in elasticsearch.
Can that be done directly in a search server like elasticsearch or do I have to use a natural language processing (NLP) tool like e.g. Maui Indexer. What is the exact terminology for my task at hand, text classification? Though the given text is very short as it is a search phrase.
Semantic Search, a form of search usually used in search engines, serves content to the users understanding the intent and meaning of the user's search query. This search is a step ahead of the traditional text and keyword match search.
By integrating with one of the most popular formats for building NLP models in PyTorch models, Elasticsearch can provide a platform that works with a large variety of NLP tasks and use cases.
In machine learning, semantic search captures the meaning from inputs of words such as sentences, paragraphs, and more. It implements NLP techniques to understand and process large amounts of text and speech data. This is the pre-processing data stage called text processing.
Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene and developed in Java. It started as a scalable version of the Lucene open-source search framework then added the ability to horizontally scale Lucene indices.
There may be several approaches with different implementation complexity.
The easiest one is to create list of topics (like plumbing), attach bag of words (like "pipe"), identify search request by majority of keywords and search only in specified topic (you can add field topic
to your elastic search documents and set it as mandatory with +
during search).
Of course, if you have lots of documents, manual creation of topic list and bag of words is very time expensive. You can use machine learning to automate some of tasks. Basically, it is enough to have distance measure between words and/or documents to automatically discover topics (e.g. by data clustering) and classify query to one of these topics. Mix of these techniques may also be a good choice (for example, you can manually create topics and assign initial documents to them, but use classification for query assignment). Take a look at Wikipedia's article on latent semantic analysis to better understand the idea. Also pay attention to the 2 linked articles on data clustering and document classification. And yes, Maui Indexer may become good helper tool this way.
Finally, you can try to build an engine that "understands" meaning of the phrase (not just uses terms frequency) and searches appropriate topics. Most probably, this will involve natural language processing and ontology-based knowledgebases. But in fact, this field is still in active research and without previous experience it will be very hard for you to implement something like this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With