Semantic Search Engine with ElasticSearch

By Dhnesh on March 14, 2022, 8:05 p.m.

What is ElasticSearch:

ElasticSearch is a NoSQL, document-oriented datastore. It is a distributed, open-source search engine built on Java's Lucene library. It is easy to use and highly scalable, and it is best suited for text data, which is why many NLP projects use ElasticSearch to store and process large amounts of textual data.


Semantic Search:
Before AI came to the mainstream, our computer algorithms could only find or match exact text or certain strings.
But with sophisticated neural networks, AI can understand searcher intent, query context, and the relationships between words. It seems like magic, but it's maths in a nutshell. Semantic search moves beyond the static, literal meaning of a query to understand its specific context. By learning from past results and creating links between entities, AI can use the contextual meaning of terms in a database to generate more relevant and accurate results.

So how do we combine semantic search with the power of ElasticSearch to build a powerful search engine?

You can build extremely powerful search engines on the data that you have. Elasticsearch is easy to configure and offers easy-to-use APIs to ingest data from any source. Before ingesting data into Elasticsearch, we want to clean it. There are many preprocessing techniques, such as tokenization, stop-word removal, and lemmatization, for stripping out unwanted and biased words that would add no meaning to our model. But we do not have to write custom pipelines for that: Elasticsearch has built-in analyzers, so there is no need for custom preprocessing code. With these analyzers, you can choose the rules best suited to your model. Once this is done, we can ingest our data into the ElasticSearch index.
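As a minimal sketch of the analyzer idea, here is an index definition that configures Elasticsearch's built-in standard analyzer with English stop words, so the cleaning happens inside Elasticsearch. The index, field, and analyzer names are hypothetical, and the commented-out lines assume the official Python `elasticsearch` client against a local cluster.

```python
# Sketch: index settings that lean on a built-in analyzer instead of a
# custom preprocessing pipeline. The "standard" analyzer tokenizes and
# lowercases; the "_english_" stopwords list removes common English words.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "article_analyzer": {          # hypothetical analyzer name
                    "type": "standard",
                    "stopwords": "_english_",
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "article_analyzer"}
        }
    },
}

# With a running cluster, the index could be created like this:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.indices.create(index="articles", body=index_settings)  # "articles" is hypothetical
```

After the index exists, documents ingested into it are analyzed automatically at index time, so no separate cleaning step is needed.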


Word Embeddings - 
A word embedding model represents a word as a dense, high-dimensional vector. These vectors aim to capture the contextual properties of each word: words with similar semantic meaning have vectors that are close together in n-dimensional space. For example, the vector for "India" might be close to the vector for "Delhi". Word embeddings are a learned representation of text.
There are many models, such as Word2vec or GloVe, which are trained on millions of words. My personal choice is a multilingual Sentence Transformers BERT model. Once our sentences are passed one by one through this model, we get output vectors, generally of 768 dimensions.


Dense Vector Field -
ElasticSearch offers a dense vector field to store the vectors created above. The dense_vector field type stores dense vectors of float values. Once they are stored, our goal is to run a semantic search. When a search query is passed through the ElasticSearch API, we can use cosine similarity to find the nearest matching vectors. Cosine similarity measures whether two vectors point in the same direction, with a score between -1 and 1. With this score, we can measure document similarity in text analysis.
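To make this concrete, here is a sketch of the cosine similarity measure itself, plus the matching Elasticsearch pieces: a `dense_vector` mapping and a `script_score` query using the built-in `cosineSimilarity` function. The field name `text_vector` is a hypothetical choice, and the query vector would come from embedding the user's search query with the same model used at index time.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])      # 1.0
ortho = cosine_similarity([1.0, 0.0], [0.0, 1.0])     # 0.0

# Matching Elasticsearch mapping for 768-dimensional embeddings:
mapping = {
    "properties": {
        "text_vector": {"type": "dense_vector", "dims": 768}
    }
}

# script_score query: cosineSimilarity can be negative, so 1.0 is added
# to keep Elasticsearch scores non-negative.
query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
            "params": {"query_vector": [0.1] * 768},  # embedding of the search query
        },
    }
}
```

Sending this query via the search API ranks documents by how close their stored vectors are to the query embedding, which is exactly the semantic-search behaviour described above.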