Search Relevance Using BM25 Algorithm
Here, I am implementing the BM25 algorithm to retrieve most relevant items out of a set of strings, given a search query. This is the first part of implementing a bare-bones version of ElasticSearch which also uses the BM25 algorithm internally.
Before getting into the implementation, following is a quick run down of the algorithm itself.
- Documents are strings, containing words (including repeating words).
- New documents can be inserted in the store, given a corresponding ID.
- Documents can be searched using a query string containing one or more words.
- The search function returns the most relevant documents first.
- Relevance is defined by BM25 to be composed of two factors: term frequency and inverse document frequency
- Term frequency is the number of times a word appears in a document. For example, if the word ‘Clojure’ appears 10 times in one document and 2 times in another document, and the query includes the word ‘Clojure’, the first document will be more relevant.
- Inverse document frequency reduces the impact of commonly occuring words. For example, most documents will contain words like ‘this’ and ‘that’; but few documents will contain the word ‘Clojure’. Document frequency tracks the number of documents in which a word appears. So, if the query term contains ‘this Clojure’, the contribution of the word ‘this’ to the final score will be less than the word ‘Clojure’ because, inverse document frequency is defined as 1 / document frequency.
More mathamatically, the scoring function which decides how relevant a document is, given a query, is defined as follows:
This logic is implemented in this repository. Following is a demo of the final result:
To insert a document, use the following text box:
Now to search for the inserted documents, use the following.
As an added utility, an IMDB dataset is available to index 1500 movie plots from my GitHub repository, which can be inserted here:
Now, you, for example, you can search with the string “Atlantic Iceberg” and see the movie Titanic as a result🤞.
The code used for this implementation is available here: https://github.com/oitee/bm25. Pull requests are most welcome!
In a future post, I will compare this implementation with ElasticSearch’s results and also provide a REST interface to interact from the external world with this implementation. Stay tuned!