Loading...
Okapi BM25 with Game of Thrones

Okapi BM25 with Game of Thrones

Used:   elasticsearch v7.3.1 

Since Elasticsearch 5, the default similarity algorithm for Elasticsearch is Okapi BM25. A similarity (scoring/ranking model) defines how matching documents are scored. Performing a search against a set of documents gives you results sorted by relevance. In one of our previous blog posts by Rocco Schulz, BM25 was already mentioned. In this blog article, we are going to look into the inner workings of the Okapi BM25 algorithm.

The name of the actual ranking function is BM25. BM stands for best matching. To set the right context, however, it is usually referred to as “Okapi BM25”, since the Okapi information retrieval system, implemented at London’s City University in the 1980s and 1990s, was the first system to implement this function.

I have a product owner that wants an explanation for BM25. My first reaction to that:

I know nothing

Now bear in mind that it is a super exciting topic. I have to explain a little bit the theory behind before I demonstrate the scoring algorithm. Elasticsearch is based on Apache Lucene. To understand Elasticsearch, we have to understand Apache Lucene.

I choose relatable data to make it more exciting and fun. Game of Thrones comes to the rescue, or more precisely quotes from 5 Game of Thrones books used as data.

Okapi BM25 Basics

Before we start, we need to cover some basics. Okapi BM25 is no joke. Have a look at that monster formula:

BM

It feels like riding a dragon. D is the document and Q is the query. k1 and b are free parameters. In Apache Lucene, k1 is 1.2 and b is 0.75. The source of truth is the lucene BM25Similarity implementation.

Dragon Ride

TF - Term Frequency

The Term Frequency measures how frequently a term occurs in a document. Since every text is different in length, a word could appear much more times in long documents than shorter ones. The more often a term appears, the better its score becomes.

The term frequency is expressed as qi in the length of document (D).

TF Formula

Simply put:

      (Number of times term t appears in a document)
TF =  ---------------------------------------------- 
          (Total number of terms in the document)

IDF - Inverse Document Frequency

The Inverse Document Frequency measures how important a term is. However, it is known that specific words, such as “the”, “is”, “of”, and “that”, may appear a lot of times but have little importance.

Thus we need to weigh down the common terms while scaling up the rare ones, by computing the following:

IDF Formula

Simply put:

                (Total number of documents) 
IDF = log  ---------------------------------------
           (Number of documents with term t in it)

Setup for Data Exploration

I use Elasticsearch v7.3.1. Create the index got with the following command in your Elasticsearch instance.

curl -XPUT "http://localhost:9200/got" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0},
  "mappings": {
    "properties": {
      "quote": { "type": "text", "analyzer": "english" }
    }
  }
}'

Use the Bulk API to ingest the quotes into your Elasticsearch instance.

POST _bulk
{ "index" : { "_index" : "got", "_id" : "1" } }
{ "quote" : "A mind needs books as a sword needs a whetstone, if it is to keep its edge." }
{ "index" : { "_index" : "got", "_id" : "2" } }
{ "quote" : "Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armor yourself in it, and it will never be used to hurt you." }
{ "index" : { "_index" : "got", "_id" : "3" } }
{ "quote" : "Let them see that their words can cut you, and you’ll never be free of the mockery. If they want to give you a name, take it, make it your own. Then they can’t hurt you with it anymore." }
{ "index" : { "_index" : "got", "_id" : "4" } }
{ "quote" : "When you play the game of thrones, you win or you die. There is no middle ground." }
{ "index" : { "_index" : "got", "_id" : "5" } }
{ "quote" : "The common people pray for rain, healthy children, and a summer that never ends. It is no matter to them if the high lords play their game of thrones, so long as they are left in peace." }
{ "index" : { "_index" : "got", "_id" : "6" } }
{ "quote" : "If you would take a man’s life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die." }
{ "index" : { "_index" : "got", "_id" : "7" } }
{ "quote" : "Sorcery is the sauce fools spoon over failure to hide the flavor of their own incompetence." }
{ "index" : { "_index" : "got", "_id" : "8" } }
{ "quote" : "Power resides where men believe it resides. No more and no less." }
{ "index" : { "_index" : "got", "_id" : "9" } }
{ "quote" : "There’s no shame in fear, my father told me, what matters is how we face it." }
{ "index" : { "_index" : "got", "_id" : "10" } }
{ "quote" : "Love is poison. A sweet poison, yes, but it will kill you all the same." }
{ "index" : { "_index" : "got", "_id" : "11" } }
{ "quote" : "What good is this, I ask you? He who hurries through life hurries to his grave." }
{ "index" : { "_index" : "got", "_id" : "12" } }
{ "quote" : "Old stories are like old friends, she used to say. You have to visit them from time to time." }
{ "index" : { "_index" : "got", "_id" : "13" } }
{ "quote" : "The greatest fools are ofttimes more clever than the men who laugh at them." }
{ "index" : { "_index" : "got", "_id" : "14" } }
{ "quote" : "Everyone wants something, Alayne. And when you know what a man wants you know who he is, and how to move him." }
{ "index" : { "_index" : "got", "_id" : "15" } }
{ "quote" : "Always keep your foes confused. If they are never certain who you are or what you want, they cannot know what you are like to do next. Sometimes the best way to baffle them is to make moves that have no purpose, or even seem to work against you." }
{ "index" : { "_index" : "got", "_id" : "16" } }
{ "quote" : "One voice may speak you false, but in many there is always truth to be found." }
{ "index" : { "_index" : "got", "_id" : "17" } }
{ "quote" : "History is a wheel, for the nature of man is fundamentally unchanging." }
{ "index" : { "_index" : "got", "_id" : "18" } }
{ "quote" : "Knowledge is a weapon, Jon. Arm yourself well before you ride forth to battle." }
{ "index" : { "_index" : "got", "_id" : "19" } }
{ "quote" : "I prefer my history dead. Dead history is writ in ink, the living sort in blood." }
{ "index" : { "_index" : "got", "_id" : "20" } }
{ "quote" : "In the game of thrones, even the humblest pieces can have wills of their own. Sometimes they refuse to make the moves you’ve planned for them. Mark that well, Alayne. It’s a lesson that Cersei Lannister still has yet to learn." }
{ "index" : { "_index" : "got", "_id" : "21" } }
{ "quote" : "Every man should lose a battle in his youth, so he does not lose a war when he is old." }
{ "index" : { "_index" : "got", "_id" : "22" } }
{ "quote" : "A reader lives a thousand lives before he dies. The man who never reads lives only one." }
{ "index" : { "_index" : "got", "_id" : "23" } }
{ "quote" : "The fisherman drowned, but his daughter got Stark to the Sisters before the boat went down. They say he left her with a bag of silver and a bastard in her belly. Jon Snow, she named him, after Arryn." }
{ "index" : { "_index" : "got", "_id" : "24" } }
{ "quote" : "You could make a poultice out of mud to cool a fever. You could plant seeds in mud and grow a crop to feed your children. Mud would nourish you, where fire would only consume you, but fools and children and young girls would choose fire every time." }
{ "index" : { "_index" : "got", "_id" : "25" } }
{ "quote" : "Men live their lives trapped in an eternal present, between the mists of memory and the sea of shadow that is all we know of the days to come." }
{ "index" : { "_index" : "got", "_id" : "26" } }
{ "quote" : "No. Hear me, Daenerys Targaryen. The glass candles are burning. Soon comes the pale mare, and after her the others. Kraken and dark flame, lion and griffin, the sun’s son and the mummer’s dragon. Trust none of them. Remember the Undying. Beware the perfumed seneschal." }

Search for Wisdom

Now let’s search for quotes, that contains the term live in it.

GET got/_search
{"query":{"match":{"quote":"live"}}}

We get these quotes based on their score (ranking) from BM25.

| Score     | Quote                                                                            |
| --------- | -------------------------------------------------------------------------------- |
| 3.3297362 | A reader lives a thousand lives before he dies.                                  |
|           | The man who never reads lives only one.                                          |                                          
| --------- | -------------------------------------------------------------------------------- |
| 2.847715  | Men live their lives trapped in an eternal present, between the mists of memory  |
|           | and the sea of shadow that is all we know of the days to come.                   |
| --------- | -------------------------------------------------------------------------------- |
| 2.313831  | I prefer my history dead. Dead history is writ in ink, the living sort in blood. |
| --------- | -------------------------------------------------------------------------------- |

Scoring Explained

To understand the score, you can always add the explain parameter to each query.

GET got/_search
{"query":{"match":{"quote":"live"}},"explain":true}

Let us examine the score for the following quote in detail.

A reader lives a thousand lives before he dies. The man who never reads lives only one.

The top-down approach is beneficial to understand the calculation.

The score 3.3297362 is the sum of the BM25-score of the term live.

{
  "_explanation": {
    "value": 3.3297362,
    "description": "weight(quote:live in 21) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 3.3297362,
        "description": "score(freq=3.0), product of:",
        "details": [
          {
            "value": 2.2,
            "description": "boost"
          },
          {
            "value": 2.043074,
            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:"
          },
          {
            "value": 0.74080354,
            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:"
          }
        ]
      }
    ]
  }
}

Or simply written:

2.2 * 2.043074 * 0.74080354 = 3.3297362

Term Frequency Explained

Now let us look into the Term Frequency calculation.

{
  "value" : 0.74080354,
  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
  "details" : [
    {
      "value" : 3.0,
      "description" : "freq, occurrences of term within document",
      "details" : [ ]
    },
    {
      "value" : 1.2,
      "description" : "k1, term saturation parameter",
      "details" : [ ]
    },
    {
      "value" : 0.75,
      "description" : "b, length normalization parameter",
      "details" : [ ]
    },
    {
      "value" : 14.0,
      "description" : "dl, length of field",
      "details" : [ ]
    },
    {
      "value" : 16.807692,
      "description" : "avgdl, average length of field",
      "details" : [ ]
    }
  ]
}

The details of the Term Frequency explanation:

  • freq = term live appears three times in the document
  • k1 is 1.2
  • b is 0.75
  • dl = the document length (number of tokens) is 14
  • avgdl = the average document length of all quotes is 16.807692

If I take above formula and calculate it with my old school calculator:

freq / (freq + k1 * (1 - b + b * dl / avgdl) )
=====================================================
3 / ( 3 + 1.2 * ( 1 - 0.75 + 0.75 * 14 / 16.807692) ) 
3 / ( 3 + 1.2 * 0.87471397)
3 / 4.049656764 = 0.740803524

0.740803524 has a minor discrepancy to 0.74080354.

If we check the formula for the quote:

Men live their lives trapped in an eternal present, between the mists of memory and the sea of shadow that is all we know of the days to come.

We get

  • freq = term live appears two times in the document
  • k1 is 1.2
  • b is 0.75
  • dl = the document length (number of tokens) is 16
  • avgdl = the average document length of all quotes is 16.807692
freq / (freq + k1 * (1 - b + b * dl / avgdl) )
=====================================================
2 / ( 2 + 1.2 * ( 1 - 0.75 + 0.75 * 16 / 16.807692) ) 
2 / ( 2 + 1.2 * 0.963958823 )
2 / 3.156750588 = 0.633562882

The first quote term frequency is higher (0.74080354 > 0.633562882).

Inverse Document Frequency Explained

The IDF explanation of Elasticsearch:

  • N = 26 documents in Elasticsearch with the quote field
  • n = 3 documents with the term live
{
  "value" : 2.043074,
  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
  "details" : [
    {
      "value" : 3,
      "description" : "n, number of documents containing term",
      "details" : [ ]
    },
    {
      "value" : 26,
      "description" : "N, total number of documents with field",
      "details" : [ ]
    }
  ]
}

Calculation as implemented in Apache Lucence

long docCount = 26L;
long docFreq = 3L;
float idf = (float) Math.log(1 + (docCount - docFreq + 0.5D)/(docFreq + 0.5D));
System.out.println(idf); //2.043074

Summary

In this article, you have learned about the inner workings of Okapi BM25 and their usage in Elasticsearch and Apache Lucene.

BM25 is fun

Now you have all the knowledge to explain BM25 proficiently. As a final challenge, try to understand the scoring for this search.

GET got/_search
{"query":{"match":{"quote":"fool"}}}

In the end, I conclude with the wisdom of George R. Martin:

A reader lives a thousand lives before he dies. The man who never reads lives only one.

Please remember the terms for blog comments.