Using a custom function to score documents in ElasticSearch

At my previous position ElasticSearch formed an important part of our back-end infrastructure, and over the past year and a half we were expanding our use of ElasticSearch. It wasn’t our primary data store but it became a highly important secondary store that we used to query our data in ways that would be difficult or prohibitive in our MySQL database.

In general, you interact with ElasticSearch through its REST API. Working with ES means learning this API, just as working with a relational database means understanding SQL. I’ve been working with ES for the past year and a half and have come to a reasonable understanding of it. I’ve been wanting to write up some of my notes and observations, so this short article will probably be the first of a few posts on working with ES.

After you have added your documents to ES you will of course run different kinds of queries against them. ES retrieves the documents that match and computes a score for each one, which it uses to order the results. When you get back your results you can see this value as _score.

What if the way ES scores documents doesn’t work for you though?  What can you do then? It turns out that there is support for providing your own scoring function for documents.

A small example

To illustrate this I use a very simplified library catalog index; the data, mapping, and queries for it are in a GitHub project. This was run with ES 2.4.1. (Presumably this should work with ES 5, but that is still in alpha as of this writing and I haven’t tested it yet.)

One of the features we used in almost all of our queries was the ability to supply a custom score instead of letting ES score documents. To illustrate the mechanics, let’s say we create an index and add several documents to it. In this example we have a small set of books in an index named ‘library’, and we start with a basic query that matches on the title:

{
  "from" : 0,
  "size" : 10,
  "query": {
    "bool": {
      "must": [ 
           { "match": { "title" : "world" } } 
      ]
    }
  }
}

When you run this query you will receive results ordered according to the score computed by ElasticSearch. Now that we have this basic query, we’ll modify it to use our own formula for scoring.
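As a rough sketch of what running the query over the REST API might look like from code, the Python below builds the HTTP request for the basic query and pulls _score out of a parsed response. The host and port (localhost:9200) and the helper names build_search_request and scores are assumptions made for illustration; only the ‘library’ index name and the query body come from the example above.

```python
import json
import urllib.request

# Assumed for illustration: a local node on the default port.
ES_URL = "http://localhost:9200"

# The basic query from above, as a Python dict.
basic_query = {
    "from": 0,
    "size": 10,
    "query": {"bool": {"must": [{"match": {"title": "world"}}]}},
}


def build_search_request(index, body, base_url=ES_URL):
    """Build (but do not send) a POST to the index's _search endpoint."""
    return urllib.request.Request(
        url="{}/{}/_search".format(base_url, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scores(response):
    """Return (_id, _score) pairs from a parsed search response."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


request = build_search_request("library", basic_query)

# To run it against a live node, something like:
# with urllib.request.urlopen(request) as resp:
#     print(scores(json.load(resp)))
```

Nothing is sent until urlopen is called, so the request can be inspected or logged first.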

To introduce our own formula we add a function_score element that includes our original query and then specifies a script score.

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "world" } }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "(doc['pub_year'].value / 1000) * 2"
          }
        }
      ]
    }
  }
}

In this case our script just takes the publication year, divides it by 1000, and multiplies the result by 2, which is admittedly sort of arbitrary. The point is only to illustrate the mechanics of how the query is constructed and to show that you can access other fields in the document.
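To make the arithmetic concrete, here is the same formula as a small Python function. It is only a stand-in for the inline script, and it assumes the division is not truncated to an integer; the function name script_score is just for illustration.

```python
def script_score(pub_year):
    # Mirrors the inline script (doc['pub_year'].value / 1000) * 2,
    # assuming the division keeps its fractional part.
    return (pub_year / 1000) * 2


# Publication years that appear in the example index.
for year in (2008, 2011, 2013):
    print(year, round(script_score(year), 3))
```

Under that assumption a book from 2008 gets 4.016 from the script, one from 2011 gets 4.022, and one from 2013 gets 4.026.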

Configuring ES to allow in-line scripts

If you run this query on a vanilla ES 2.4.x install then you will likely see the following error message:

{
    "error": {
        "root_cause": [
            {
                "type": "script_exception",
                "reason": "scripts of type [inline], operation [search] and lang [groovy] are disabled"
            }
        ],
        ...
    },
    "status": 400
}

To get around this problem you need to edit the ~elasticsearch/config/elasticsearch.yml file to add a couple of lines:

script.inline: true
script.indexed: true

You will then need to restart ElasticSearch for this change to take effect.

NOTE: This is important! Making this change essentially allows code to be run on your ES nodes via a query. If you run this for real in production, be very sure that ES is well isolated in your network and sits behind some other API, so that you control which queries reach it.
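One way to keep that control, sketched below in Python, is for the API in front of ES to build the whole query itself and accept only the search text from callers, so nobody outside can supply script text. The names build_library_query, SCORE_SCRIPT, and MAX_SIZE, and the size cap of 50, are hypothetical choices for this sketch rather than anything ES provides.

```python
# Fixed on the server side; callers never supply their own script text.
SCORE_SCRIPT = "(doc['pub_year'].value / 1000) * 2"
MAX_SIZE = 50  # hypothetical cap on how many hits a caller may request


def build_library_query(title_text, size=10):
    """Turn a caller's search text into the full function_score query body."""
    if not isinstance(title_text, str) or not title_text.strip():
        raise ValueError("title_text must be a non-empty string")
    # Clamp the page size so callers cannot request arbitrarily large pages.
    size = max(1, min(int(size), MAX_SIZE))
    return {
        "from": 0,
        "size": size,
        "query": {
            "function_score": {
                "query": {
                    "bool": {"must": [{"match": {"title": title_text}}]}
                },
                "functions": [{"script_score": {"script": SCORE_SCRIPT}}],
            }
        },
    }


body = build_library_query("world")
```

The caller’s text only ever lands in the match clause as a value, never in the script.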

The Gotcha …

Now that we’ve done that we run into another problem. If you look at our results you will see that the actual score is not just the publication year divided by 1000 and multiplied by 2, but some other number.

"hits" : {
    "total" : 5,
    "max_score" : 2.1166306,
    "hits" : [ {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJH",
      "_score" : 2.1166306,
      "fields" : {
        "title" : [ "Empire: How Britain Made the Modern World" ],
        "pub_year" : [ 2008 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJS",
      "_score" : 1.620065,
      "fields" : {
        "title" : [ "The Russian Origins of the First World War" ],
        "pub_year" : [ 2013 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJJ",
      "_score" : 1.6184554,
      "fields" : {
        "title" : [ "Empires in World History: Power and the Politics of Difference" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJR",
      "_score" : 1.4131951,
      "fields" : {
        "title" : [ "Immanuel Wallerstein and the Problem of the World: System, Scale, Culture" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJK",
      "_score" : 1.256875,
      "fields" : {
        "title" : [ "How to Change the World: Reflections on Marx and Marxism" ],
        "pub_year" : [ 2011 ]
      }
    } ]
  }

The default behavior in ElasticSearch is to multiply the query’s computed _score by your script score (boost_mode defaults to ‘multiply’). This can be somewhat unexpected since you explicitly provided a scoring function. The solution is to set ‘boost_mode’ appropriately; to use only our computed score, set it to ‘replace’.

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "title": "world"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "(doc['pub_year'].value / 1000) * 2"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}
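To see what boost_mode changes, here is a small Python sketch of how the query score and the function score are combined under each mode. It is a simplified model for illustration, not the ES implementation, and the name combine is made up. It takes the first hit from the results shown earlier (pub_year 2008, _score 2.1166306) and assumes the script evaluated to (2008 / 1000) * 2 = 4.016.

```python
def combine(query_score, function_score, boost_mode="multiply"):
    """Simplified model of how boost_mode merges the two scores."""
    modes = {
        "multiply": lambda q, f: q * f,   # the default
        "replace": lambda q, f: f,        # ignore the query score
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "max": max,
        "min": min,
    }
    return modes[boost_mode](query_score, function_score)


function_value = (2008 / 1000) * 2   # what the script yields for 2008
returned_score = 2.1166306           # _score of the first hit shown earlier

# Under the default 'multiply', the query's own score must have been:
implied_query_score = returned_score / function_value

print(round(implied_query_score, 3))
print(combine(implied_query_score, function_value, "replace"))
```

Under these assumptions the query’s own relevance score for that book was roughly 0.527. With ‘replace’ it would score 4.016, and the 2013 book would score 4.026 and move ahead of it.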

 

External References

Function Score Query | ElasticSearch Reference [2.4]

Enabling Dynamic Scripting | ElasticSearch Reference [2.4]