Bulk loading data into ElasticSearch

Every database needs a way to load multiple records at once. With MySQL you can import a CSV file, for instance.  ElasticSearch has a way too, and it generally works well.  There are just a couple of important points that can trip you up.

  1. Bulk API entry point. In this example let’s say we have an index called ‘library’ (referenced in the previous blog post on ES) and in that we have a type named, simply, ‘books’.  The API URL will have _bulk on the end of it.

    http://localhost:9200/library/books/_bulk

    Now, as it happens… the index and type parts of this URL are optional, because you can specify them in the data file itself.  So all you really need is this:

    http://localhost:9200/_bulk

  2. You can do a curl POST to this endpoint and load your data file.  The second point is to use the --data-binary option to preserve newline characters.  We can also reference our actual data file using @data_file_path.  So our curl call looks like this:

    curl -XPOST 'http://localhost:9200/library/books/_bulk' --data-binary @library_entries.json

  3. Now, about the data file itself.  This is not a single legal JSON document, but rather a series of JSON entries, one per line.  Here is an abbreviated sample:
    { "index" : { "_index" : "library", "_type": "books" }}
    { "isbn13" : "978-0553290998", "title" : "Nightfall", "authors" : "Isaac Asimov, Robert Silverberg" }
    { "index" : { "_index" : "library", "_type" : "books" } }
    { "isbn13"   : "978-0141007540", "title"  : "Empire: How Britain Made the Modern World", "authors": "Niall Ferguson" }
    { "index" : { "_index" : "library", "_type" : "books" } }
    { "isbn13" : "978-0199931156", "title" : "The Rule of Empires", "authors" : "Timothy H. Parsons" }
    
    

    You want to view the data file as pairs of lines.  The first line tells ES what to do: in this case index (i.e. insert) an entry into a specified index and type.  You can also delete or update content.

    The second line is the actual content: the data for the entry.  The one trick here is that you have to have the full entry on ONE LINE.  If you try to split it up (say, to make the thing more readable) then that will confuse ES, because it uses newlines to separate entries.  So one line per entry.

  4. Finally… the last line of your data file must end with a newline character (in most editors this shows up as a blank last line).  If it doesn’t then you will likely lose the last entry.
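
The points above can be sketched in a few lines of Python. This is only a minimal sketch, not the official client: build_bulk_body is a hypothetical helper name, and it assumes each record is a plain dictionary. It writes one action line and one content line per record and ends every line, including the last, with a newline.

```python
import json

def build_bulk_body(records, index, doc_type):
    """Build a bulk payload: one action line plus one content line per record.

    Hypothetical helper for illustration; not part of any ES client library.
    """
    lines = []
    for record in records:
        # Action line: tells ES to index the entry into the given index/type.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Content line: the whole entry, serialized onto a single line.
        lines.append(json.dumps(record))
    # Terminate every line with "\n", so the final entry ends with a newline too.
    return "".join(line + "\n" for line in lines)

books = [
    {"isbn13": "978-0553290998", "title": "Nightfall",
     "authors": "Isaac Asimov, Robert Silverberg"},
    {"isbn13": "978-0199931156", "title": "The Rule of Empires",
     "authors": "Timothy H. Parsons"},
]

body = build_bulk_body(books, "library", "books")
print(body, end="")
```

Because json.dumps does not emit raw newlines by default, each entry stays on one line. The resulting string could be saved to a file such as library_entries.json and posted with the curl call shown earlier.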

External References
Bulk API | ElasticSearch [2.4]

Using a custom function to score documents in Elastic Search

At my previous position ElasticSearch formed an important part of our back-end infrastructure, and over the past year and a half we were expanding our use of ElasticSearch. It wasn’t our primary data store but it became a highly important secondary store that we used to query our data in ways that would be difficult or prohibitive in our MySQL database.

In general, the way you will interact with ElasticSearch is to use the REST API. Working with ES means learning this API, just as working with a relational database means understanding SQL.  For the past year and a half I’ve been working with ES and have come to an understanding of it. I’ve been wanting to write up some of my notes and observations on ES, so this short article will probably be the first of a few such posts on working with ES.

After you have added your set of documents in ES you will of course do different kinds of queries.  ES will then retrieve the documents that match and compute a score value that is used to order the results.  When you get back your results you can see this value as _score.

What if the way ES scores documents doesn’t work for you though?  What can you do then? It turns out that there is support for providing your own scoring function for documents.

A small example

To illustrate this there is a very simplified library catalog index; the data, mapping, and queries for it are in a GitHub project. This was run with ES 2.4.1.  (Presumably this should work with ES 5 but that is still in Alpha as of this writing and I haven’t tested it out yet.)

One of the features we used in almost all of our queries was the ability to create a custom score instead of letting ES score documents.  To illustrate the mechanics of this let’s say we create an index and add several documents to it. In this example we have a small set of books in an index named ‘library’.

{
  "from" : 0,
  "size" : 10,
  "query": {
    "bool": {
      "must": [ 
           { "match": { "title" : "world" } } 
      ]
    }
  }
}

When you run this query you will receive back results ordered according to the score computed by ElasticSearch.  Now that we have this basic query we’ll modify it to use our own custom formula for scoring.

To introduce our own formula we add a function_score element that includes our original query and then specifies a script score.

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "world" } }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "(doc['pub_year'].value / 1000) * 2"
          }
        }
      ]
    }
  }
}

In this case our script just takes the publication year, divides it by 1000 and then multiplies it by 2, which is admittedly sort of random… but the point is just to illustrate the mechanics of how the query is constructed and to show you can access other fields in the document.
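
If you build your queries in code rather than by hand, the wrapping step can be sketched like this in Python. The helper name with_script_score is made up for illustration; it simply nests an existing query inside a function_score element with a single script_score function, mirroring the JSON above.

```python
import json

def with_script_score(query, script):
    """Wrap an existing query clause in function_score with one script_score.

    Hypothetical helper for illustration; it only builds the request body.
    """
    return {
        "function_score": {
            "query": query,
            "functions": [
                {"script_score": {"script": script}}
            ],
        }
    }

# The original query from the first example.
base_query = {"bool": {"must": [{"match": {"title": "world"}}]}}

request_body = {
    "from": 0,
    "size": 10,
    "query": with_script_score(base_query, "(doc['pub_year'].value / 1000) * 2"),
}

print(json.dumps(request_body, indent=2))
```

Keeping the original query as a separate value makes it easy to run the same search with and without the custom score while you compare results.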

Configuring ES to allow in-line scripts

If you run this query on a vanilla ES 2.4.x install then you will likely see the following error message:

{
    "error": {
        "root_cause": [
            {
                "type": "script_exception",
                "reason": "scripts of type [inline], operation [search] and lang [groovy] are disabled"
            }
        ],
        ...
    "status": 400
}

To get around this problem you need to edit the ~elasticsearch/config/elasticsearch.yml file to add a couple of lines:

script.inline: true
script.indexed: true

You will then need to restart ElasticSearch for this change to take effect.

NOTE:  This is sorta important! Making this change essentially allows code to be run via a query. If you run this for real in production, you will want to be very sure to have ES nicely isolated in your network and behind some other API where you control what queries reach it.

The Gotcha …

Now that we’ve done that we run into another problem. If you look at our results you will see that the actual score is not just the publication year divided by 1000 and multiplied by 2, but some other number.

"hits" : {
    "total" : 5,
    "max_score" : 2.1166306,
    "hits" : [ {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJH",
      "_score" : 2.1166306,
      "fields" : {
        "title" : [ "Empire: How Britain Made the Modern World" ],
        "pub_year" : [ 2008 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJS",
      "_score" : 1.620065,
      "fields" : {
        "title" : [ "The Russian Origins of the First World War" ],
        "pub_year" : [ 2013 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJJ",
      "_score" : 1.6184554,
      "fields" : {
        "title" : [ "Empires in World History: Power and the Politics of Difference" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJR",
      "_score" : 1.4131951,
      "fields" : {
        "title" : [ "Immanuel Wallerstein and the Problem of the World: System, Scale, Culture" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJK",
      "_score" : 1.256875,
      "fields" : {
        "title" : [ "How to Change the World: Reflections on Marx and Marxism" ],
        "pub_year" : [ 2011 ]
      }
    } ]
  }
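
A quick way to see the mismatch is to recompute the formula by hand for a couple of the hits above. This is a rough Python sketch: script_value is a hypothetical stand-in for the script, and it assumes the division in the script is not truncated to a whole number.

```python
# Two hits copied from the response above: (title, pub_year, returned _score).
hits = [
    ("Empire: How Britain Made the Modern World", 2008, 2.1166306),
    ("The Russian Origins of the First World War", 2013, 1.620065),
]

def script_value(pub_year):
    """Python stand-in for the script (doc['pub_year'].value / 1000) * 2."""
    return (pub_year / 1000) * 2

for title, pub_year, returned in hits:
    expected = script_value(pub_year)
    print(f"{title}: script value {expected:.3f}, returned _score {returned}")
```

For both books the script value is a little over 4, while the returned _score is well under 3, so something other than the script alone is determining the final score.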

The default behavior in ElasticSearch is to multiply the _score computed for the query by your script score. This can be somewhat unexpected since you explicitly provided a scoring function.  The solution is to set ‘boost_mode’ appropriately; to use just our computed score, set it to ‘replace’.

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "title": "world"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "doc['title'].size() * 2"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}
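
To see what boost_mode changes, here is a small Python sketch of how the two numbers get combined. The mode names are the ones listed in the Function Score Query reference; the combine helper and the query score of 0.5 are made up for illustration, and the script value uses the publication-year formula from earlier.

```python
def combine(query_score, function_score, boost_mode="multiply"):
    """Combine the query's score with the function score.

    Illustrative only: mimics the documented boost_mode options,
    with "multiply" as the default.
    """
    modes = {
        "multiply": lambda q, f: q * f,
        "replace": lambda q, f: f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "max": lambda q, f: max(q, f),
        "min": lambda q, f: min(q, f),
    }
    return modes[boost_mode](query_score, function_score)

script_score = (2008 / 1000) * 2   # publication-year formula for a 2008 book
query_score = 0.5                  # hypothetical relevance score from the match query

print("multiply:", combine(query_score, script_score))
print("replace: ", combine(query_score, script_score, "replace"))
```

With ‘multiply’ the relevance score still influences the ordering; with ‘replace’ only the script’s value is used, which is what you want when your formula is meant to be the whole score.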

 

External References

Function Score Query | ElasticSearch Reference [2.4]

Enabling Dynamic Scripting | ElasticSearch Reference [2.4]