Category Archives: Databases

Bulk loading data into ElasticSearch

Every database needs a way to load multiple records into it. With MySQL you can import a CSV file, for instance.  ElasticSearch has a way too, and it generally works well.  There are just a couple of important points that can trip one up.

  1. Bulk API entry point. In this example let’s say we have an index called ‘library’ (referenced in the previous blog post on ES) and in that we have a type named, simply, ‘books’.  The API URL will have _bulk on the end of it.

    http://localhost:9200/library/books/_bulk

    Now, as it happens… the index and type parts of the URL are not strictly needed, because you can put that information into the data file itself.  So all you really need is this:

    http://localhost:9200/_bulk

  2. You can do a curl POST to this endpoint to load your data file.  The second point is to use the --data-binary option (rather than -d) so that newline characters are preserved.  We can also reference our actual data file using @data_file_path.  So our curl call looks like this:

     curl -XPOST 'http://localhost:9200/library/books/_bulk' --data-binary @library_entries.json
  3. Now, about the data file itself.  This is not exactly legal JSON, but rather a series of JSON entries, one per line.  Here is an abbreviated sample:
    { "index" : { "_index" : "library", "_type": "books" }}
    { "isbn13" : "978-0553290998", "title" : "Nightfall", "authors" : "Isaac Asimov, Robert Silverberg" }
    { "index" : { "_index" : "library", "_type" : "books" } }
    { "isbn13"   : "978-0141007540", "title"  : "Empire: How Britain Made the Modern World", "authors": "Niall Ferguson" }
    { "index" : { "_index" : "library", "_type" : "books" } }
    { "isbn13" : "978-0199931156", "title" : "The Rule of Empires", "authors" : "Timothy H. Parsons" }
    
    

    You want to view the data file as pairs of lines.  The first line tells ES what to do — in this case index (i.e. insert) an entry into a specified index and type.  You can also delete or update content.

    The second line is the actual content: the data for the entry itself.  The one trick here is that the full entry has to be on ONE LINE.  If you try to split it up (say, to make it more readable) you will confuse ES.  So: one line per entry.

  4. Finally… your data file must end with a trailing newline!  If it doesn’t, then you will likely lose the last entry.
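The pairs-of-lines format, and the required trailing newline, are easy to generate programmatically. Below is a minimal Python sketch, assuming a local ES 2.x with the ‘library’ index and ‘books’ type from above; the bulk_body helper is my own name, not part of any client library.

```python
import json

def bulk_body(docs, index="library", doc_type="books"):
    """Build the newline-delimited bulk body: an action line followed by
    a source line for each document, plus the required trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # without this final newline the last entry is lost

books = [
    {"isbn13": "978-0553290998", "title": "Nightfall",
     "authors": "Isaac Asimov, Robert Silverberg"},
]
payload = bulk_body(books)
# Then POST it, e.g.: requests.post("http://localhost:9200/_bulk", data=payload)
```

The commented-out POST at the end stands in for the curl call shown above.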

External References
Bulk API | ElasticSearch [2.4]

Using a custom function to score documents in Elastic Search

At my previous position ElasticSearch formed an important part of our back-end infrastructure, and over the past year and a half we steadily expanded our use of it. It wasn’t our primary data store, but it became a highly important secondary store that we used to query our data in ways that would be difficult or prohibitive in our MySQL database.

In general, the way you will interact with ElasticSearch is to use the REST API. Working with ES means learning this API, just as working with a relational database means understanding SQL.  Over the past year and a half of working with ES I’ve come to a reasonable understanding of it, and I’ve been wanting to write up some of my notes and observations. This short article will probably be the first of a few such posts on working with ES.

After you have added your set of documents in ES you will of course do different kinds of queries.  ES will then retrieve the documents that match and compute a score value that is used to order the results.  When you get back your results you can see this value as _score.

What if the way ES scores documents doesn’t work for you though?  What can you do then? It turns out that there is support for providing your own scoring function for documents.

A small example

To illustrate this there is a very simplified library catalog index; the data, mapping, and queries for it are in a github project. This was run with ES 2.4.1.  (Presumably this should also work with ES 5, but that is still in alpha as of this writing and I haven’t tested it.)

One of the features we used in almost all of our queries was the ability to compute a custom score instead of letting ES score documents.  To illustrate the mechanics of this let’s say we create an index and add several documents to it. In this example we have a small set of books in an index named ‘library’.

{
  "from" : 0,
  "size" : 10,
  "query": {
    "bool": {
      "must": [ 
           { "match": { "title" : "world" } } 
      ]
    }
  }
}

When you run this query you will receive back results ordered according to the score computed by ElasticSearch.  Now that we have this basic query we’ll modify it to do our own custom formula for scoring.

To introduce our own formula we add a function_score element that includes our original query and then specifies a script score.

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "world" } }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "(doc['pub_year'].value / 1000) * 2"
          }
        }
      ]
    }
  }
}

In this case our script just takes the publication year, divides it by 1000, and multiplies by 2. That formula is admittedly arbitrary, but the point is just to illustrate the mechanics of how the query is constructed and to show that you can access other fields in the document.

Configuring ES to allow in-line scripts

If you run this query on a vanilla ES 2.4.x install then you will likely see the following error message:

{
    "error": {
        "root_cause": [
            {
                "type": "script_exception",
                "reason": "scripts of type [inline], operation [search] and lang [groovy] are disabled"
            }
        ],
        ...
    "status": 400
}

To get around this problem you need to edit the ~elasticsearch/config/elasticsearch.yml file to add a couple of lines:

script.inline: true
script.indexed: true

You will then need to restart ElasticSearch for this change to take effect.

NOTE:  This is sorta important! Making this change in ES essentially allows code to be run via a query. If you run this for real in production, be very sure to have ES nicely isolated in your network and behind some other API layer so that you control what queries reach it.

The Gotcha …

Now that we’ve done that, we run into another problem. If you look at our results you will see that the actual score is not just the publication year divided by 1000 and multiplied by 2, but some other number.

"hits" : {
    "total" : 5,
    "max_score" : 2.1166306,
    "hits" : [ {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJH",
      "_score" : 2.1166306,
      "fields" : {
        "title" : [ "Empire: How Britain Made the Modern World" ],
        "pub_year" : [ 2008 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJS",
      "_score" : 1.620065,
      "fields" : {
        "title" : [ "The Russian Origins of the First World War" ],
        "pub_year" : [ 2013 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJJ",
      "_score" : 1.6184554,
      "fields" : {
        "title" : [ "Empires in World History: Power and the Politics of Difference" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJR",
      "_score" : 1.4131951,
      "fields" : {
        "title" : [ "Immanuel Wallerstein and the Problem of the World: System, Scale, Culture" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJK",
      "_score" : 1.256875,
      "fields" : {
        "title" : [ "How to Change the World: Reflections on Marx and Marxism" ],
        "pub_year" : [ 2011 ]
      }
    } ]
  }

The default behavior in ElasticSearch is to take your script score and multiply the computed _score by it. This can be somewhat unexpected, since you explicitly provided a scoring function.  The solution is to set ‘boost_mode‘ appropriately; in the case of using just our computed score, we set it to ‘replace’.
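The effect of each mode can be seen with plain arithmetic. This sketch just mirrors, in Python, how ES 2.x combines the query’s _score with the function’s value; the mode names are the documented boost_mode values, while the function name is mine:

```python
def combined_score(query_score, function_score, boost_mode="multiply"):
    # How ES combines the query's _score with the script's value for each
    # boost_mode; 'multiply' is the default in ES 2.x.
    modes = {
        "multiply": query_score * function_score,
        "replace": function_score,
        "sum": query_score + function_score,
        "avg": (query_score + function_score) / 2.0,
        "max": max(query_score, function_score),
        "min": min(query_score, function_score),
    }
    return modes[boost_mode]
```

With the default, a script value of roughly 4 multiplied by a relevance score of roughly 0.53 is how a hit can end up around the 2.12 we saw above; with ‘replace’ the script value becomes the final score.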

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "title": "world"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "doc['title'].size() * 2"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}
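Since these query bodies are usually built from application code anyway, the wrapping step can be captured in a small helper. A sketch, with a function name of my own choosing:

```python
def with_script_score(query, script, boost_mode="replace", size=10):
    # Wrap an existing query body in a function_score scored by 'script'.
    # boost_mode='replace' makes the script's value the final _score; the
    # ES 2.x default, 'multiply', combines it with the relevance score.
    return {
        "from": 0,
        "size": size,
        "query": {
            "function_score": {
                "query": query,
                "functions": [{"script_score": {"script": script}}],
                "boost_mode": boost_mode,
            }
        },
    }

match_world = {"bool": {"must": [{"match": {"title": "world"}}]}}
body = with_script_score(match_world, "(doc['pub_year'].value / 1000) * 2")
```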

 

External References

Function Score Query | ElasticSearch Reference [2.4]

Enabling Dynamic Scripting | ElasticSearch Reference [2.4]

 

Cypher and Neo4j: Part I

A few months ago I started working with graph databases. This post is part of a series aimed at documenting how to work with a graph database, particularly for those coming from a relational database background.

At a practical level, when first working with a database you want to know how to get it installed and running (which was the subject of an earlier post) and then how to do the basic CRUD operations: creating data, retrieving it, making updates, and then deleting things. The purpose of this post is just to focus on using Neo4j and the query language particular to that database, Cypher.

One of the very nice features of Neo4j 2 is that they included a very friendly way to interact with the database by just pointing your web browser at it. This is not how you will work with Neo4j at scale but when learning to use it, it is invaluable.  When working within the browser you generally enter one statement at a time. If you try to put two statements separated by a semi-colon, Neo4j will get confused.

In Neo4j you work with a property graph data model, which is to say a set of nodes with connecting edges, both of which can have properties — or attributes if you prefer — attached to them. In Neo4j nodes and edges can also have Labels, which you can think of as allowing you to define a type of node or edge. (This is in contrast to Titan, where edges can have labels, but not nodes.)

The first thing to note is that in Neo4j, you do not need to set up a schema first before you can start loading in and using the database. In fact there isn’t really a concept of a schema like you would have in a relational database, and in Cypher there isn’t an equivalent of the Data Definition Language (DDL) that we have in SQL; there aren’t equivalents for ‘create table’, ‘drop table’, ‘alter table’. If you are coming from primarily a relational database background then this feels a bit odd certainly.

Creating a Node

The example below shows how we might create a single node with a couple of attributes. Note that the indentation here is purely to enhance readability — Neo4j will process this fine if it is all on the same line.

CREATE ( :Person { name: "Alice",
                   email: "alice@wherever.com"
                 }
       )

Cypher is deliberately designed to feel like SQL. Instead of SQL’s ‘INSERT’, though, we add new nodes to the database with a ‘CREATE’ command. Now we’ll deconstruct the rest of this statement: The ‘:Person’ part is a label that allows you to identify the particular kind of node this is (the leading colon marks it as a label rather than a variable).  Unless you have a graph in which all the nodes are the same type, you will likely want to add a descriptive label here. The label needs to start with a letter but it can be composed of letters and numbers.  All the properties for this node are in curly braces as a comma-separated list of key-value pairs, with a colon separating keys from values (instead of an equals sign).

Note that there is nothing here in this statement that identifies a primary key. Nor do we have a concept of referential integrity between kinds of nodes. Enforcing uniqueness is possible using the MERGE command, which we’ll get to in a later post. Internally in Neo4j there is a unique node id and you can make use of that, but you shouldn’t rely on that much.

Just for comparison, if this was a relational database the equivalent statement in SQL would be something like:

INSERT INTO Person(name, email) VALUES('Alice', 'alice@wherever.com');
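If you generate CREATE statements from code, the property-list syntax is simple enough to render mechanically. A small sketch (the function name is mine; it assumes string-valued properties and does no escaping of embedded double quotes):

```python
def cypher_create(label, props):
    # Render CREATE ( :Label { key: "value", ... } ), with a colon
    # separating each key from its quoted value.
    pairs = ", ".join('%s: "%s"' % (k, v) for k, v in sorted(props.items()))
    return 'CREATE ( :%s { %s } )' % (label, pairs)

stmt = cypher_create("Person", {"name": "Alice", "email": "alice@wherever.com"})
```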

Searching for a Node

Now that we have a node in here, how would we search for it? In general we can search using Cypher’s MATCH statement, which you can think of as the equivalent of SQL’s SELECT. Like SELECT, we use MATCH all the time when working with Neo4j.

MATCH (k {name:"Alice"}) RETURN k

In this statement we have a ‘k’ in there before the attribute list. That is a variable that we can use elsewhere in the statement, and in fact we use it at the end in the RETURN clause to actually return the value. MATCH requires a RETURN, SET, or DELETE clause at the end, otherwise the statement is considered incomplete.

If you run this command in the browser, Neo4j 2.x will give you a D3-based visualization of your result set. You can click on the node(s) to show all the attributes. This kind of feedback makes learning and developing your graph statements in Neo4j much easier.

Our statement here returned the entire node. If you want just a particular attribute you can return that.

MATCH (k {name:"Alice"}) RETURN k.email

Updating and Deleting a Node

What if we have a node where we just want to update a value, or add another key-value pair? Again we use MATCH, but this time we end with a SET clause.

MATCH (k {name:"Alice"}) SET k.email='alice@wherever.com'

In this case we updated the email property. If we wanted to add a new property we would just list it — there is no syntactic difference between updating an existing property and adding a new one. In SQL we would first have to alter the table to add the new column before we could set a value.

Deletion is very similar to updating — we just specify to delete the node at the end instead of returning it.

MATCH (k {name:"Alice"}) DELETE k

A point on deleting nodes: Neo4j will not let you delete a node if it still has edges connected to it. You first have to delete the edges and then the node. But, as we’ll see, you can do that in one statement.
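For the curious, the one-statement form pairs the MATCH with an OPTIONAL MATCH on any connected edges. The sketch below wraps it in the JSON body that Neo4j 2.x’s transactional HTTP endpoint (/db/data/transaction/commit) expects; the tx_payload helper name is my own:

```python
import json

# OPTIONAL MATCH picks up relationships if any exist, so the edges (r)
# and the node (k) can be deleted in a single statement.
DELETE_ALICE = ('MATCH (k {name:"Alice"}) '
                'OPTIONAL MATCH (k)-[r]-() '
                'DELETE r, k')

def tx_payload(*statements):
    # Body for POST http://localhost:7474/db/data/transaction/commit
    return json.dumps({"statements": [{"statement": s} for s in statements]})

payload = tx_payload(DELETE_ALICE)
```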

Creating Relationships

So, this is a graph database. A graph database with only nodes is kinda dull and uninteresting really. So how do we create connections?

First let’s add a few more nodes to our system for demonstration purposes. We’ll also re-create ‘Alice’ since we deleted that node above. And, while we’re at it, we’ll also do this in one statement to show how to add multiple nodes at a time.

CREATE ( p0: Person { name: "Alice",
                      email: "alice@wherever.com"
                    } ),
       ( p1: Person { name: "Ezekial",
                      email: "zeke@nowhere.com"
                    } ),
       ( p2: Person { name: "Daniel",
                      email: "dan@nowhere.com"
                    } ),
       ( p3: Person { name: "Bob",
                      email: "bob@nowhere.com"
                    } )

A couple of things to note here: we added a variable before each label (p0, p1, p2, p3). When adding multiple nodes in one statement we need these to distinguish between the nodes, and as we create more involved queries their utility will become more evident.  For now, take it as read that if you reused the same variable name, Neo4j would complain that it was already declared.

Now let’s find ‘Alice’ and create a relationship to Bob.

MATCH ( p1 {name:"Alice"}), ( p2 {name:"Bob"}) CREATE (p1)-[r:IS_FRIENDS_WITH]->(p2)

So… this takes a little deconstruction. We started with our MATCH statement, but instead of retrieving just one node we retrieved two. This is where those variables — p1 and p2 — come into play. You can think of them as being kinda/sorta like aliases in SQL.

Once we find the two nodes we can create the link between them. Edges are always directed in Neo4j, and an edge is represented with the ‘start’ node followed by the relationship label and then the second node. The usual way of describing this is to think of an arrow connecting the two, as you might write it in ascii-art:  ‘(first node)-[r:RELATIONSHIP_LABEL]->(second node)’.  That ‘r’ is an arbitrary variable name; what you must specify is a single relationship type, otherwise Neo4j will give you the error ‘A single relationship type must be specified for CREATE’.

Searching on Relationships

At this point we have something that is becoming a more meaningful graph, albeit a small one. We have a few nodes and a relationship between a couple of them.

MATCH (p1 {name:"Alice"})-[r:IS_FRIENDS_WITH]->(p2) RETURN p2

Again, we use MATCH like we would use SELECT in a relational database. In this case we specify the relationship with that ‘arrow’-like syntax. You’ll notice that we specified p1 to be ‘Alice’ by specifying the attribute, but we didn’t do so for p2 — p2 is what we want to find in this query. When you run this you should see just one node returned, ‘Bob’.
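If you are querying from code rather than the browser, the same MATCH can go through the transactional HTTP endpoint. A sketch: the response shape below follows Neo4j 2.x’s documented transaction-commit format (results → data → row), and names_from_response is my own helper name:

```python
QUERY = ('MATCH (p1 {name:"Alice"})-[r:IS_FRIENDS_WITH]->(p2) '
         'RETURN p2.name')

def names_from_response(response):
    # Pull the returned values out of a /db/data/transaction/commit
    # response: one 'row' list per result row.
    rows = response["results"][0]["data"]
    return [row["row"][0] for row in rows]
```

Running QUERY against the data above should return just ‘Bob’.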

Part 1 Summary

At this point we have covered the very basic operations involved in creating, updating, and deleting nodes, and we started in on how to create and query on edges. In the next post on this topic we’ll continue the discussion of setting up edges and more involved queries.

Getting started with Neo4j

Somewhat recently I’ve been spending some time investigating graph databases, and Neo4j in particular. This first crept onto my radar a few months ago in connection with an ongoing side project I’m involved with, but at the time I didn’t have the cycles to look into it. More recently another potential application for a graph database came up in the context of my regular day job. Paying attention to the cues I’m getting from life, therefore, it seemed like the time was right to get better acquainted with graph databases. Now that I’ve spent a bit of time on this, I thought it would be worth writing about.

So far I’ve focused more on Neo4j, which has been around for a little while now. I have started to look at Titan some, and Giraph is also on my radar. I know there are other graph databases out there as well, but for the moment these three are plenty to keep one busy. For what it is worth, I have been able to get up and running most quickly with Neo4j.

Setting up the environment

Neo4j, like Titan, is a Java-based server application, so you’ll want to have Java set up on your server. There are some pretty good instructions for doing this on the Neo4j site, so mainly I’m documenting things for (and from) my own experience. In my case I was setting up Neo4j 2.1.2 to run on a Fedora Linux cloud server.

1) Install Java

You’ll need Java set up. I’ve run this okay with OpenJDK but it is probably better to run with Oracle’s Java. So download and set that up, since you’ll need it before you can do anything interesting. (In my case I wanted the full JDK, but if you just want to run things the JRE should be fine.)

To go with OpenJDK:
 % yum install java-1.7.0-openjdk.x86_64

Or with Oracle’s Java… I kinda like to get the .tar.gz and unpack this under /opt myself. You can also get the RPM and install that. In the default AWS AMI Linux, /usr/bin/java points to /etc/alternatives/java, which itself points to openjre. I just change that to point to Oracle’s Java.

% mv jdk-8u20-linux-x64.tar.gz /opt
% cd /opt
% gzip -d jdk-8u20-linux-x64.tar.gz
% tar -xf jdk-8u20-linux-x64.tar
% ln -s jdk1.8.0_20 jdk
% cd /etc/alternatives/
% mv java openjre-java
% ln -s /opt/jdk/bin/java java
% which java
/usr/bin/java
% java -version
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

If you work in Python, then while you are at it you may want to install the python library to access the database.

% yum install python-py2neo.noarch

2) Create a neo4j user on the system.

This is recommended practice I think, but isn’t super-required. Pretty much you can run under whatever user you like.

3) Install Neo4j

Go and download the Community Edition of Neo4j appropriate for your environment. (There is also an Enterprise Edition, which requires a license, although if you fill in a form on the Neo site you can play around with the Enterprise Edition for free if you are just kicking the tires, doing a student project, etc. The Community Edition is released under the GPL v3.)  Unpack this somewhere convenient… perhaps /home/neo4j or /opt/neo4j, as suits your preference.

4) System adjustments

Increase the number of open files allowed.

If you start up the Neo4j server now, you will probably get a message that looks like this:

[ec2-user@ip-172-31-22-223 neo4j]$ ./bin/neo4j start
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments:  -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Starting Neo4j Server...WARNING: not changing user
process [21385]... waiting for server to be ready....

So, edit /etc/security/limits.conf to add these lines:

neo4j        soft    nofile          40000
neo4j        hard    nofile          40000

5) Configure Neo4j

The default configuration should be fine to start with, particularly if you are just running on your laptop.  But if you are running on a server you set up somewhere you will need to adjust the configuration to have it be accessible on something besides localhost (127.0.0.1).

In the ~neo4j/conf directory there are several configuration files. Edit the neo4j-server.properties file and un-comment this line:

org.neo4j.server.webserver.address=0.0.0.0

6) Start the neo4j server.

In the neo4j bin directory you should see a file named simply neo4j.  That’s a shell script that lets you start, stop, and get the status of the server.

[neo4j@yourserver neo4j]$ ./bin/neo4j start
 Using additional JVM arguments:  -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
 Starting Neo4j Server...WARNING: not changing user
 process [2994]... waiting for server to be ready...... OK.
 http://localhost:7474/ is ready.

At this point you can bring up your web browser at the appropriate url for your server and you are good to go with working with Neo4j. The server has a very user friendly interface that allows you to start doing queries and adding in graph data. There is a lot of helpful tutorial information there right off the bat too.

[Screenshot: the Neo4j browser interface]

In the next post I’ll cover more about the graph data model and working with Cypher — Neo4j’s query language.