All posts by Peter Spangler

Bulk loading data into ElasticSearch

Every database needs a way to load multiple records at once. With MySQL you can import a csv file, for instance.  ElasticSearch has a way too, and it generally works well.  There are just a couple of important points that can trip you up when working with it.

  1. Bulk API entry point. In this example let's say we have an index called 'library' (referenced in the previous blog post on ES) and in that we have a type named, simply, 'books'.  The API url will have _bulk on the end of it.

    http://localhost:9200/library/books/_bulk

    Now, as it happens… the index and type parts of this are pretty much not needed because you can put that into the data file itself.  So all you really need is this:

    http://localhost:9200/_bulk

  2. You can do a curl POST to this endpoint and load your data file.  The second point is to use the --data-binary option to preserve newline characters.  We can also reference our actual data file using @data_file_path.  So our curl call looks like this:

    curl -XPOST 'http://localhost:9200/library/books/_bulk' --data-binary @library_entries.json
  3. Now.. about the data file itself.  This is not exactly a legal JSON format, but rather a series of json entries.  Here is an abbreviated sample:
    { "index" : { "_index" : "library", "_type": "books" }}
    { "isbn13" : "978-0553290998", "title" : "Nightfall", "authors" : "Isaac Asimov, Robert Silverberg" }
    { "index" : { "_index" : "library", "_type" : "books" } }
    { "isbn13"   : "978-0141007540", "title"  : "Empire: How Britain Made the Modern World", "authors": "Niall Ferguson" }
    { "index" : { "_index" : "library", "_type" : "books" } }
    { "isbn13" : "978-0199931156", "title" : "The Rule of Empires", "authors" : "Timothy H. Parsons" }
    
    

    You want to view the data file as pairs of lines.  The first line tells ES what to do — in this case index (ie. insert) an entry into a specified index and type.  You can also delete or update content this way.

    The second line is the actual content — the data for the entry itself.  The one trick here is that you have to have the full entry on ONE LINE.  If you try to split it up (say, to make the thing more readable) then that will confuse ES.  So one line per entry.

  4. Finally… your data file must end with a newline (ie. a final blank line)!  If it doesn't then you will likely lose the last entry.
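
Putting those points together, here is a minimal end-to-end sketch. The file name is just the one used above and the index settings are the same; adjust to taste:

# Write the action/data line pairs (the heredoc leaves a trailing newline,
# which the Bulk API requires).
cat > library_entries.json <<'EOF'
{ "index" : { "_index" : "library", "_type" : "books" } }
{ "isbn13" : "978-0553290998", "title" : "Nightfall", "authors" : "Isaac Asimov, Robert Silverberg" }
EOF

# POST with --data-binary so the newlines are preserved, then eyeball the
# "errors" flag in the response -- anything other than false deserves a look.
curl -XPOST 'http://localhost:9200/_bulk' --data-binary @library_entries.json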

External References
Bulk API | ElasticSearch [2.4]

Using a custom function to score documents in Elastic Search

At my previous position ElasticSearch formed an important part of our back-end infrastructure, and over the past year and a half we were expanding our use of ElasticSearch. It wasn’t our primary data store but it became a highly important secondary store that we used to query our data in ways that would be difficult or prohibitive in our MySQL database.

In general, the way you will interact with ElasticSearch is through the REST API. Working with ES means learning this API, just as working with a relational database means understanding SQL.  For the past year and a half I've been working with ES and have come to a reasonable understanding of it. I've been wanting to write up some of my notes and observations on ES, so this short article will probably be the first of a few such posts on working with ES.

After you have added your set of documents in ES you will of course do different kinds of queries.  ES will then retrieve the documents that match and compute a score value that is used to order the results.  When you get back your results you can see this value as _score.

What if the way ES scores documents doesn’t work for you though?  What can you do then? It turns out that there is support for providing your own scoring function for documents.

A small example

To illustrate this there is a very simplified library catalog index, the data, mapping, and queries for which are in a github project. This was run with ES 2.4.1.  (Presumably this should work with ES 5 but that is still in Alpha as of this writing and I haven’t tested it out yet.)

One of the features we used in almost all of our queries was the ability to create a custom score instead of letting ES score documents.  To illustrate the mechanics of this let's say we create an index and add several documents to it. In this example we have a small set of books in an index named 'library'.

{
  "from" : 0,
  "size" : 10,
  "query": {
    "bool": {
      "must": [ 
           { "match": { "title" : "world" } } 
      ]
    }
  }
}

When you run this query you will receive back results ordered according to the score computed by ElasticSearch.  Now that we have this basic query we’ll modify it to do our own custom formula for scoring.
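
If you are not working in a query console plugin, one way to run the query above is with curl against the index's _search endpoint (the file name title_query.json is just for illustration):

curl -XPOST 'http://localhost:9200/library/books/_search?pretty' --data-binary @title_query.json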

To introduce our own formula we add a function_score element that includes our original query and then specifies a script score.

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "world" } }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "(doc['pub_year'].value / 1000) * 2"
          }
        }
      ]
    }
  }
}

In this case our script just took the publication year, divided it by 1000, and then multiplied by 2, which is admittedly sort of arbitrary… but the point is just to illustrate the mechanics of how the query is constructed and to show that you can access other fields in the document.

Configuring ES to allow in-line scripts

If you run this query on a vanilla ES 2.4.x install then you will likely see the following error message:

{
    "error": {
        "root_cause": [
            {
                "type": "script_exception",
                "reason": "scripts of type [inline], operation [search] and lang [groovy] are disabled"
            }
        ],
        ...
    "status": 400
}

To get around this problem you need to edit the ~elasticsearch/config/elasticsearch.yml file to add a couple of lines:

script.inline: true
script.indexed: true

You will then need to restart ElasticSearch for this change to take effect.

NOTE:  This is sorta important! Making this change in ES essentially allows code to be run via a query. If you run this for real in production you will want to be very sure to have ES nicely isolated in your network and behind some other API layer that controls what queries reach it.

The Gotcha …

Now that we've done that we run into another problem. If you look at our results you will see that the actual score is not just the publication year divided by 1000 and multiplied by 2, but some other number.

"hits" : {
    "total" : 5,
    "max_score" : 2.1166306,
    "hits" : [ {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJH",
      "_score" : 2.1166306,
      "fields" : {
        "title" : [ "Empire: How Britain Made the Modern World" ],
        "pub_year" : [ 2008 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJS",
      "_score" : 1.620065,
      "fields" : {
        "title" : [ "The Russian Origins of the First World War" ],
        "pub_year" : [ 2013 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJJ",
      "_score" : 1.6184554,
      "fields" : {
        "title" : [ "Empires in World History: Power and the Politics of Difference" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJR",
      "_score" : 1.4131951,
      "fields" : {
        "title" : [ "Immanuel Wallerstein and the Problem of the World: System, Scale, Culture" ],
        "pub_year" : [ 2011 ]
      }
    }, {
      "_index" : "library",
      "_type" : "books",
      "_id" : "AVfQRXHBmdIL8FYiyaJK",
      "_score" : 1.256875,
      "fields" : {
        "title" : [ "How to Change the World: Reflections on Marx and Marxism" ],
        "pub_year" : [ 2011 ]
      }
    } ]
  }

The default behavior in ElasticSearch is to take your script score and multiply the computed _score by it. This can be somewhat unexpected since you explicitly provided a scoring function.  The solution is to set 'boost_mode' appropriately; in the case of just using our computed score, set it to 'replace'.

{
  "from": 0,
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "title": "world"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "doc['title'].size() * 2"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}


External References

Function Score Query | ElasticSearch Reference [2.4]

Enabling Dynamic Scripting | ElasticSearch Reference [2.4]


Troubleshooting SSH keys

SSH is one of those workhorse utilities that make lots of things work. Whole applications are built on top of ssh (for instance, Ansible).  Being able to set up ssh keys is, from a functional perspective, what makes this really work well. Sometimes when setting up SSH keys things fail to work and it is not at all clear why. A lot of these failures are fairly simple things that amount to user error. The problem with user error is that it can be hard to detect when you are the one making it.

First..just for completeness.. what is the procedure for setting up ssh keys?  Fortunately this process is well documented on the Interwebz. See this page for instance, which is pretty readable.  Where these pages stop is in troubleshooting (which is why I have this post here.)

So… here’s a list of things to check….

  1. When you copied your public key to the remote host, the permissions on the directory and/or file are too open. SSH requires your .ssh directory (and the authorized_keys file inside it) to be readable only by the user you are logging in as (well, and root). See the commands after this list for the usual fix.
  2. When you try to ssh into your remote server, it fails because you didn’t specify the right key name, or made a typo.
  3. You copied your public key to the server with the name ‘authorized_keys2’ and your ssh daemon was expecting ‘authorized_keys’. (Or vice-versa.)  It seems that the preferred thing now is just to use ‘authorized_keys’.  The exact name is set in the /etc/ssh/sshd_config file with the parameter name ‘AuthorizedKeysFile’.
  4. Often you will have more than one set of ssh keys on your laptop for different servers. Maybe you have one set for work related machines and another for a side project. When generating ssh keys you are prompted to give the filename, and so you do that. So… then you try connecting to your server … and it isn’t working?  “Why am I still being prompted for my password?” Well, you didn’t send the right key.  Again, your local .ssh/config is your friend here (instead of having to specify the key all the time with -i).  Add an entry IdentityFile=~/.ssh/my_server.rsa and on your laptop the appropriate ssh key will be sent to the server.
  5. Bonus points if you set up a different port for your ssh daemon to listen on. An easy mistake to make though is to neglect to use that non-standard port when you connect.  In your ssh config you can record this for the host with a simple 'Port' line.
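
A quick way to rule out the permissions problem from item 1, and to see what is actually being offered to the server (the user, host, and key names are just examples):

# On the remote host: tighten up the permissions sshd expects.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# From your laptop: connect with verbose output to see which keys are
# offered and how the server responds.
ssh -v -i ~/.ssh/my_server.rsa me@myserver.org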

Running sshd manually

If you are still having problems, and if you are able to become root and manually run the sshd daemon in debug mode then you may find that helpful. Then you can watch both the client side and server side interact and see if you get any helpful messages from the daemon.

To run the daemon manually:

/usr/sbin/sshd -p 8888 -D -d -e
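
Then, from the client, connect to that same port with verbose output so you can watch both sides of the conversation (8888 here just matches the command above; substitute your own user and host):

ssh -v -p 8888 me@myserver.org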

Update

With OpenSSH 7, DSA keys are deprecated and disabled by default.  If you suddenly find that you are prompted for a password or you just plain can't get in at all then this may be the problem.  If you run ssh with the -v verbose flag then you may see this message:

Skipping ssh-dss key .ssh/whatever_id_dsa - not in PubkeyAcceptedKeyTypes

(I happened to notice this after I upgraded to Mac OS X Sierra recently.)

So, you can override this behavior on your own machine by adding a 'PubkeyAcceptedKeyTypes' line to your ssh config…

Host myserver.org
     User me
     PubkeyAcceptedKeyTypes +ssh-dss
     IdentitiesOnly yes
     IdentityFile=~/.ssh/myserver_id_dsa
     Hostname=myserver.org
     TCPKeepAlive yes
     ServerAliveInterval 15

But probably the better thing to do is to regenerate your ssh keys and use RSA instead.  Of course you might temporarily need to add this override in order to get back into your account to update the key 😉
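
A rough sketch of that key rotation (the file name is just an example):

# Generate a new RSA key pair.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/myserver_id_rsa

# Install the new public key on the server -- this is where the temporary
# ssh-dss override above comes in handy.
ssh-copy-id -i ~/.ssh/myserver_id_rsa.pub myserver.org

# Finally, point the IdentityFile line in ~/.ssh/config at the new key and
# remove the PubkeyAcceptedKeyTypes override.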


Cypher and Neo4j: Part I

A few months ago I started working with graph databases. This post is part of a series aimed at documenting how to work with a graph database, particularly for those coming from a relational database background.

At a practical level, when first working with a database you want to know how to get it installed and running (which was the subject of an earlier post) and then how to do the basic CRUD operations: creating data, retrieving it, making updates, and then deleting things. The purpose of this post is just to focus on using Neo4j and the query language particular to that database, Cypher.

One of the very nice features of Neo4j 2 is that they included a very friendly way to interact with the database by just pointing your web browser at it. This is not how you will work with Neo4j at scale but when learning to use it, it is invaluable.  When working within the browser you generally enter one statement at a time. If you try to put two statements separated by a semi-colon, Neo4j will get confused.

In Neo4j you work with a property graph data model, which is to say a set of nodes with connecting edges, both of which can have properties — or attributes if you prefer — attached to them. In Neo4j nodes and edges can also have Labels, which you can think of as allowing you to define a type of node or edge. (This is in contrast to Titan, where edges can have labels, but not nodes.)

The first thing to note is that in Neo4j, you do not need to set up a schema first before you can start loading in and using the database. In fact there isn’t really a concept of a schema like you would have in a relational database, and in Cypher there isn’t an equivalent of the Data Definition Language (DDL) that we have in SQL; there aren’t equivalents for ‘create table’, ‘drop table’, ‘alter table’. If you are coming from primarily a relational database background then this feels a bit odd certainly.

Creating a Node

The example below shows how we might create a single node with a couple of attributes. Note that the indentation here is purely to enhance readability — Neo4j will process this fine if it is all on the same line.

CREATE ( :Person { name: "Alice",
                   email: "alice@wherever.com"
                 }
       )

Cypher is deliberately designed to feel like SQL. Instead of SQL's 'INSERT' though, we add new nodes to the database with a 'CREATE' command. Now we'll deconstruct the rest of this statement: The ':Person' part is a label that allows you to identify the particular kind of node this is (the leading colon is what marks it as a label rather than a variable).  Unless you have a graph in which all the nodes are the same type, you will likely want to add a descriptive label here. The label needs to start with a letter but it can be composed of letters and numbers.  All the properties for this node are in curly braces as a comma-separated list of key-value pairs, with a colon separating keys from values (instead of an equals sign.)

Note that there is nothing here in this statement that identifies a primary key. Nor do we have a concept of referential integrity between kinds of nodes. Enforcing uniqueness is possible using the MERGE command, which we’ll get to in a later post. Internally in Neo4j there is a unique node id and you can make use of that, but you shouldn’t rely on that much.

Just for comparison, if this was a relational database the equivalent statement in SQL would be something like:

INSERT INTO Person(name, email) VALUES('Alice', 'alice@wherever.com');

Searching for a Node

Now that we have a node in here, how would we search for it? In general we can search using Cypher’s MATCH statement, which you can think of as the equivalent of SQL’s SELECT. Like SELECT, we use MATCH all the time when working with Neo4j.

MATCH ( k{name:"Alice"}) RETURN k

In this statement we have a 'k' in there before the attribute list. That is a variable that we can use elsewhere in the statement, and in fact we use it at the end in the RETURN clause to actually return the value. MATCH requires a RETURN, SET, or DELETE clause at the end, otherwise the statement is considered incomplete.

If you run this command in the browser, Neo4j 2.x will give you a D3-based visualization of your result set. You can click on the node(s) to show all the attributes. This kind of feedback makes learning and developing your graph statements in Neo4j much easier.

Our statement here returned the entire node. If you want just a particular attribute you can return that.

MATCH ( k{name:"Alice"}) RETURN k.email

Updating and Deleting a Node

What if we have a node where we just want to update a value, or add another key-value pair? Again, we use MATCH but this time we end with a SET clause.

MATCH ( k{name:"Alice"}) SET k.email='alice@wherever.com'

In this case we updated the email property. If we wanted to add a new property we would just list it — there is no syntactic difference between updating an existing property and adding a new one. In SQL we would first have to alter the table to add a place for the new column and then we could set a value.

Deletion is very similar to updating — we just specify to delete the node at the end instead of returning it.

MATCH ( k{name:"Alice"}) DELETE k

 A point on deleting nodes — Neo4j will not let you delete a node if it still has edges connected to it. You first have to delete the edges and then the node. But as we’ll see you can do that in one statement.

Creating Relationships

So, this is a graph database. A graph database with only nodes is kinda dull and uninteresting really. So how do we create connections?

First let’s add a few more nodes to our system for demonstration purposes. We’ll also re-create ‘Alice’ since we deleted that node above. And, while we’re at it, we’ll also do this in one statement to show how to add multiple nodes at a time.

 CREATE ( p0: Person { name: "Alice",
                       email: "alice@wherever.com"
                      } ),
        ( p1: Person { name: "Ezekial",
                  email: "zeke@nowhere.com"
                } ),
        ( p2: Person { name: "Daniel",
                   email: "dan@nowhere.com"
                  } ),
        ( p3: Person { name: "Bob",
                   email: "bob@nowhere.com"
                  })

A couple things to note here: we added a variable before each person we added (p0, p1, p2, p3). When adding multiple nodes we need to add something to distinguish between these, and as we create more involved queries the utility of that will become more evident.  For now take it as read that if you omit that, Neo4j will complain that ‘Person’ was already declared.

Now let’s find ‘Alice’ and create a relationship to Bob.

MATCH ( p1 {name:"Alice"}), ( p2 {name:"Bob"}) CREATE (p1)-[r:IS_FRIENDS_WITH]->(p2)

So… this takes a little deconstruction. We started with our MATCH statement but instead of just retrieving one node we retrieved two. This is where those variables — p1 and p2 — come into place. You can think of them as being kinda/sorta like aliases in SQL.

Once we find the two nodes we can create the link between them. Edges are always directed edges in Neo4j, and the edge is represented with the 'start' node followed by the relationship label and then the second node. The usual way of describing this is to think of an arrow connecting the two, as you might write it in ascii-art:  '(first node)-[r:RELATIONSHIP_LABEL]->(second node)'.  That 'r' is arbitrary but you do need a variable there, otherwise Neo4j will give you an error 'A single relationship type must be specified for CREATE'.

 Searching on Relationships

At this point we have something that is becoming a more meaningful graph, albeit a small one. We have a few nodes and a relationship between a couple of them.

MATCH (p1 {name:"Alice"})-[r:IS_FRIENDS_WITH]->(p2) RETURN p2

Again, we use MATCH like we would use SELECT in a relational database. In this case we specify the relationship with that ‘arrow’-like syntax. You’ll notice that we specified p1 to be ‘Alice’ by specifying the attribute, but we didn’t do so for p2 — p2 is what we want to find in this query. When you run this you should see just one node returned, ‘Bob’.
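
All of the statements above were typed into the Neo4j browser. When you eventually want to run them from a script instead, the same Cypher can be POSTed to the transactional HTTP endpoint. A minimal sketch, assuming a local 2.x server on the default port with authentication not required:

curl -X POST http://localhost:7474/db/data/transaction/commit \
  -H 'Content-Type: application/json' \
  -d '{"statements": [ { "statement": "MATCH (p1 {name:\"Alice\"})-[r:IS_FRIENDS_WITH]->(p2) RETURN p2.name" } ]}'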

Part 1 Summary

At this point we have covered the very basic operations involved in creating, updating, and deleting nodes, and we started in on how to create and query on edges. In the next post on this topic we'll continue the discussion on setting up edges and more involved queries.

Getting started with Bukkit plugin development

This post pertains specifically to Bukkit, which is no longer available due to a DMCA takedown request, and the future of Bukkit is somewhat in question now unfortunately.

Getting started developing in any platform always has a learning curve as you find your way around the language(s), tools and conventions involved. Developing plugins for Bukkit is no different. There are a number of good tutorial resources on this topic, so rather than do a full write-up we’ll link to a couple of tutorials and then add some additional comments to outline the process and identify some of the things to watch for — the ‘gotchas’ of plugin development.

The general process for getting started can be enumerated in these steps:

Install the following tools to set up your development environment.

  • Download Oracle’s Java SE JDK (the JRE alone is not sufficient). Go ahead and install that first since you will need this to run Eclipse.
  • Download and install Eclipse (if you are already set up to do Java development with another IDE, such as NetBeans, that’s great. In this post we’ll focus on Eclipse though.)
  • Get the Bukkit jar file. (craftbukkit-1.7.10-R0.1-20140804.183445-7.jar) [Which you can’t get anymore]

You will find several tutorials (both documents and videos) with a bit of searching online. But two pretty reasonable tutorials to get your started are here:

Bukkit Coding for Dummies — a 10 page Google Doc covering plugin development using Eclipse. This has some useful screen shots and code examples.

How to make a Bukkit plugin — This site has several similar examples, also with screen shots and code examples. To pick one tutorial, take a look at Part 10 on invisibility and custom books.

These tutorials will give you the details. For the sake of having a more general overview though, the steps for creating a plugin are basically these:

1. Launch Eclipse and select your workspace — some directory on your computer that you can let Eclipse more or less have free rein in. Eclipse will use that directory to store your projects and its own metadata.

2. Create a new Java Project.

3. Add a new Package. This is the namespace your Java source files will live in. The convention in the world of Java is to name packages using a reverse-url, so you might use com.myfullname.bukkitone. (And no, you don't need to really have the name registered or anything like that… although it is a nice touch if you do.)

4. Write up your plugin (see the tutorials above for more details there.) Be sure to import the appropriate libraries as needed.

5. Create the YML file for your plugin. This is where you will list the commands your plugin recognizes.

6. Compile your code into a JAR file. Select your project in the Package Explorer and then Select Export -> JAR File. Uncheck .classpath and .project from the resources to export, and select the location for Eclipse to save your jar file. You can click through to the next couple of screens but the defaults after this should be fine. You can also just hit ‘Finish’.

7. Copy your jar file to your Minecraft server’s plugin directory and start up Minecraft. Minecraft should recognize the plugin and load it. You should see this if you tail the logs/latest.log file.
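
That last step is just a file copy plus watching the log; something along these lines, where the jar name and server path are placeholders for your own setup:

# Copy the freshly exported jar into the server's plugins directory.
cp MyFirstPlugin.jar /opt/minecraft/plugins/

# Restart the server, then watch the log for the plugin being loaded.
tail -f /opt/minecraft/logs/latest.log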

Things that may trip you up….

Q1) I can’t compile my plugin but my code looks right? Except that Eclipse doesn’t recognize JavaPlugin and a bunch of other names?
A1) You probably don’t have the craftbukkit jar file loaded, or it is somewhere that Eclipse can’t find it (this may be the case if you cloned a plugin from a Git repo and the Eclipse project is referencing a jar file in a path that doesn’t exist on your computer.) You can correct that by simply adding the jar file (which you should have downloaded by now.)
  1. Go to your projects properties and go to the Java Build Path section.
  2. Click on the Libraries tab.
  3. Click on the Add External Jars button and find the craftbukkit jar file on your computer.
Q2) Why is my plugin not running on my server? It runs okay on my laptop?
A2) This is generally due to a mismatch between the version of Java that your plugin was compiled for and the version you are running on your server. Particularly if you see a message in your error log that looks something like:
Could not load ‘plugins/MyPluggin.jar’ in folder ‘plugins’
org.bukkit.plugin.InvalidPluginException java.lang.UnsupportedClassVersionError de/nofear13/craftbukkituptodate/CraftBukkitUpToDate : Unsupported major.minor version 51.0
You can’t load a plugin compiled with Java8 on a server where you are running Bukkit with java7. So what are you supposed to do if you can’t control which version of java is running on the server?
The short answer is that you have to compile your plugin to target a Java version no newer than the one running on the server; targeting Java 6 is a safe lowest common denominator. In Eclipse you should do the following:
  1. Go to your projects properties and go to the Java Compiler section.
  2. Uncheck ‘Use compliance from execution environment…’
  3. Set Compiler Compliance Level to 1.6
After that try re-building your jar file and try that.
Q3) In my java code I added a command, but I can’t invoke the command?
A3) Besides the Java source you need to list commands in the plugin's yml file (a lot of examples don't note this at all).

Getting started with Neo4j

Somewhat recently I've been spending some time investigating graph databases, and Neo4j in particular. This first crept onto my radar a few months ago as it relates to an ongoing side project I'm involved with, but at the time I didn't have the cycles to look into it. More recently another potential application for a graph database came up in the context of my regular day job. Paying attention to the cues I'm getting from life, therefore, it seemed like the time was right to get better acquainted with graph databases. Now that I've spent a bit of time on this I thought it would be worth writing about it.

So far I’ve focused more on Neo4j, which has been around for a little while now. I have started to look at Titan some, and Giraph is also on my radar. I know there are other graph databases out there as well, but for the moment it seems that looking at these three are plenty to keep one busy. For what it is worth, I have been able to get up and running more quickly with Neo4j.

Setting up the environment

Neo4j — like Titan — is a java-based server application, so you'll want to have java set up on your server. There are some pretty good instructions for doing this on the Neo4j site too actually, so mainly I'm documenting things for (and from) my own experience. In my case I was setting up Neo4j 2.1.2 to run on a Fedora Linux cloud server.

1) Install Java

You'll need Java set up. I've run this okay with OpenJDK but it is probably better to run with Oracle's Java, so download and install that first since you'll need it anyway before you can do anything interesting. (In my case I wanted the full JDK but if you just want to run things the JRE should be fine.)

To go with OpenJDK....
 % yum install java-1.7.0-openjdk.x86_64

Or with Oracle’s Java… I kinda like to get the .tar.gz and unpack this under /opt myself. You can also get the RPM and install that. In the default AWS AMI Linux, /usr/bin/java points to /etc/alternatives/java, which itself points to openjre. I just change that to point to Oracle’s Java.

% mv jdk-8u20-linux-x64.tar.gz /opt
% cd /opt
% gzip -d jdk-8u20-linux-x64.tar.gz
% tar -xf jdk-8u20-linux-x64.tar
% ln -s jdk1.8.0_20 jdk
% cd /etc/alternatives/
% mv java openjre-java
% ln -s /opt/jdk/bin/java java
% which java
/usr/bin/java
% java -version
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

If you work in Python, then while you are at it you may want to install the python library to access the database.

% yum install python-py2neo.noarch

2) Create a neo4j user on the system.

This is recommended practice I think, but isn’t super-required. Pretty much you can run under whatever user you like.

3) Install Neo4j

Go and download the Community Edition of Neo4j appropriate for your environment. (There is also an Enterprise Edition, but you need a license for that, although if you fill in a form on the Neo site you can play around with the Enterprise Edition for free if you are just kicking the tires, doing a student project, etc. The Community Edition is released under the GPLv3.)  Unpack this somewhere convenient… perhaps just /home/neo4j or /opt/neo4j, as suits your preference.

4) System adjustments

Increase the number of open files allowed.

If you were to start up the Neo4j server you will probably get a message that looks like this:

[ec2-user@ip-172-31-22-223 neo4j]$ ./bin/neo4j start
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments:  -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Starting Neo4j Server...WARNING: not changing user
process [21385]... waiting for server to be ready....

So, edit /etc/security/limits.conf to add these lines:

neo4j        soft    nofile          40000
neo4j        hard    nofile          40000
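
The new limits only apply to fresh login sessions, so after logging the neo4j user out and back in you can check that the change took effect before starting the server (run this as root):

# Should print 40000 rather than the default 1024.
su - neo4j -c 'ulimit -n'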

5) Configure Neo4j

The default configuration should be fine to start with, particularly if you are just running on your laptop.  But if you are running on a server you set up somewhere you will need to adjust the configuration to have it be accessible on something besides localhost (127.0.0.1).

In the ~neo4j/conf directory there are several configuration files. Edit the neo4j-server.properties file and un-comment this line:

org.neo4j.server.webserver.address=0.0.0.0

6) Start the neo4j server.

In the neo4j bin directory you should see a file named simply neo4j.  That’s a shell script that lets you start, stop, and get the status of the server.

[neo4j@yourserver neo4j]$ ./bin/neo4j start
 Using additional JVM arguments:  -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
 Starting Neo4j Server...WARNING: not changing user
 process [2994]... waiting for server to be ready...... OK.
 http://localhost:7474/ is ready.

At this point you can bring up your web browser at the appropriate url for your server and you are good to go with working with Neo4j. The server has a very user friendly interface that allows you to start doing queries and adding in graph data. There is a lot of helpful tutorial information there right off the bat too.

[Screenshot: the Neo4j browser interface]

In the next post I’ll cover more about the graph data model and working with Cypher — Neo4j’s query language.


Failing to write out a file from MySQL — a tale of systems behavior

Sometimes you get into a situation where two separate issues conspire to drive you to question your sanity. For me this recently happened when I was trying to do a pretty simple operation: I just wanted to write out a result set to a .csv file from MySQL into the local file system. The problem (initially, anyway) was that I was logged into MySQL and ran the command, which worked without complaint… but when I went to look in the /tmp directory for my file… nada.

So, I tried the command again just to see what would happen. Now MySQL complained (quite rightly) that the file already existed, yet I still didn't see it.

At this point the problem was that this particular machine I was on had a newer Fedora Linux install on it and one of the changes made was to add private tmp directories. It looks like that feature was added in Fedora 16 and they expanded the list of services that by default use private tmp directories in Fedora 17. Basically a subdirectory under /tmp will get created with some random string, and that’s where, in this case, my file was getting written to. There’s definitely a logic to that, but if you weren’t aware of that and/or didn’t have that in mind it is an easy thing to miss. And truth be told other folks have written about this in various places (such as this discussion post and this blog post I found later.) At least it wasn’t just me…

Initially I didn't realize the problem was the private tmp directories though. My next step was just to try to write out my .csv file to a couple of other directories on the system. I set up a /mytmp directory and a /usr/local/mytmp with appropriate permissions, but MySQL complained that it couldn't write to them. Huh? At this point I was confused… what was going on?

So, problem number two was that I didn’t realize SE Linux was enforcing limitations on where services, such as the MySQL daemon, could write to. This message gets logged to /var/log/messages as well, which is how I identified what was going on here. (This too has been commented on in this discussion post and elsewhere.)

In this particular case, the solution for me was to put SELinux into permissive mode, which then allowed MySQL to write out my file somewhere another script could pick it up.
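
For reference, this is roughly how the SELinux side of the investigation looks (the exact denial message will vary, and this assumes auditd is running):

# See whether SELinux is enforcing.
getenforce

# Look for the denial logged when mysqld tried to write the file.
sudo grep 'denied.*mysqld' /var/log/audit/audit.log

# Temporarily switch to permissive mode (does not survive a reboot).
sudo setenforce 0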

At the end it was clear that these two problems, although separate, did converge in my particular use case. This is an example of how apparently pathological behavior in a system can be the result of different components each doing exactly what they are supposed to do. Once you understand each component and its behaviors, the system makes more sense.

A peek into the world of game developers

It can be interesting to see what the folks involved in another industry are discussing. This week the Game Developers Conference has been going on in San Francisco, and while I didn't attend GDC (seeing as I don't work in the games industry) I did head up to check out a couple of events taking place around GDC.

The first was an unconference taking place at the Yerba Buena Center for the Arts called Lost Levels. This was the first unconference I've really attended and the format was pretty interesting. The venue was the lawn area at YBCA, on which three large tarps were placed to designate three 'session areas' (my term — not theirs, as far as I know). People signed up on a board to hold forth on some topic for around ten minutes. The organizers of the event would time speakers and call up the next speaker in turn while people gathered around one session area or another to listen in. By and large this was a simple system that seemed to work quite well — people could have their say on a topic and if others wanted to talk with them more they could follow up afterwards.

So, what can I say about the actual talks given at Lost Levels? There was a wide variety of talks and there was no way I was going to be able to hear all or even most of them, so I definitely do not want to give the impression that this was what the whole unconference was like — these are just the talks that piqued my interest.

  • At least a couple of speakers discussed interactive fiction (or, perhaps slightly more inclusive, interactive narrative). The impression I got was that this was being ignored by most of the game development community — a point that I’m not informed enough to comment on.
  • Another interesting talk challenged the limitations of character identities in games. Often a player is encouraged to create and identify with a character in a game, but at the same time there are often very few ways that players can really modify their character. The example brought up was gender, which is often simply a male or female option. By now we should know that is pretty limiting for a lot of folks. The speaker challenged game developers to build in more flexibility for players.
  • Another theme that came up in at least a couple of talks was the idea of non-competitive or non-confrontational games. Probably for most gamers the whole idea of a game is defined in some way by competition and conflict. There are games that do not necessarily have this element, and the speakers were encouraging gamers to pay more attention to these and game creators to think about making more non-combative games.
  • Perhaps the most interesting talk I heard was on how the gaming industry operates. This is a demanding industry that is notorious for burning out game developers. Driving that is a rather ruthless economics where a game is considered a loss if it hasn’t reached a certain level of sales. Some of the things the speaker was referencing flew by me but no doubt they would have been understood by others in the audience.

A number of the participants were clearly involved in making games at some level: as game designers, artists, programmers, art directors, or project managers, in small startups or in larger, more established companies.

The second event was a smaller session, The Future of Games and Entertainment, at Swissnex San Francisco. This was a good event to meet others and do a bit of networking. Chuck Eyler gave a talk drawing on his own career in the film and gaming industries. On display were several interesting games that visitors could play.

While most of the games there were played on PCs or iPads, the main thing I took away from the event were the games that used different ways for players to interact with them. One game used small boats on a shallow pool of water, onto which was projected a series of dots surrounding the boats. Each player fired their boat's gun by blowing into a small round metal tube on their end of the game board. Another interesting game — although I'm not sure I'd call it a game exactly — was a picture where the subject (in this case a small child) 'wakes up' when a viewer approaches it and starts to mimic the facial expressions of the viewer in a manner that was both interesting and disturbing at the same time. I'm not sure if this qualifies as an 'uncanny valley' type of experience, but it reminded me of that.

Games have by now become a huge thing — not just as an industry but in terms of our collective culture. I think it’s safe to say that gaming produces new sub-cultures. This has been building for a long while. There are clearly a lot of people thinking about the issues that are arising in gaming. There is a good amount of self-reflection starting to take place. Yet, I can’t help but get the impression from my day attending these events that there is a profound lack of theory here. Really I should say that there is a lack of familiarity with existing theories from the domains of cultural anthropology, economics, and so forth, as well as a lack of theories about the nature of gaming itself. Perhaps it is that it is still too early for that to have happened. Admittedly, it could be happening but it wasn’t evidenced by my day being a tourist on the fringes of the world of gaming.

Setting up and using Cron

Cron jobs are a mainstay of Unix and Linux. Cron basically lets you run a shell script or other program according to some schedule. The system has been around forever and it just plain works… even if it is just a little on the cryptic side. It is also ubiquitous and widely used; a common use is to automate backups, for instance. When administering a server I would say it is one of the things to think about when setting up a system: what kinds of tasks will you need to run, and when? (This will, of course, evolve over time. But still….)

Pretty much every implementation of Linux/Unix does the same thing. There is a cron daemon that goes through all the cron files on the system and checks every minute to see what it needs to run. You can't really do cron jobs any more precisely than at the minute level — there is no way to say 'run a script at 5:33:45 on Monday', for instance (and why would you ever want to do that anyway?)

Basic Usage

Cron operates on crontab files. You don’t access these files directly to view or edit them. The system maintains these in some location. Whenever you want to do anything with a crontab file, you use the crontab command.

At the command line you can view your crontab with crontab -l. This simply prints out your crontab to stdout. If you haven’t set one up yet then you’ll most likely see a message saying something like ‘crontab: no crontab for you’ (where ‘you’ is your username on the system, of course.)

In general every user with an account has access to setting up cron jobs. To do this, run ‘crontab’ with the -e flag to edit your crontab file. That will kick you into your editor (whichever it is set to, or vi by default.) Initially this will be empty but here we’ll show you how to set that up.

[me@myserver ~]$ crontab -e

However, if you don’t want to edit the crontab file interactively, you can also edit the file in whatever editor you choose and set it up with the crontab command by giving the filename. Usually when doing that I check that I loaded in what I thought by running crontab -l right after.

[me@myserver ~]$ crontab my_crontab.txt

In a crontab file lines that start with a hash mark (‘#’) are comments. The cron program will ignore those lines.

One entry in a crontab file is set up per line. This is a space-delimited list that has the information about what program or script to run and how often. The format is:

mm hh dd tt ww command

Where each field is defined as such:

mm Number of minutes past the hour to run the cron job.

(having the minutes field first is usually the first thing that throws you off. We’re so used to using the hour figure first whenever we talk about time.)
hh Hour of day to run the cron job.
dd day of month (1 to 31)
tt month (1 to 12)
ww day of week (0 to 6, with 0=Sunday, 1=Monday, etc.)
command the absolute path to the script/program to run, along with any redirection to send output wherever.

The basic usage is pretty straight-forward. For example, we might want to run a script at 10pm every night…

0 22 * * * /backup/bar.sh > /dev/null 2>&1

Where cron can get a bit cryptic is when you want to run things multiple times a day or every few days. To run a command at two different times, say 10am and 10pm, you could put each hour value separated by a comma:

# Run foo.sh at 10:15am and 10:15pm every night and send the output to
# /dev/null (ie. throw it away)
15 10,22 * * * /home/me/foo.sh > /dev/null 2>&1

Or to run a command every four hours you would use an asterisk, but then specify the multiple of ‘every four hours’ with ‘/4’.

# Run foo.sh every 4 hours
0 */4 * * * /home/me/foo.sh > /dev/null 2>&1

You can also specify a range of values with a hyphen. In practice this probably only ever makes sense in the column indicating which days of the week to run a script.

# Run bar.sh at 5:30pm every workday (mon-fri), and save the output to a logs directory…
30 17 * * 1-5 /home/me/bar.sh >> /home/me/logs/bar.log 2>&1

Version control of your crontab files

Since you can both load a crontab from a text file as well as list it out easily, it is not too much work to set up crontab files under a version control system (ie. git or subversion). Then as you add more things to your crontab you’ll have a log of these changes over time.
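
A minimal version of that workflow might look like this (using git, in a repository you have already set up; the file name is arbitrary):

# Dump the current crontab into the repo and commit it.
crontab -l > my_crontab.txt
git add my_crontab.txt
git commit -m "Add nightly backup job"

# Later, after editing the file, load it back in and double-check.
crontab my_crontab.txt
crontab -l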

Considerations for scripts run via cron

There’s nothing to say you can’t run a compiled program with cron — we do this often enough. But probably the main use of cron is to run a script of some kind — BASH, Python, Perl, PHP scripts can all be run via cron.

1. At the risk of stating something kinda obvious, your script/program should start up, do its thing, and exit. It shouldn't hang around running forever. If you want a daemon, set up a daemon (that's a future blog post.)

2. Your script should use absolute paths — you can’t assume the script is run from your home directory, for instance.

3. Some jobs might take a long time to run. You should take this into account so that one invocation of your script does not overlap the next invocation by cron. Putting in a check of some kind — a sentinel or lock file — to see if your script is still running can be effective. Or, if they do overlap, that each instantiation of your script can coexist happily with the others.
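
One common way to handle the overlap problem from point 3 is to wrap the job in flock, which ships with util-linux on most Linux systems; the lock file path is arbitrary. With the -n flag a run is simply skipped if the previous one still holds the lock:

# Run foo.sh every 5 minutes, skipping a run if the previous one is still going.
*/5 * * * * /usr/bin/flock -n /tmp/foo.lock /home/me/foo.sh > /dev/null 2>&1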

*** Update: (2013-03-15) Just came across a link to Chronos, which is a replacement for Cron developed by the folks at AirBnB. I haven’t looked at it much yet but wanted to throw in a link to that here.

Which programming language/tool/other technology to learn?

When I was an undergrad studying computer science I remember walking with a couple of friends and fellow students through one of the buildings on campus one night. As it happened it was the building where the computer science department was located. We were (probably) heading to one of the computer labs, and as it happened the chair of the department had come out of the elevator and was on his way out of the building. We said hello and he then decided to take the moment to share a bit of advice, which was to learn Java. That was it, fairly simple. I think at the time we looked at each other and more or less shrugged our shoulders and got on with things.

Somehow I've remembered that bit of advice from our department chair since then (although certainly I didn't act on it at the time in a meaningful way). That was about twenty years ago — perhaps 1993 or 1994 — and at the time Java was still a pretty new language, and none of our regular computer science courses was using it. With a few exceptions most of the programming I was doing in my classes as an undergrad was in C. My introductory programming class in college used a language called Modula-2, which probably very few remember now (and I believe that was the last semester they taught that class with Modula-2, switching to C afterwards.) There was also a course on programming language design that introduced us to Scheme (a close cousin of LISP). The point though was that the department chair recognized that this was going to be an important language and was worth investing the time to learn.

When getting started in this business there is an incentive to try to learn a lot of different languages or development tools. After all, the more one knows, the more potentially valuable one is when hitting the job market. Or at least so we might be inclined to think — a point I'll return to later. But there are any number of things one might spend time learning and only a subset of those will prove to actually be useful. This is a classic problem of course — how do you decide what to focus on and what to set aside? A couple of years ago when I was teaching I had a student more or less ask this question as well. I think since I was an undergrad the problem has become much more challenging. There are simply more languages and applications in active use now. We have more choices for databases, and mobile app development didn't even exist ten years ago.

Trying to decompose this problem is worth doing I think. For any given language, tool, or other bit of technology we might pose a few questions:

How widely used is it now?

A widely used technology can be comforting to get into since you may assume there will be more resources to help get up to speed with it, as well as a broader community to tap into. If we are talking about programming languages, then you might look to see which ones there seems to be more demand for. The Jobs Tractor Language Trends – February 2013 report for instance shows Java and PHP being more popular now. But a narrow market segment isn’t necessarily bad either. In some niche areas a less widely used language or technology might be the dominant one in that area. Also, new technologies have to start out somewhere and can sometimes take a while to find their audience, which leads to the second point.

Potential for growth?

Besides current popularity another variable is what kind of growth might be expected for people that know Technology X.

Don’t confuse how widely used something is with demand — they are related but not the same thing. There’s still a market for COBOL and Fortran developers, for instance.

Open vs. Proprietary

Right now there is a division between open technologies and closed, proprietary ones. With proprietary technologies you can expect your investment cost to go up if only because of the need to acquire the necessary software (and license).

Getting into mobile app development is a good case study here. Becoming an iPhone developer, for example, requires a certain up-front investment: you need a) a Mac to run Xcode and the whole development environment, b) an iPhone, and c) to join Apple's iPhone developer program. However, that's a popular platform to develop for (at the risk of understatement) so yeah, it's pretty compelling. It should be said, given this example, that becoming an Android developer isn't free either — but you can get by with a less expensive computer with Linux and you can get Java for free.

Is it something you are interested in?

I think there's something to be said for pursuing things you actually are interested in versus things that you think will be good to know, but are otherwise not that into. You'll generally do better at things that you are intrinsically inclined towards. At the end of the day I think this is the factor to weigh most.

Avoid spreading yourself too thin

Finally, to get back to the point about trying to 'learn everything', at a certain point I think it is important to recognize that you can't possibly do that, and certainly not all at the same time. There is likely an added bonus that comes from having experience with a variety of tools — seeing how different languages handle the same or similar problems can be insightful, for instance — but that kind of knowledge comes over time. Over the span of a career your primary tools are going to evolve anyway.

The list above is just an attempt to try to sketch out how to think about this problem in some systematic way; I certainly don’t think I have any definitive answers here. What I can say is that, at least for myself, I’ve decided that there are certain areas I don’t see myself investing time to learn things and instead focus on other areas. I’m less inclined to get into .Net and Windows development at this point in my career, for instance, since that would be a pretty significant switch from where my current skill set is, and quite frankly I’m not as interested in it. This is certainly not to disparage .Net and Windows development — it’s just a choice of where to spend resources.