I started playing with Neo4j (the graph database). It seems like great software, but I found the documentation pretty sparse and opaque. It took me 16 hours to work all of this out, so I’m hoping it will save you 15.5 hours of fumbling around the Internet.

Why is Neo4j powerful?

“A traditional relational database may tell you the average age of everyone in this pub, but a graph database will tell you who is most likely to buy you a beer.”

Andreas Kollegger


To install on a Mac, use Homebrew in the terminal:

“brew install neo4j”

Start the server.

“neo4j start”

Note: stop the server with the “neo4j stop” command

The database lives in the “/usr/local/Cellar/neo4j/1.9.4/libexec/data/graph.db/” directory. If you delete the files in there, you’ve deleted your database (and can repopulate it). I’ve nuked my database a few times while debugging. Note that Homebrew installs Neo4j version 1.9.4, while the latest version is currently 2.0.
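If you end up resetting the database as often as I did, a small helper can empty the directory for you. This is only a sketch: `clear_directory` is a name I made up, the path in the comment is the Homebrew location mentioned above (yours may differ), and I’m assuming you stop the server with “neo4j stop” before running it.

```python
import os
import shutil
import tempfile


def clear_directory(path):
    """Delete everything inside `path` but keep the directory itself.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)   # sub-directory: remove it recursively
        else:
            os.remove(full)       # regular file or symlink
        removed.append(name)
    return sorted(removed)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real database.
    # For the real thing (assumed Homebrew path, server stopped first):
    #   clear_directory("/usr/local/Cellar/neo4j/1.9.4/libexec/data/graph.db/")
    demo = tempfile.mkdtemp()
    open(os.path.join(demo, "some_file"), "w").close()
    os.mkdir(os.path.join(demo, "some_dir"))
    print(clear_directory(demo))
    print(os.listdir(demo))
    os.rmdir(demo)
```

After clearing the directory, start the server again and rerun your import script to repopulate.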

Getting Data In

I tried using Gephi and the Neo4j plugin that is available. I wasn’t quite able to get the data to look the way I wanted. I would have loved to use this plugin, but I just couldn’t figure out how to get everything to play together nicely. Plus, whenever I’d export the Neo4j database, it would corrupt the files and the server would crash. I’m sure this had more to do with what I was doing and less with the software, though.

There is a Python wrapper you can download. Two JavaScript wrappers exist, but I haven’t used either of them (maybe in the future if I create a front-end). Install the Python library using the typical commands:

“pip install neo4jrestclient” or “easy_install neo4jrestclient”

Here is my code to get the data into the database (Python):

# This program takes the PPI data from ConsensusDB and populates the NEO4J database with it
# By Nikhil Gopal
# To run: python populate_running_db.py ConsensusDB_Human_PPI nodes_list.txt
# The code to generate the nodes_list.txt file exists here: https://gist.github.com/ngopal/9164294

import sys
from itertools import combinations
from neo4jrestclient.client import GraphDatabase

# Connect to graph database
gdb = GraphDatabase("http://localhost:7474/db/data/")

# Create a dictionary where gene names correspond to node objects
# (None means the node has not been created in the database yet)
node_objects = {}
with open(sys.argv[2], 'r') as nodes_file:
	for i in nodes_file:
		if 'Label' not in i:  # skip the header row
			node_objects[i.strip('\r\n')] = None

# Go through the PPI file
with open(sys.argv[1], 'r') as ppi_file:
	for i in ppi_file:
		if '#' not in i:  # skip lines containing '#' (assumed to be header/comment lines)
			line = i.strip('\r\n').split('\t')
			genes = line[2].replace('_HUMAN','').split(',')
			score = line[3]
			# create various combinations of genes and confidence scores
			# i.e. [gene1, gene2, gene3] becomes [(gene1,gene2), (gene1,gene3), (gene2,gene3)]
			# and each combination is associated with score1
			for gene1, gene2 in combinations(genes, 2):
				if node_objects[gene1] is None:
					node_objects[gene1] = gdb.nodes.create(name=gene1) #node object has a property 'name' set to the value of gene1
				if node_objects[gene2] is None:
					node_objects[gene2] = gdb.nodes.create(name=gene2) #node object has a property 'name' set to the value of gene2
				# node object has a relationship 'interacts' with another node object, and the relationship has a property set to
				# the value of score
				node_objects[gene1].relationships.create("interacts", node_objects[gene2], confidence=score)
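To see what the pairwise expansion does without touching the database, here is a small standalone sketch of just that step. The function name `expand_complex` and the sample line are made up for illustration; the only thing I’m assuming about the file is what the script above uses, namely that the third tab-separated column holds the comma-separated participants and the fourth holds the confidence score.

```python
from itertools import combinations


def expand_complex(line):
    """Turn one tab-separated PPI line into (gene1, gene2, score) tuples."""
    fields = line.strip('\r\n').split('\t')
    genes = fields[2].replace('_HUMAN', '').split(',')
    score = fields[3]
    # Every unordered pair of participants becomes one relationship
    return [(g1, g2, score) for g1, g2 in combinations(genes, 2)]


# Made-up example line; the first two columns are placeholders
sample = "col0\tcol1\tAAA_HUMAN,BBB_HUMAN,CCC_HUMAN\t0.9\n"
for edge in expand_complex(sample):
    print(edge)
# ('AAA', 'BBB', '0.9')
# ('AAA', 'CCC', '0.9')
# ('BBB', 'CCC', '0.9')
```

A complex with n participants produces n*(n-1)/2 relationships, so large complexes add a lot of edges. Note also that the score stays a string, just as it does in the import script.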

We can watch the number of nodes and relationships grow.

Screenshot 2014-02-22 15.45.18

Now we can run a basic query on the data from node to node using Cypher. Here is the code:

START source=node(10), destination=node(144)
MATCH p = allShortestPaths(source-[r:interacts*..3]->destination)
RETURN p

And here is the output:

Screenshot 2014-02-22 15.46.28

And another example:

Screenshot 2014-02-22 15.47.04
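If it helps to see what that query is asking for, here is a rough plain-Python sketch of the same idea: find every shortest directed path between two nodes, giving up after three hops. This is a simple breadth-first search that I wrote for illustration, not how Neo4j actually implements allShortestPaths, and the function name, node names, and edge list are all made up.

```python
def all_shortest_paths(edges, source, destination, max_hops=3):
    """Return every shortest directed path from source to destination
    that uses at most max_hops edges (empty list if there is none)."""
    # Build an adjacency list from (start, end) pairs
    graph = {}
    for start, end in edges:
        graph.setdefault(start, []).append(end)

    frontier = [[source]]   # all partial paths of the current length
    visited = {source}      # nodes reached at an earlier level
    for _ in range(max_hops):
        found = []
        next_frontier = []
        for path in frontier:
            for neighbor in graph.get(path[-1], []):
                if neighbor == destination:
                    found.append(path + [neighbor])
                elif neighbor not in visited:
                    next_frontier.append(path + [neighbor])
        if found:
            return found    # everything found at this level is equally short
        # Mark nodes only after the whole level is processed, so that
        # equally short paths through the same node are all kept.
        visited.update(path[-1] for path in next_frontier)
        frontier = next_frontier
    return []


# Hypothetical 'interacts' edges
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(all_shortest_paths(edges, "A", "D"))
# [['A', 'B', 'D'], ['A', 'C', 'D']]
print(all_shortest_paths(edges, "A", "E", max_hops=2))
# []
```

The second call comes back empty because E is three hops away from A, which is the same effect the `*..3` limit has in the Cypher query when no path is short enough.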

I’m planning on eventually visualizing this data in D3.