NoSQL Zone is brought to you in partnership with:

Max De Marzi, is a seasoned web developer. He started building websites in 1996 and has worked with Ruby on Rails since 2006. The web forced Max to wear many hats and master a wide range of technologies. He can be a system admin, database developer, graphic designer, back-end engineer and data scientist in the course of one afternoon. Max is a graph database enthusiast. He built the Neography Ruby Gem, a rest api wrapper to the Neo4j Graph Database. He is addicted to learning new things, loves a challenge and finding pragmatic solutions. Max is very easy to work with, focuses under pressure and has the patience of a rock. Max is a DZone MVB and is not an employee of DZone and has posted 60 posts at DZone. You can read more from them at their website. View Full User Profile

The Power of Open Source Software

01.07.2014
| 3403 views |
  • submit to reddit

opensource-400

One of the benefits of Open Source Software is that if you want to change how something is done, you can. At Neo Technology, we have a small team of “Field Engineers” who don’t really work ON the product but rather WITH the product. We help our customers with issues of all kinds, answer questions, give suggestions and whatever we need to do to make people’s project successful. A little while back I had a support ticket for a traversal that was taking longer than they hoped it would.

Think about a social network, one of the things you may want to do is tell the user how big their friends network is. But why stop there? How about their friends of friends or even friends of friends of friends network? These are the kind of questions graph databases excel at compared to relational databases. Let’s take a look at what they were doing:

@GET
@Path("/network/{id}")
public Response getNetworkSize(@PathParam("id") Long id, @Context GraphDatabaseService db) throws IOException {
    Node user = db.getNodeById(id);
    final Evaluator depth = Evaluators.toDepth(4);
    final TraversalDescription traversalDescription = new TraversalDescriptionImpl().
             breadthFirst().evaluator(depth).relationships(FRIENDS, Direction.OUTGOING);
     
    final Integer[] sizes = new Integer[] { 0, 0, 0, 0, 0 };
 
    for (final org.neo4j.graphdb.Path path : traversalDescription.traverse(user))
         sizes[path.length()]++;
    }
 
    return Response.ok().entity(objectMapper.writeValueAsString(sizes)).build();
}

What you are looking at is an unmanaged extension that takes a node id parameter and uses that node as the starting point of a traversal. The Traversal is going to traverse breadth first along the FRIENDS relationship up to a depth 4 hops and at each step in the path it is going to increment the counters inside the sizes array. It’s not obvious here, but the default Uniqueness in a Neo4j Traversal is Global Uniqueness, so a node won’t be visited twice if we’ve already seen it.

To test this out, I created a graph with a million nodes and randomly created 14.5 million unique relationships between them. I can query it via REST and it looks like this:

http> get /example/service/network/1
==> 200 OK
==> [1,20,230,3446,48444]

So how fast is it? I turn to the tried and true Gatling Tool and fed it a list of node ids for 20 seconds, repeating the test a few times:

import com.excilys.ebi.gatling.core.Predef._
import com.excilys.ebi.gatling.http.Predef._
import akka.util.duration._
import bootstrap._
 
class NetworkSize extends Simulation {
  val httpConf = httpConfig
    .baseURL("http://localhost:7474")
    .acceptHeader("application/json")
 
  val testfile = csv("test-data.txt").circular
 
  val scn = scenario("Network Size")
    .during(20) {
    feed(testfile)
    .exec(
      http("Test Network Size")
        .get("/example/service/network/${id}")
        .check(status.is(200)))
      .pause(0 milliseconds, 1 milliseconds)
  }
 
  setUp(
    scn.users(8).protocolConfig(httpConf)
  )
}

…and the best result:

Screen Shot 2013-12-31 at 3.30.02 PM

About 60 requests per second with a mean latency of 135ms and a max of 640ms. It’s not bad, but can it go faster? I did a little digging to find out what is going on under the covers and found something interesting in how the Global Uniqueness of our nodes is maintained.

class GloballyUnique extends AbstractUniquenessFilter
{
    private final Set<Long> visited = new HashSet<Long>();
     
    GloballyUnique( PrimitiveTypeFetcher type )
    {
        super( type );
    }
 
    public boolean check( TraversalBranch branch )
    {
        return visited.add( type.getId( branch ) );
    }
...

In the code, we have a Set of Longs that helps us tell if we have seen a node yet. It makes sense, it works, but what if we went a different way. Instead of storing Longs, we can use a BitSet to store true or false whether or not we have seen a node id. Ran into a small problem though… the standard Java BitSet is limited to 2,147,483,647 (an int) and we can have tens of billions of nodes. Not to fear however, OpenBitSet to the rescue. This little class from Lucene can handle up to 64 * 2**32-1 bits, I’ll give you a second to do the math… that’s 274,877,906,943 bits. Perfect.

So I pulled the 1.9 branch from the Neo4j repository from github and made an edit:

class GloballyUnique extends AbstractUniquenessFilter
{
    private final OpenBitSet visited = new OpenBitSet();
  
    GloballyUnique( PrimitiveTypeFetcher type )
    {
        super( type );
    }
  
    public boolean check( TraversalBranch branch )
    {
        if ( visited.get( type.getId( branch ) ) ) {
            return false;
        } else {
            visited.set( type.getId( branch ) );
            return true;
        }
    }
...

I also added “Lucene Core” to the LICENSES.txt and NOTICES.txt files, and modified the pom.xml file:

<properties>
...
  <lucene.version>3.6.2</lucene.version>
</properties>
...
<dependency>
  <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-core</artifactId>
   <version>${lucene.version}</version>
</dependency>

and then in the /neo4j/community/kernel directory I ran:

mvn clean install -DminimalBuild -Dlicense.skip=true

and copied the target/neo4j-kernel-1.9.5.jar into my neo4j/lib directory overwriting the existing kernel and reran my tests:

Screen Shot 2013-12-31 at 3.54.26 PM

Now we’re up to almost 100 requests per second with a mean of 81ms and a maximum of 420ms, almost doubling our previous performance numbers. Not bad for a few lines of code. So please, take a look under the hood of Neo4j and if you see something you don’t like, change it… and send us your pull requests!

By the way… you can go faster if you needed to, but it’s not pretty. I call it the OpenBitSet Dance:

@GET
@Path("/network/{id}")
public Response getNetworkSize(@PathParam("id") Long id, @Context GraphDatabaseService db) throws IOException {
    Node user = db.getNodeById(id);
 
    final Long[] sizes = new Long[] { 1L, 0L, 0L, 0L, 0L };
    OpenBitSet[] seen = new OpenBitSet[]{new OpenBitSet(),new OpenBitSet(),new OpenBitSet(),new OpenBitSet(),new OpenBitSet() };
    seen[0].set(user.getId());
 
    for(Relationship level1 : user.getRelationships(Direction.OUTGOING, FRIENDS )){
        Node friend = level1.getEndNode();
        seen[1].set(friend.getId());
    }
 
    for (int i = seen[1].nextSetBit(0); i >= 0; i = seen[1].nextSetBit(i+1)) {
        Node friend = db.getNodeById((long)i);
        for(Relationship level2 : friend.getRelationships(Direction.OUTGOING, FRIENDS )){
            Node fof = level2.getEndNode();
            seen[2].set(fof.getId());
        }
    }
 
    for (int i = seen[2].nextSetBit(0); i >= 0; i = seen[2].nextSetBit(i+1)) {
        Node friend = db.getNodeById((long)i);
        for(Relationship level2 : friend.getRelationships(Direction.OUTGOING, FRIENDS )){
            Node fof = level2.getEndNode();
            seen[3].set(fof.getId());
        }
    }
 
    for (int i = seen[3].nextSetBit(0); i >= 0; i = seen[3].nextSetBit(i+1)) {
        Node friend = db.getNodeById((long)i);
        for(Relationship level2 : friend.getRelationships(Direction.OUTGOING, FRIENDS )){
            Node fof = level2.getEndNode();
            seen[4].set(fof.getId());
        }
 
    }
    seen[1].andNot(seen[0]);
    seen[2].andNot(seen[1]);
    seen[2].andNot(seen[0]);
    seen[3].andNot(seen[2]);
    seen[3].andNot(seen[1]);
    seen[3].andNot(seen[0]);
    seen[4].andNot(seen[3]);
    seen[4].andNot(seen[2]);
    seen[4].andNot(seen[1]);
    seen[4].andNot(seen[0]);
    sizes[0] = seen[0].cardinality();
    sizes[1] = seen[1].cardinality();
    sizes[2] = seen[2].cardinality();
    sizes[3] = seen[3].cardinality();
    sizes[4] = seen[4].cardinality();
 
    return Response.ok().entity(objectMapper.writeValueAsString(sizes)).build();
}

If you find a way to go faster still, please let me know!


Published at DZone with permission of Max De Marzi, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)