
I am a Web Science PhD student at the University of Koblenz and the founder of http://www.metalcon.de. Social news streams are my research interest. René is a DZone MVB, is not an employee of DZone, and has posted 36 posts at DZone.

Titan Can Handle 2400 Concurrent Users Against a Graph Cluster in Real-Time

09.06.2013
Sorry to start with the conclusion… To me, Titan graph seems to be the egg-laying wool-milk-sow (the German idiom for an all-in-one solution) that people dream of when working with graph data, especially if one needs graph data in a web context in real time. I will certainly try to free up some time to check it out and get hands-on.

This thing is really new and revolutionary. This is not just another Hadoop or Giraph approach for big data processing. This is distributed in real time! I am almost confident that if the benchmark holds what it promises, Titan will be one of the fastest-growing technologies we have seen so far.

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation for the German national student high school academy. At that time I was still more of a mathematician than a computer scientist, but my journey toward becoming a computer scientist had just started. Matthias was in the middle of his PhD program, and I valued his insights and experiences a lot. It was thanks to him that my eyes were opened for the first time to what big data really means and how companies like Facebook, Google, etc. knit their business models around collecting data. Matthias truly influenced me, and I have a lot of respect for him.

We lost contact, and I did not start my PhD right away. I knew he was interested in graphs, but that was about it. When I first started to use Neo4j, I realized that Matthias was also one of the authors of the TinkerPop Blueprints, which are interfaces for talking to graphs. Most vendors of graph databases use them. At that time, I looked him up again and realized he was working on Titan, a distributed graph database. I found this promising-looking slide deck:

Slide 106:

Slide 107:

But at that time I did not see much evidence that Titan would really deliver on the promises made on slides 106 and 107. In fact, those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now, I am kind of amused, since I did not really aim for the important points the way Titan did.)

During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button, and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an open-source implementation of Google's Bigtable) and Cassandra (which follows Amazon's Dynamo design), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines, I stumbled onto Titan again, and seeing Matthias' talk at the Cassandra Summit 2013 got me excited. Minutes 21 to 22 of the talk are especially interesting. I also suggest you skip the first 15 minutes of the talk:

Let me sum up the amazing parts of the talk:

  • 2400 concurrent users against a graph cluster!
  • real time!
  • 16 different non-trivial queries
  • achieving more than 10k requests answered per second!
  • graph with more than a billion nodes!
  • graph partitioning is pluggable
  • graph schema helps indexing for queries
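
A back-of-the-envelope reading of the first and fourth figures, as a sketch and not a number quoted from the talk: dividing the stated throughput by the stated number of concurrent users gives the average request rate each simulated user sustains. Both inputs are the lower bounds from the list above, and the variable names are illustrative.

```python
# Figures taken from the list above; both are stated as lower bounds.
concurrent_users = 2400
requests_per_second = 10_000

# Average load contributed by one simulated user.
per_user_rate = requests_per_second / concurrent_users

# Equivalent view: average gap between two requests of the same user.
seconds_between_requests = concurrent_users / requests_per_second

print(f"{per_user_rate:.2f} requests per user per second")     # 4.17
print(f"one request every {seconds_between_requests:.2f} s")   # 0.24
```

In other words, each simulated user would be issuing roughly four requests per second on average, which is a far heavier pattern than a typical human clicking through a web page.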

So far I was not sure what kind of queries were really involved, and especially whether there were also write transactions. Unfortunately, no one in the audience asked that question. So I started googling and found this blog post by Aurelius. It gives an overview of all the queries and presents the results in much more detail. Unfortunately, I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On average, most queries take less than half a second.

Even though the source code is not available, this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I have come across during my PhD program. Aurelius thought about distribution from the start and made some clever design decisions:
  • scaling data size
  • scaling data access in terms of concurrent users (especially write operations), which is fundamentally built in and also seems to work well
  • making partitioning pluggable
  • requiring a schema for the graph (to enable efficient indexing)
  • being able to extend the schema at runtime
  • building on top of either Cassandra (for real time) or HBase (for consistency)
  • being compatible with the TinkerPop tech stack
  • bringing up an entire framework for analytics and graph processing.
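
To make three of these decisions more concrete (pluggable partitioning, a required schema that drives indexing, and extending the schema at runtime), here is a minimal toy sketch in Python. It is not Titan's API and says nothing about how Titan implements these ideas internally; `TinyGraph`, `declare_key`, `hash_partitioner`, and `prefix_partitioner` are hypothetical names invented for illustration only.

```python
import zlib
from typing import Callable, Dict, List, Set

# All names here are hypothetical; none of this is Titan's real API.
Partitioner = Callable[[str, int], int]


def hash_partitioner(vertex_id: str, num_partitions: int) -> int:
    """Spread vertices evenly using a stable checksum of the full id."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_partitions


def prefix_partitioner(vertex_id: str, num_partitions: int) -> int:
    """Co-locate ids that share a prefix such as 'band:' on one partition."""
    prefix = vertex_id.split(":", 1)[0]
    return zlib.crc32(prefix.encode("utf-8")) % num_partitions


class TinyGraph:
    """Toy in-memory vertex store with a schema and a swappable partitioner."""

    def __init__(self, num_partitions: int,
                 partitioner: Partitioner = hash_partitioner) -> None:
        self.num_partitions = num_partitions
        self.partitioner = partitioner
        self.partitions: List[Dict[str, dict]] = [{} for _ in range(num_partitions)]
        self.schema: Dict[str, bool] = {}  # property key -> is it indexed?
        self.index: Dict[str, Dict[object, Set[str]]] = {}

    def declare_key(self, key: str, indexed: bool = False) -> None:
        """Add a new property key to the schema; this may happen at any time."""
        if key in self.schema:
            raise ValueError(f"property key {key!r} is already declared")
        self.schema[key] = indexed
        if indexed:
            self.index[key] = {}

    def add_vertex(self, vertex_id: str, **props: object) -> int:
        """Store a vertex and return the partition it was assigned to."""
        for key in props:
            if key not in self.schema:
                raise KeyError(f"property key {key!r} is not in the schema")
        part = self.partitioner(vertex_id, self.num_partitions)
        self.partitions[part][vertex_id] = props
        for key, value in props.items():
            if self.schema[key]:
                self.index[key].setdefault(value, set()).add(vertex_id)
        return part

    def find(self, key: str, value: object) -> Set[str]:
        """Use the index when the schema declares one, else scan everything."""
        if self.schema.get(key):
            return set(self.index[key].get(value, set()))
        return {vid for part in self.partitions
                for vid, props in part.items() if props.get(key) == value}


graph = TinyGraph(num_partitions=4, partitioner=prefix_partitioner)
graph.declare_key("name", indexed=True)
p1 = graph.add_vertex("band:1", name="Metallica")
p2 = graph.add_vertex("band:2", name="Slayer")
graph.declare_key("genre")  # schema extended at runtime, without an index
graph.add_vertex("band:3", name="Opeth", genre="progressive")

print(p1 == p2)                            # True: same prefix, same partition
print(graph.find("name", "Slayer"))        # {'band:2'}
print(graph.find("genre", "progressive"))  # {'band:3'}
```

The point of the sketch is only the shape of the design: because the store knows up front which keys exist and which are indexed, a lookup on `name` can go straight to an index, while swapping `prefix_partitioner` for `hash_partitioner` changes data placement without touching any query code.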


Published at DZone with permission of René Pickhardt, author and DZone MVB. (source)
