NoSQL Zone is brought to you in partnership with:

MySQL vs. Neo4j on a Large-Scale Graph Traversal

12.05.2011
| 24600 views |
  • submit to reddit

This post presents an analysis of MySQL (a relational database) and Neo4j (a graph database) in a side-by-side comparison on a simple graph traversal.



The data set that was used was an artificially generated graph with natural statistics. The graph has 1 million vertices and 4 million edges. The degree distribution of this graph on a log-log plot is provided below. A visualization of a 1,000 vertex subset of the graph is diagrammed above.

Loading the Graph

The graph data set was loaded both into MySQL and Neo4j. In MySQL a single table was used with the following schema.

CREATE TABLE graph (
  outV INT NOT NULL,
  inV INT NOT NULL
);
CREATE INDEX outV_index USING BTREE ON graph (outV);
CREATE INDEX inV_index USING BTREE ON graph (inV);

After loading the data, the table appears as below. The first line reads: “vertex 0 is connected to vertex 1.”

mysql> SELECT * FROM graph LIMIT 10;
+------+-----+
| outV | inV |
+------+-----+
|    0 |   1 |
|    0 |   2 |
|    0 |   6 |
|    0 |   7 |
|    0 |   8 |
|    0 |   9 |
|    0 |  10 |
|    0 |  12 |
|    0 |  19 |
|    0 |  25 |
+------+-----+
10 rows in set (0.04 sec)

The 1 million vertex graph data set was also loaded into Neo4j. In Gremlin, the graph edges appear as below. The first line reads: “vertex 0 is connected to vertex 992915.”

gremlin> g.E[1..10]
==>e[183][0-related->992915]
==>e[182][0-related->952836]
==>e[181][0-related->910150]
==>e[180][0-related->897901]
==>e[179][0-related->871349]
==>e[178][0-related->857804]
==>e[177][0-related->798969]
==>e[176][0-related->773168]
==>e[175][0-related->725516]
==>e[174][0-related->700292]

Warming Up the Caches

Before traversing the graph data structure in both MySQL and Neo4j, each database had a “warm up” procedure run on it. In MySQL, a “SELECT * FROM graph” was evaluated and all of the results were iterated through. In Neo4j, every vertex in the graph was iterated through and the outgoing edges of each vertex were retrieved. Finally, for both MySQL and Neo4j, the experiment discussed next was run twice in a row and the results of the second run were evaluated.

Traversing the Graph

The traversal that was evaluated on each database started from some root vertex and emanated n-steps out. There was no sorting, no distinct-ing, etc. The only two variables for the experiments are the length of the traversal and the root vertex to start the traversal from. In MySQL, the following 5 queries denote traversals of length 1 through 5. Note that the “?” is a variable parameter of the query that denotes the root vertex.

SELECT a.inV FROM graph as a WHERE a.outV=?

SELECT b.inV FROM graph as a, graph as b WHERE a.inV=b.outV AND a.outV=?

SELECT c.inV FROM graph as a, graph as b, graph as c WHERE a.inV=b.outV AND b.inV=c.outV AND a.outV=?

SELECT d.inV FROM graph as a, graph as b, graph as c, graph as d WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND a.outV=?

SELECT e.inV FROM graph as a, graph as b, graph as c, graph as d, graph as e WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND d.inV=e.outV AND a.outV=?

For Neo4j, the Blueprints Pipes framework was used. A pipe of length n was constructed using the following static method.

public static Pipeline createPipeline(final Integer steps) {
  final ArrayList pipes = new ArrayList();
  for (int i = 0; i < steps; i++) {
    Pipe pipe1 = new VertexEdgePipe(VertexEdgePipe.Step.OUT_EDGES);
    Pipe pipe2 = new EdgeVertexPipe(EdgeVertexPipe.Step.IN_VERTEX);
    pipes.add(pipe1);
    pipes.add(pipe2);
  }
  return new Pipeline(pipes);
}

For both MySQL and Neo4j, the results of the query (SQL and Pipes) were iterated through. Thus, all results were retrieved for each query. In MySQL, this was done as follows.

while (resultSet.next()) {
  resultSet.getInt(finalColumn);
}

In Neo4j, this is done as follows.

while (pipeline.hasNext()) {
  pipeline.next();
}

Experimental Results

The artificial graph dataset was constructed with a “rich get richer“, preferential attachment model. Thus, the vertices created earlier are the most dense (i.e. highest number of adjacent vertices). This property was used to limit the amount of time it would take to evaluate the tests for each traversal. Only the first 250 vertices were used as roots of the traversals. Before presenting timing results, note that all of these experiments were run on a MacBook Pro with a 2.66GHz Intel Core 2 Duo and 4Gigs of RAM at 1067 MHz DDR3. The packages used were Java 1.6, MySQL JDBC 5.0.8, and Blueprints Pipes 0.1.2.

java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)

The following Java Virtual Machine parameters were used:

-Xmx1000M -Xms500M

Below are the total running times for both MySQL (red) and Neo4j (blue) for traversals of length 1, 2, 3, and 4.

The raw data is presented below along with the total number of vertices returned by each traversal—which, of course, is the same for both MySQL and Neo4j given that its the same graph data set being processed. Also realize that traversals can loop and thus, many of the same vertices are returned multiple times. Finally, note that only Neo4j has the running time for a traversal of length 5. MySQL did not finish after waiting 2 hours to complete. In comparison, Neo4j took 14.37 minutes to complete a 5 step traversal.

[mysql steps-1] time(ms):124 -- vertices_returned:11360
[mysql steps-2] time(ms):922 -- vertices_returned:162640
[mysql steps-3] time(ms):8851 -- vertices_returned:2206437
[mysql steps-4] time(ms):112930 -- vertices_returned:28125623
[mysql steps-5] N/A

[neo4j steps-1] time(ms):27 -- vertices_returned:11360
[neo4j steps-2] time(ms):474 -- vertices_returned:162640
[neo4j steps-3] time(ms):3366 -- vertices_returned:2206437
[neo4j steps-4] time(ms):49312 -- vertices_returned:28125623
[neo4j steps-5] time(ms):862399 -- vertices_returned:358765631

Next, the individual data points for both MySQL and Neo4j are presented in the plot below. Each point denotes how long it took to return n number of vertices for the varying traversal lengths.

Finally, the data below provides the number of vertices returned per millisecond (on average) for each of the traversals. Again, MySQL did not finish in its 2 hour limit for a traversal of length 5.

[mysql steps-1] vertices/ms:91.6128847554668
[mysql steps-2] vertices/ms:176.399127537985
[mysql steps-3] vertices/ms:249.286746556076
[mysql steps-4] vertices/ms:249.053599519823
[mysql steps-5] N/A

[neo4j steps-1] vertices/ms:420.740351166341
[neo4j steps-2] vertices/ms:343.122344772028
[neo4j steps-3] vertices/ms:655.507125256186
[neo4j steps-4] vertices/ms:570.360621871775
[neo4j steps-5] vertices/ms:416.00886711325

Conclusion

In conclusion, given a traversal of an artificial graph with natural statistics, the graph database Neo4j is more optimal than the relational database MySQL. However, no attempts have been made to optimize the Java VM, the SQL queries, etc. These experiments were run with both Neo4j and MySQL “out of the box” and with a “natural syntax” for both types of queries.

Source: http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/


Published at DZone with permission of Marko Rodriguez, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Thomas Eichberger replied on Mon, 2011/12/05 - 2:57am

Thank you very much for the interesting comparision.

André Pankraz replied on Mon, 2011/12/05 - 3:15am

interesting...i thought the difference would be _much_ bigger.

unoptimized SQL only 2-3 times slower than the highly specialized graph DB?

 

and now we take a real database and some database admin know how and...?

then we take into account: mature products (obscure 0.1 libs...yeah), standard SQL (this gremlin syntax...not for me), many SQL based tools like reporting - or maybe we like some binary data in the nodes? references to not graph based data? hmmm...somewhat disappointed.

 

Matthew Montgomery replied on Mon, 2011/12/05 - 3:36pm

Hrm,  This doesn't say which version of mysqld is used, neither does it say which storage engine.  Perhaps a more interesting comparison would be comparing Neo4j and MySQL ndbcluster engine which in ndb-7.2 can push joins down to the storage and process them in parallel over multiple cores and machines whereas InnoDB or MyISAM can use only 1 thread per query.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.