NoSQL Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 524 posts at DZone. You can read more from them at their website. View Full User Profile

Neo4j / Cypher: Getting Rid of an Optional Match

10.15.2013
| 2018 views |
  • submit to reddit
I was looking back over some of the queries I wrote for my football data set and I came across one I’d written to work out how many goals players scored in matches that were televised.

The data model looks like this:

2013 10 13 22 31 19

My initial query to work out the top 10 scorers in televised games was as follows:

MATCH (player:Player)
WITH player
MATCH player-[:played|subbed_on]->stats-[:in]->game-[t?:on_tv]->channel
WITH COLLECT({goals: stats.goals, type: TYPE(t)}) AS games, player
RETURN player.name,
       REDUCE(goals = 0, h IN [g IN games WHERE g.type IS NULL] | goals + h.goals) AS nonTvGoals,
       REDUCE(goals = 0, h IN [g IN games WHERE g.type <> NULL] | goals + h.goals) AS tvGoals,
       REDUCE(goals = 0, h in games | goals + h.goals) AS totalGoals
ORDER BY tvGoals DESC
LIMIT 10

If we evaluate the query we see that Robin Van Persie scored most goals in televised games:

==> +--------------------------------------------------------+
==> | player.name        | nonTvGoals | tvGoals | totalGoals |
==> +--------------------------------------------------------+
==> | "Robin Van Persie" | 11         | 15      | 26         |
==> | "Gareth Bale"      | 8          | 13      | 21         |
==> | "Luis Suárez"      | 12         | 11      | 23         |
==> | "Theo Walcott"     | 5          | 9       | 14         |
==> | "Demba Ba"         | 7          | 8       | 15         |
==> | "Edin Dzeko"       | 7          | 7       | 14         |
==> | "Santi Cazorla"    | 5          | 7       | 12         |
==> | "Juan Mata"        | 6          | 6       | 12         |
==> | "Steven Gerrard"   | 3          | 6       | 9          |
==> | "Carlos Tevez"     | 5          | 6       | 11         |
==> +--------------------------------------------------------+
==> 10 rows
==> 520 ms

The query takes 1/2 second to evaluate even after running it a few times and since using optional patterns often leads to slow queries I thought I should try and remove the optional on_tv relationship.

I introduced it so that I could determine which matches were on TV, which then allowed me to aggregate goals scored in other games as well.

Wes helped me do this although his first suggestion was to replace the REDUCE with a SUM to simplify the query. The CASE statement makes this easier:

MATCH (player:Player)
WITH player
MATCH player-[:played|subbed_on]->stats-[:in]->game-[t?:on_tv]->channel
WITH {goals: stats.goals, type: TYPE(t)} AS game, player
RETURN player.name, 
       SUM(CASE WHEN game.type IS NULL THEN game.goals ELSE 0 END) as nonTvGoals,
       SUM(CASE WHEN game.type <> NULL THEN game.goals ELSE 0 END) as tvGoals,
       SUM(game.goals) as allGoals
ORDER BY tvGoals DESC
LIMIT 10

Our next step is to split up the finding of stats nodes (which represent a player’s performance in a match) from the finding of the TV channel that the match may have been shown on.

MATCH (player:Player)-[:played|subbed_on]->stats
WITH player, stats
MATCH stats-[:in]->game-[t?:on_tv]->channel
WITH {goals: stats.goals, type: TYPE(t)} AS game, player
RETURN player.name, 
       SUM(CASE WHEN game.type IS NULL THEN game.goals ELSE 0 END) as nonTvGoals,
       SUM(CASE WHEN game.type <> NULL THEN game.goals ELSE 0 END) as tvGoals,
       SUM(game.goals) as allGoals
ORDER BY tvGoals DESC
LIMIT 10

Finally we can make the on_tv relationship explicit and change the way we sum the goals slightly:

MATCH (player:Player)-[:played|subbed_on]->stats
WITH stats.goals AS goals, player, stats-[:in]->()-[:on_tv]-() as onTv
RETURN player.name, 
       SUM(CASE WHEN onTv = FALSE THEN goals ELSE 0 END) as nonTvGoals,
       SUM(CASE WHEN onTv = TRUE THEN goals ELSE 0 END) as tvGoals,
       SUM(goals) as allGoals
ORDER BY tvGoals DESC
LIMIT 10

If we run that a few times the query execution time has now halved:

==> +------------------------------------------------------+
==> | player.name        | nonTvGoals | tvGoals | allGoals |
==> +------------------------------------------------------+
==> | "Robin Van Persie" | 11         | 15      | 26       |
==> | "Gareth Bale"      | 8          | 13      | 21       |
==> | "Luis Suárez"      | 12         | 11      | 23       |
==> | "Theo Walcott"     | 5          | 9       | 14       |
==> | "Demba Ba"         | 7          | 8       | 15       |
==> | "Santi Cazorla"    | 5          | 7       | 12       |
==> | "Edin Dzeko"       | 7          | 7       | 14       |
==> | "Carlos Tevez"     | 5          | 6       | 11       |
==> | "Juan Mata"        | 6          | 6       | 12       |
==> | "Steven Gerrard"   | 3          | 6       | 9        |
==> +------------------------------------------------------+
==> 10 rows
==> 219 ms

Thanks Wes for showing me how to do this!


Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)