
Online Payment Risk Management with Neo4j



Finding the relationships that should not be there is a great use case for Neo4j, and today I want to highlight an example of why. When you purchase something online, the merchant hands your information off to a payment gateway, which processes the actual payment. Before the gateway accepts the transaction, it runs it through a series of risk management tests to validate that it is a real transaction and to protect itself from fraud. One of the hardest things for SQL-based systems to do is cross-check the incoming payment information against existing data, looking for relationships that shouldn't be there.

For example, given a credit card number, a phone number, an email address and an IP address, find:

1. How many unique phone numbers, emails and IP addresses are tied to the given credit card.
2. How many unique credit cards, emails, and IP addresses are tied to the given phone number.
3. How many unique credit cards, phone numbers and IP addresses are tied to the given email.
4. How many unique credit cards, phone numbers and emails are tied to the given IP address.

A high number of connections could mean a high potential for fraud. Given that the user is sitting in front of their computer waiting to see if the merchant accepted their credit card, these queries need to return as quickly as possible, and at high volume to handle peak load. So we're going to build an unmanaged extension to perform this query quickly over the REST API, a data generator to give us something to test against, and a performance test to see just how fast Neo4j can answer these types of queries.
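Before diving into the extension itself, the counting we're after can be sketched in a few lines of Ruby. This is a hypothetical in-memory model (a plain array of edges and a `cross_reference` helper of my own, not the Neo4j code) just to make the cross-reference idea concrete:

```ruby
# Hypothetical in-memory sketch (plain Ruby, not the Neo4j extension):
# nodes are [type, value] pairs, edges are undirected RELATED links.
edges = [
  [["cc", "1"], ["phone", "555-0100"]],
  [["cc", "1"], ["email", "a@example.com"]],
  [["cc", "1"], ["email", "b@example.com"]],
  [["cc", "1"], ["ip", ""]]
]

# Count, for one node, how many nodes of each other type it touches.
def cross_reference(node, edges)
  counts = { "ccs" => 0, "phones" => 0, "emails" => 0, "ips" => 0 }
  edges.each do |a, b|
    other = a == node ? b : (b == node ? a : nil)
    counts[other.first + "s"] += 1 if other
  end
  counts
end

cross_reference(["cc", "1"], edges)
# => {"ccs"=>0, "phones"=>1, "emails"=>2, "ips"=>1}
```

A real graph database does exactly this kind of neighbor counting, but without scanning every edge, which is what makes it fast at scale.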

We’ll start with a unit test, so let’s build some data:

Node cc1 = createNode(db, "1", "cc");
Node phone1 = createNode(db, "1234567890", "phone");
Node email1 = createNode(db, "", "email");
Node ip1 = createNode(db, "", "ip");
Node cc2 = createNode(db, "2", "cc");

Our createNode method creates a node, sets a property named after the type to the value we passed in, and adds the newly created node to the index for its type.

private Node createNode(GraphDatabaseService db, String value, String type) {
    Index<Node> index = db.index().forNodes(type + "s");
    Node node = db.createNode();
    node.setProperty(type, value);
    index.add(node, type, value);
    return node;
}
We’ll also need to create some relationships to tie them together:

cc1.createRelationshipTo(phone1, RELATED);
cc1.createRelationshipTo(email1, RELATED);
cc1.createRelationshipTo(ip1, RELATED);

Since we'll be using this over the REST API, we'll prepare a request in JSON format, pass it to our crossReference method (which we'll write next), and check the actual response against our expected value:

@Test
public void crossReference1() throws IOException {
    String requestOne;
    requestOne = "{\"cc\" : \"1\","
            + "\"phone\" : \"1234567890\", "
            + "\"email\" : \"\", "
            + "\"ip\" : \"\"}";
    Response response = service.crossReference(requestOne, db);
    List<HashMap<String,Integer>> actual = objectMapper.readValue((String) response.getEntity(), List.class);
    // ...prepare expected value...
    assertEquals(expected, actual);
}

We’ll expect a JSON POST request with a hash of the 4 attributes of our payment, and prepare a result list which will hold our answers:

public Response crossReference(String body, @Context GraphDatabaseService db) throws IOException {
    List<Map<String, AtomicInteger>> results = new ArrayList<Map<String, AtomicInteger>>();
    HashMap input = objectMapper.readValue( body, HashMap.class);

Then we'll look up the credit card, phone number, email and ip in their respective indexes and add them to a list of nodes:

ArrayList<Node> nodes = new ArrayList<Node>();
IndexHits<Node> ccIndex = db.index().forNodes("ccs").get("cc", input.get("cc"));
IndexHits<Node> phoneIndex = db.index().forNodes("phones").get("phone", input.get("phone"));
IndexHits<Node> emailIndex = db.index().forNodes("emails").get("email", input.get("email"));
IndexHits<Node> ipIndex = db.index().forNodes("ips").get("ip", input.get("ip"));
nodes.add (ccIndex.getSingle());
nodes.add (phoneIndex.getSingle());
nodes.add (emailIndex.getSingle());
nodes.add (ipIndex.getSingle());

For each of the nodes, we’ll start with an empty map of counters, and traverse the “RELATED” relationship in both directions, incrementing the type of node we find on the other end in our map:

for(Node node : nodes){
    HashMap<String, AtomicInteger> crosses = new HashMap<String, AtomicInteger>();
    crosses.put("ccs", new AtomicInteger(0));
    crosses.put("phones", new AtomicInteger(0));
    crosses.put("emails", new AtomicInteger(0));
    crosses.put("ips", new AtomicInteger(0));
    if(node != null){
        for ( Relationship relationship : node.getRelationships(RELATED, Direction.BOTH) ){
            Node thing = relationship.getOtherNode(node);
            String type = thing.getPropertyKeys().iterator().next() + "s";
            crosses.get(type).incrementAndGet();
        }
    }
    results.add(crosses);
}

Finally we’ll return our results:

return Response.ok().entity(objectMapper.writeValueAsString(results)).build();

… and that's it. Seriously. Our results are very simple, since they are meant to be parsed and processed by another method that does the actual risk analysis. In the sample result below, the credit card used returned 4 ips, 7 emails and 4 phone numbers, which increases the odds that it may be fraudulent.

[{"ips":4,"emails":7,"ccs":0,"phones":4}, -- cc returned 4 ips, 7 emails, and 4 phones.
{"ips":1,"emails":1,"ccs":1,"phones":0}, -- phone returned just 1 of each other item.
{"ips":2,"emails":0,"ccs":4,"phones":3}, -- email returned 2 ips, 4 credit cards and 3 phones.
{"ips":0,"emails":1,"ccs":3,"phones":2}] -- ip returned 1 email, 3 credit cards and 2 phones.
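To illustrate, a downstream consumer might parse this response and flag a payment when any count crosses a cutoff. This is a hypothetical sketch of my own (the threshold value and the risky? helper are assumptions, not part of the extension):

```ruby
require 'json'

# Hypothetical downstream check: flag the payment when any single
# identifier is tied to more than THRESHOLD other identifiers.
THRESHOLD = 3 # assumed cutoff, purely for illustration

def risky?(response_json, threshold = THRESHOLD)
  JSON.parse(response_json).any? do |cross|
    cross.values.any? { |count| count > threshold }
  end
end

risky?('[{"ips":4,"emails":7,"ccs":0,"phones":4}]') # => true
risky?('[{"ips":1,"emails":1,"ccs":1,"phones":0}]') # => false
```

A real risk engine would weigh the counts rather than apply a single cutoff, but the point stands: the graph query's job is only to deliver the counts fast.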

Now that we have our method and unit test passing, we need to generate some data. We'll start with the root of where this data comes from: processed transactions. We'll create 50k transactions, and every 100 transactions we'll generate some potentially fraudulent data by adding between 1 and 10 additional transactions that share some of the same fields. To make our life easier, we'll use a random number to represent the hashed credit card number and use the Faker gem to build realistic data for our other fields:

require 'faker'

transactions ="transactions.csv", "a")
50000.times do |t|
  values = [rand.to_s[2..8], Faker::PhoneNumber.short_phone_number, Faker::Internet.email, Faker::Internet.ip_v4_address]
  transactions.puts values.join(",")
  if (t % 100 == 0)
    rand(1..10).times do
      # Select 1, 2 or 3 fields to change
      change = [0,1,2,3].sample(rand(1..3))
      newvalues = [rand.to_s[2..8], Faker::PhoneNumber.short_phone_number, Faker::Internet.email, Faker::Internet.ip_v4_address]
      change.each do |c|
        values[c] = newvalues[c]
      end
      transactions.puts values.join(",")
    end
  end
end
transactions.close

With our transactions.csv file we’ll next extract the unique credit cards, phones, emails and ips into their own files:

require 'csv'

ccs ="ccs.csv", "a")
phones ="phones.csv", "a")
emails ="emails.csv", "a")
ips ="ips.csv", "a")
CSV.foreach('transactions.csv', :headers => true) do |row|
  ccs.puts row[0]
  phones.puts row[1]
  emails.puts row[2]
  ips.puts row[3]
end
[ccs, phones, emails, ips].each(&:close)
%x[awk ' !x[$0]++' ccs.csv > ccs_unique.csv]
%x[awk ' !x[$0]++' phones.csv > phones_unique.csv]
%x[awk ' !x[$0]++' emails.csv > emails_unique.csv]
%x[awk ' !x[$0]++' ips.csv > ips_unique.csv]
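The awk ' !x[$0]++ ' trick prints a line only the first time it is seen, deduplicating while preserving order. If you'd rather stay in Ruby, an equivalent (a hypothetical dedup_lines helper, not from the original script) could look like:

```ruby
# Order-preserving dedup, equivalent to awk ' !x[$0]++ ':
# a line is written to the output only the first time it appears.
def dedup_lines(input, output)
  seen = {}
  File.open(output, "w") do |out|
    File.foreach(input) do |line|
      next if seen[line]
      seen[line] = true
      out.write line
    end
  end
end

# e.g. dedup_lines("ccs.csv", "ccs_unique.csv")
```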

…and we’ll do the same thing for the relationships:

ccs_to_phones ="ccs_to_phones.csv", "a")
ccs_to_emails ="ccs_to_emails.csv", "a")
ccs_to_ips ="ccs_to_ips.csv", "a")
phones_to_emails ="phones_to_emails.csv", "a")
phones_to_ips ="phones_to_ips.csv", "a")
emails_to_ips ="emails_to_ips.csv", "a")
CSV.foreach('transactions.csv', :headers => true) do |row|
  ccs_to_phones.puts [row[0], row[1], "RELATED"].join("\t")
  ccs_to_emails.puts [row[0], row[2], "RELATED"].join("\t")
  ccs_to_ips.puts [row[0], row[3], "RELATED"].join("\t")
  phones_to_emails.puts [row[1], row[2], "RELATED"].join("\t")
  phones_to_ips.puts [row[1], row[3], "RELATED"].join("\t")
  emails_to_ips.puts [row[2], row[3], "RELATED"].join("\t")
end
[ccs_to_phones, ccs_to_emails, ccs_to_ips, phones_to_emails, phones_to_ips, emails_to_ips].each(&:close)
%x[awk ' !x[$0]++' ccs_to_phones.csv > ccs_to_phones_unique.csv]
%x[awk ' !x[$0]++' ccs_to_emails.csv > ccs_to_emails_unique.csv]
%x[awk ' !x[$0]++' ccs_to_ips.csv > ccs_to_ips_unique.csv]
%x[awk ' !x[$0]++' phones_to_emails.csv > phones_to_emails_unique.csv]
%x[awk ' !x[$0]++' phones_to_ips.csv > phones_to_ips_unique.csv]
%x[awk ' !x[$0]++' emails_to_ips.csv > emails_to_ips_unique.csv]

With our data generated, we are now ready to import it into Neo4j using the Batch Importer. Much has changed since my last blog post about the batch importer. Michael Hunger has made our life easier by allowing us to specify a way to look up nodes by an indexed property instead of having to come up with their node ids directly. The emails_unique.csv now starts with a header like this:

email:string:emails
...one unique email address per line...

Where the header is telling the importer that each line is an "email" property of type "string", indexed in the "emails" index. We'll set up our batch importer configuration file to use all the unique csv files we created and to configure our indexes for us as well.


Now we can run the batch importer to load our data:

java -server -Xmx4G -jar batch-import-jar-with-dependencies.jar neo4j/data/graph.db

After we configure our unmanaged extension and start the server, we can write our performance test using Gatling as we’ve done before. We’ll use the transactions.csv file we created earlier as our test data, and send a JSON string containing our values to the URL we setup earlier:

class TestCrossReference extends Simulation {
  val httpConf = httpConfig
  val testfile = csv("transactions.csv").circular
  val scn = scenario("Cross Reference via Unmanaged Extension")
    .during(30) {
      feed(testfile)
      .exec(
        http("Post Cross Reference Request")
          .post(...) // the unmanaged extension URL we set up earlier
          .body("""{"cc": "${cc}", "phone": "${phone}", "email": "${email}", "ip": "${ip}" }""")
      )
      .pause(0 milliseconds, 1 milliseconds)
    }
}

…and drumroll please:


1246 requests per second with a mean latency of 11ms on my laptop. As long as your dataset can be held in memory, Neo4j will maintain these numbers regardless of your overall database size, since performance is affected only by the number of relationships traversed in each query. I've already shown you how you can scale up; if you need more throughput, a cluster of Neo4j instances can deliver it by scaling out. The code for everything shown here is available on github as always, so please don't take my word for it: try it out yourself.

Published at DZone with permission of Max De Marzi, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)