
Andreas Kollegger is a leading speaker and writer on graph databases and Neo4j, and a bridge between community and developer efforts. He works actively in the community, speaking around the world and promoting the larger Neo4j ecosystem of projects. Author of Fair Trade Software and the lead for Neo4j in the cloud, Andreas plays a valuable role in progressive happenings within Neo4j.

Using Spring Data Neo4j for the Hubway Data Challenge

10.14.2012

Editor's Note: This post was originally authored by Michael Hunger of the Neo4j Blog.

 

Using Spring Data Neo4j, it was incredibly easy to model the Hubway Challenge dataset and import it into a Neo4j graph database, making it available for advanced querying and visualization.

The Challenge and Data

Tonight @graphmaven pointed me to the boston.com article about the Hubway Data Challenge.

Hubway is a bike-sharing service which is currently expanding worldwide. For the data challenge they offer CSV data for their 95 Boston stations and about half a million bike rides up to the end of September. The challenge is to answer some of the posed questions and to develop great visualizations (or UIs) for the Hubway dataset. The challenge is also supported by MAPC (the Metropolitan Area Planning Council).

Getting Started

As midnight had just passed and Spring Data Neo4j 2.1.0.RELEASE had been built unofficially during the day, I thought it would be a good exercise to model the data using entities and import it into Neo4j. So the first step was the domain model, which is pretty straightforward.

Based on the Spring Data book example project, I created the pom.xml with the dependencies (org.springframework.data:spring-data-neo4j:2.1.0.RELEASE) and the Spring application context files.
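The relevant dependency section might look like the sketch below. Only the spring-data-neo4j coordinates and version are stated in the post; the JavaCSV coordinates are my assumption for the CSV library used later in the import.

```xml
<!-- Sketch only: spring-data-neo4j coordinates are from the post;
     the javacsv coordinates are assumed. -->
<dependencies>
  <dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-neo4j</artifactId>
    <version>2.1.0.RELEASE</version>
  </dependency>
  <dependency>
    <groupId>net.sourceforge.javacsv</groupId>
    <artifactId>javacsv</artifactId>
    <version>2.0</version>
  </dependency>
</dependencies>
```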

Import Stations

Modelling and importing the Station was the easiest place to start. The entity has several names; one of them (terminalName) is the unique identifier, and the station name itself can be searched via a fulltext index. As Hubway also provides geo-information for the stations, we use the Neo4j Spatial index provider so we can later integrate spatial searches (near, bounding box, etc.).

@NodeEntity
@TypeAlias("Station")
public class Station {
    @GraphId Long id;
     
    @Indexed(numeric = false)
    private Short stationId;
    @Indexed(unique=true)
    private String terminalName;
 
    @Indexed(indexType = IndexType.FULLTEXT, indexName = "stations")
    private String name;
 
    boolean installed, locked, temporary;
 
    double lat, lon;
    @Indexed(indexType = IndexType.POINT, indexName = "locations")
    String wkt;
 
    protected Station() {
    }
 
    public Station(Short stationId, String terminalName, String name,
                   double lat, double lon) {
        this.stationId = stationId;
        this.name = name;
        this.terminalName = terminalName;
        this.lon = lon;
        this.lat = lat;
        // %f is locale-sensitive; replace a decimal comma so the WKT stays valid
        this.wkt = String.format("POINT(%f %f)", lon, lat).replace(",", ".");
    }
}

I used the JavaCSV library for reading the data files. The importer just creates a Spring context and retrieves the service, which comes with injected dependencies and declarative transaction management. The actual import is then as simple as creating entity instances and passing them to the Neo4jTemplate for saving.

ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("classpath:META-INF/spring/application-context.xml");
ImportService importer = ctx.getBean(ImportService.class);
 
CsvReader stationsFile = new CsvReader(stationsCsv);
stationsFile.readHeaders();
importer.importStations(stationsFile);
stationsFile.close();
 
 
public class ImportService {
 
    @Autowired private Neo4jTemplate template;
 
    private final Map<Short, Station> stations = new HashMap<Short, Station>();
 
    @Transactional
    public void importStations(CsvReader stationsFile) throws IOException {
        // id,terminalName,name,installed,locked,temporary,lat,lng
        while (stationsFile.readRecord()) {
            Station station = new Station(asShort(stationsFile,"id"),
                                          stationsFile.get("terminalName"),
                                          stationsFile.get("name"),
                                          asDouble(stationsFile, "lat"),
                                          asDouble(stationsFile, "lng"));
            template.save(station);
            stations.put(station.getStationId(), station);
        }
    }
}
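The asShort and asDouble helpers used above aren't shown in the post. A minimal sketch of what they presumably do (my assumption: they wrap csvReader.get(column) and parse the value, tolerating blank cells) could look like this, with the parsing split out so it works on plain strings:

```java
// Hypothetical parsing helpers (not part of the original post). In the
// importer they would wrap csvReader.get(column) before parsing.
public class CsvValues {

    // Parses a Short, returning null for blank cells.
    public static Short asShort(String value) {
        if (value == null || value.trim().isEmpty()) return null;
        return Short.valueOf(value.trim());
    }

    // Parses a double, returning 0 for blank cells.
    public static double asDouble(String value) {
        if (value == null || value.trim().isEmpty()) return 0d;
        return Double.parseDouble(value.trim());
    }
}
```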

Import Trips

Importing the trips themselves is only a little more involved. In modeling the trip I chose to create a RelationshipEntity called Action to represent the start or end of a trip. That entity connects the trip to a station and holds the date at which it happened. During the import I found a number of data rows to be inconsistent (missing stations), so those were skipped. As half a million entries are a bit too much for a single transaction, I split the import up into batches of 5,000 trips each.

@Transactional
public boolean importTrips(CsvReader trips, int count) throws IOException {
    //"id","status","duration","start_date","start_station_id",
    // "end_date","end_station_id","bike_nr","subscription_type",
    // "zip_code","birth_date","gender"
    while (trips.readRecord()) {
        Station start = findStation(trips, "start_station_id");
        Station end = findStation(trips, "end_station_id");
        if (start==null || end==null) continue;
 
        Member member = obtainMember(trips);
 
        Bike bike = obtainBike(trips);
 
        Trip trip = new Trip(member, bike)
                        .from(start, date(trips.get("start_date")))
                        .to(end, date(trips.get("end_date")));
        template.save(trip);
        count--;
        if (count==0) return true;
    }
    return false;
}
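The 5,000-per-transaction batching boils down to the count-down in the loop above: each call drains at most one batch and reports whether another call (and thus another transaction) is needed. Stripped of the Neo4j and CSV specifics, the pattern looks like this generic sketch (not code from the project):

```java
import java.util.Iterator;
import java.util.List;

// Generic sketch of the batching pattern used by importTrips: consume up to
// batchSize items per call and report whether the batch filled up, i.e.
// whether the caller should invoke it again in a fresh transaction.
public class Batcher {

    public static <T> boolean importBatch(Iterator<T> source, int batchSize, List<T> sink) {
        int count = batchSize;
        while (source.hasNext()) {
            sink.add(source.next());
            if (--count == 0) return true; // batch full, more may remain
        }
        return false; // source exhausted
    }
}
```

The caller simply loops while the method returns true, which is exactly how the trip import is driven: each iteration of the loop runs one @Transactional batch.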

First look at the data

The import runs in about two minutes and produces a Neo4j database (227 MB) that contains all those connections. I uploaded it to our sample dataset site. Grab a Neo4j server, put the content of the zip file into data/graph.db, and it is easy to visualize the graph and run some interesting queries. I list a few below, but they should only be seen as a starting point; feel free to explore and find new and interesting insights.

Stations most often used by a user

 START n=node(205) 
 MATCH n-[:TRIP]->(t)-[:`START`|END]->stat 
 RETURN stat.name,count(*) 
 ORDER BY count(*) DESC LIMIT 5; 

+------------------------------------------------+
| stat.name                           | count(*) |
+------------------------------------------------+
| "South Station - 700 Atlantic Ave." | 22       |
| "Post Office Square"                | 21       |
| "TD Garden - Legends Way"           | 10       |
| "Boylston St. at Arlington St."     | 5        |
| "Rowes Wharf - Atlantic Ave"        | 5        |
+------------------------------------------------+
5 rows
31 ms 

Most beloved bikes

  START bike=node:Bike("bikeId:*") 
  MATCH bike<-[:BIKE]-trip 
  RETURN bike.bikeId,count(*) 
  ORDER BY count(*) DESC LIMIT 5;

+------------------------+
| bike.bikeId | count(*) |
+------------------------+
| "B00145"    | 1074     |
| "B00114"    | 1065     |
| "B00538"    | 1061     |
| "B00490"    | 1059     |
| "B00401"    | 1057     |
+------------------------+
5 rows
2906 ms

Heroku

The data can also easily be added to a Heroku Neo4j add-on, and from there you can use any programming language and rendering framework (d3, jsPlumb, Raphaël, Processing) to visualize the dataset.

What's next

Next steps for us are to import the supplied Boston shapefile and the stations into the Neo4j database as well, connect them with the data, and create a cool visualization. I rely on @maxdemarzi for it to be awesome. Another path to follow is to craft more advanced Cypher queries for exploring the dataset, and to make them and their results available.

Boston Hubway Data Challenge Hackathon

Hubway will host a Hack Day at The Bocoup Loft in Downtown Boston on Saturday, October 27, 2012. Register here and spread some graph love.


Published at DZone with permission of Andreas Kollegger, author and DZone MVB. (source)

