Coming from a background of Aerospace Engineering, John soon discovered that his true interest lay at the intersections of information technology and entrepreneurship (and when applicable - math). In early 2011, John stepped away from his day job to take up software consulting. Finally John found permanent employment at OpenSource Connections where he currently consults large enterprises about full-text search and Big Data applications. Highlights to this point have included prototyping the future of search with the US Patent and Trademark Office, implementing the search syntax used by patent examiners, and building a Solr search relevancy tuning framework called SolrPanl.

Getting Started with Neo4J Using Your Twitter Data

12.02.2013

When learning a new technology it’s best to have a toy problem in mind so that you’re not just reimplementing another glorified “Hello World” project. Also, if you need lots of data, it’s best to pull in a fun data set that you already have some familiarity with. This lets you lean on intuition you have already built about the data so that you can make use of the technology more quickly. (As an aside, this is exactly why we so regularly use the StackExchange SciFi data set when presenting our new ideas about Solr.)

When approaching a graph database technology like Neo4J, if you’re as avid a Twitter user as I am then POOF, you already have the best possible data set for becoming familiar with the technology: your own social network. This blog post will help you download and set up Neo4J, set up a Twitter app (needed to access the Twitter API), and pull down your social network as well as any other social network you might be interested in. At that point we’ll interrogate the network using Neo4J and the Cypher query language. Let’s go!

Installing and Setting Up Neo4J

Since we’re not setting Neo4J up for production use, this part’s real easy. Just go to the Neo4J download page, click on that giant blue download button, and 36.1M later you’ll have your very own copy of Neo4J. Unzip it to some reasonable place on your machine, cd into that directory, and simply issue the command bin/neo4j start. (Once you’re finished, a bin/neo4j stop will shut Neo4J down.) Now if you point your browser at http://localhost:7474 and see stuff (rather than lack of stuff), then you’re ready to start shoveling data into Neo4J.
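If you would rather check from a script than from the browser, here is a minimal sketch of the same "do I see stuff?" test. It is not part of the TwitterScraper script; the function name neo4j_is_up is my own, and it assumes Neo4J is listening on the default http://localhost:7474 address mentioned above.

```python
import urllib.request


def neo4j_is_up(url="http://localhost:7474/", timeout=2.0):
    """Return True if something answers HTTP 200 at `url`, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False
```

After a bin/neo4j start, calling neo4j_is_up() should return True; after a bin/neo4j stop it should return False.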

Prepping Twitter

You’ll need to create a Twitter app before you can start pulling down your connections because you need the app’s credentials in order to access Twitter’s API. But don’t sweat it, this literally takes less than a minute. Just go to the Twitter developer apps page, sign in, and there will be yet another big blue button, this time labeled “Create a new application”. Click it! After filling out a really short form, checking the “I blindly agree to whatever is included in this legal contract” checkbox, entering a CAPTCHA string, and clicking the “Create your own Twitter application” button, you will indeed have your very own Twitter app. You’ll be taken to a screen that contains the details for your new app, most importantly the OAuth credentials. Initially, you won’t have the access tokens, but you can click the “Create access tokens” button at the bottom, and the next time you refresh the page (wait a few seconds) you’ll see that the access tokens are available. Keep track of the credentials here because you’ll need to refer to them soon.

Scraping Your Social Circles from Twitter

Check out my Python TwitterScraper script. Though it’s not yet the most beautiful code, it doesn’t really matter, because there’s not much here! Let’s take a moment to walk through it. The first section is where you set up Twitter and Neo4J. Naturally you’ll need to pip install the Tweepy and Py2Neo libraries, but they don’t have any weird dependencies, so this shouldn’t be a problem. Also notice, this is where all the access keys for your Twitter app should be used. Go ahead and copy and paste your credentials there. Now you should be ready to go.

The remaining code includes two functions. The first, create_or_get_node, creates or gets a node (in this case a Twitter user) from Neo4J by id_str, and if it’s creating the node for the first time, it also inserts all of the relevant user metadata into Neo4J. Also, create_or_get_node optionally takes a list of labels that will later be used to group certain users together. The second function, insert_user_with_friends, takes a Twitter user (via their screen name), pulls all relevant metadata for that user from the Twitter API, and inserts it into Neo4J. This function will then do the same thing for all the individuals that this Twitter user follows. And finally, insert_user_with_friends will establish a FOLLOWS relationship linking the source Twitter user to those that she follows. Here again, insert_user_with_friends takes an optional list of labels that can be used to group the seed nodes (those that are followed do not get labeled).
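To make the shape of those two functions concrete, here is a rough sketch that swaps Neo4J for plain Python dictionaries and swaps the Twitter API for two hypothetical callables, fetch_user and fetch_friends. This is not the code from the script, just an illustration of the logic described above. One assumption of mine: labels are also applied when the node already exists, so that a user first seen as someone’s friend can still be labeled later as a seed.

```python
# In-memory stand-ins for Neo4J; every name here is illustrative only.
nodes = {}       # id_str -> {"properties": dict, "labels": set}
follows = set()  # (follower id_str, followed id_str) pairs


def create_or_get_node(user, labels=()):
    """Return the node for `user`, creating it on first sight."""
    node = nodes.get(user["id_str"])
    if node is None:
        # First time we see this user: store all of the metadata.
        node = {"properties": dict(user), "labels": set()}
        nodes[user["id_str"]] = node
    node["labels"].update(labels)  # assumption: labels apply on get as well
    return node


def insert_user_with_friends(screen_name, fetch_user, fetch_friends, labels=()):
    """Insert a seed user, the users they follow, and FOLLOWS edges."""
    seed = fetch_user(screen_name)
    create_or_get_node(seed, labels)   # only the seed gets labels
    for friend in fetch_friends(screen_name):
        create_or_get_node(friend)     # followed users stay unlabeled
        follows.add((seed["id_str"], friend["id_str"]))


# Made-up data standing in for Twitter API responses.
fake_users = {
    "alice": {"id_str": "1", "screen_name": "alice"},
    "bob": {"id_str": "2", "screen_name": "bob"},
}
fake_friends = {"alice": ["bob"], "bob": []}

insert_user_with_friends(
    "alice",
    fetch_user=lambda name: fake_users[name],
    fetch_friends=lambda name: [fake_users[f] for f in fake_friends[name]],
    labels=["example_group"],
)
```

Keying everything on id_str is what makes the operation safe to repeat: running the same seed twice leaves one node per user and one FOLLOWS pair per relationship.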

The last bit of the script is the fun part. This is where you programmatically lay out the social networks and individuals that you want to stalk… er, uh… observe. For your convenience, I’ve added all of the OpenSource Connections team, as well as several notable individuals from the Neo4J community. I’ve also included grouping labels that I thought were pretty reasonable descriptors for these individuals and groups. As that last comment in the code states, make sure to add several people that you follow as well. Remember, the goal here is to create a data set that you are eminently familiar with. Once you’re happy with the data set, then run it: python TwitterScraper.py. It will pull down Twitter users 200 at a time and insert them into Neo4J as fast as possible. Soon the program will hit Twitter’s rate limit cutoff, at which point the script will wait until the rate limit has been lifted and then continue pulling down the rest of the data. Altogether, you can plan on getting around 200 updates per minute.
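The "pull a page, hit the limit, wait, carry on" behavior can be sketched in a few lines. Again, this is only an illustration and not the script’s actual code: RateLimited is a hypothetical exception, fetch_page is a hypothetical callable that returns one page of users plus the cursor for the next page, and the 60 second wait is an arbitrary placeholder.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal that the API refused a call due to rate limiting."""


def fetch_all_pages(fetch_page, wait_seconds=60, sleep=time.sleep):
    """Collect every page from `fetch_page`, pausing whenever rate limited.

    `fetch_page(cursor)` is a hypothetical callable returning
    (items, next_cursor); a next_cursor of None means no more pages.
    """
    items, cursor = [], 0
    while cursor is not None:
        try:
            page, next_cursor = fetch_page(cursor)
        except RateLimited:
            sleep(wait_seconds)  # wait, then retry the same page
            continue
        items.extend(page)
        cursor = next_cursor
    return items
```

Passing sleep in as a parameter is just a convenience so the loop can be exercised without actually waiting; in real use you would leave it at the default.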

Start Infiltrating the Social Network!

Now for the fun part: let’s start putting some queries together and pulling back interesting data. In all of the examples below, we will be using the default Neo4J browser, which you’ll still find at http://localhost:7474/. Here we’re using the Cypher query language. This blog post won’t go into too much detail about Cypher syntax itself, but feel free to look at the very rich Neo4J documentation. Also, I’ll be using my own Twitter screen name “JnBrymn” as an example, so feel free to replace my screen name with your own and try the queries for yourself.

First off, let’s make sure the data we’ve ingested seems reasonable. The most obvious thing to do is to make sure we’re actually in the data set:

MATCH (n {screen_name:"JnBrymn" }) 
RETURN n

Up pops an orange node representing me. And if I click on the node, I see a list of all my metadata.

[Screenshot: the Neo4J browser showing the node and its metadata]
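If you would like to run the same lookup from Python instead of the browser, something along these lines might work. Treat it as a sketch under assumptions: the endpoint URL and the {screen_name} parameter syntax are what I understand Neo4J 2.x to accept over HTTP, and newer versions may expect something different, so check the documentation for the version you installed. The function names are my own.

```python
import json
import urllib.request

# Endpoint as I understand it for Neo4J 2.x; treat it as an assumption.
CYPHER_URL = "http://localhost:7474/db/data/transaction/commit"


def build_payload(screen_name):
    """Build the JSON body for a parameterized version of the query above."""
    statement = "MATCH (n {screen_name: {screen_name}}) RETURN n"
    return {
        "statements": [
            {"statement": statement,
             "parameters": {"screen_name": screen_name}}
        ]
    }


def run_query(screen_name, url=CYPHER_URL):
    """POST the query to a running Neo4J and return the decoded response."""
    body = json.dumps(build_payload(screen_name)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Sending the screen name as a parameter rather than pasting it into the query string means you can swap in your own screen name without touching the Cypher text.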

Published at DZone with permission of John Berryman, author and DZone MVB. (source)
