
Baxter Denney is Director of Marketing at Couchbase. Couchbase is the NoSQL leader, with production deployments at AOL, Deutsche Post, NTT Docomo, Salesforce.com, Starbucks, Zynga, and hundreds of other global enterprises. Couchbase Server, our NoSQL database offering, delivers a more scalable, high-performance and cost-effective approach to data management than relational database technology. It is particularly well suited for storing the data behind web applications deployed on modern virtualized or cloud infrastructures.

Fun with Couchbase and Markov Chains

11.26.2012

I’ve been hearing about Markov chains for long enough that it was time to learn more about them and build a simple, fun Markov chain application. You probably don’t want to get bogged down in the mathematical details of Markov chains, and learning by building an application is where all the fun is!

In this post, we show how to build “Marky”, an application that uses Markov chains to generate nonsensical tweets based on your Twitter history. It uses Couchbase Server to store and process the data behind these tweets.


Marky uses Couchbase Server views to process data
Marky’s map function is:
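function (doc, meta) {
   // Only index documents that carry text in a "body" field.
   if (doc.body) {
       // Split on runs of whitespace and drop the empty strings that
       // leading or trailing whitespace would otherwise produce.
       var words = doc.body.split(/\s+/).filter(function (w) { return w !== ""; });
       if (words.length >= 1) {
           // Pair the first word with null to mark it as the start of a text.
           emit([null, words[0]], 1);
       }
       // Emit every pair of consecutive words (a sliding window of two).
       for (var i = 0; i < (words.length - 1); i++) {
           emit([words[i], words[i + 1]], 1);
       }
   }
}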

At a high level, Marky splits text into overlapping pairs of consecutive words (a sliding window of two words) and then chains these pairs back together into sentences, picking each next word at random with a probability weighted by how often that pair occurred in the input. In the end, you get some nonsensical text that is fun to read.

For example, given the input text “In this blog, we will show you how to build an application”, the map function emits the following key-value pairs:

Key                   Value

[null,"In"]           1
["In","this"]         1
["this","blog,"]      1
["blog,","we"]        1
["we","will"]         1
["will","show"]       1
["show","you"]        1
["you","how"]         1
["how","to"]          1
["to","build"]        1
["build","an"]        1
["an","application"]  1
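To check this table without a running server, the map logic can be exercised locally. The following is a minimal sketch, not part of Marky: `runMap` and `markyMap` are hypothetical names, and the map function is adapted so that `emit` is passed in as a parameter (Couchbase normally provides it as a global) and empty strings are skipped.

```javascript
// Local stand-in for the view engine: collects whatever the map function emits.
// `runMap` is an illustrative helper, not a Couchbase API.
function runMap(mapFn, doc) {
  const collected = [];
  mapFn(doc, {}, (key, value) => collected.push({ key, value }));
  return collected;
}

// Adapted copy of the map logic, with emit injected so it can run outside the server.
function markyMap(doc, meta, emit) {
  if (doc.body) {
    const words = doc.body.split(/\s+/).filter((w) => w !== "");
    if (words.length >= 1) {
      emit([null, words[0]], 1); // null marks the start of a text
    }
    for (let i = 0; i < words.length - 1; i++) {
      emit([words[i], words[i + 1]], 1); // sliding window of two words
    }
  }
}

const rows = runMap(markyMap, {
  body: "In this blog, we will show you how to build an application",
});
for (const row of rows) {
  console.log(JSON.stringify(row.key), row.value);
}
```

The input has 12 words, so the sketch prints 12 rows: one start marker plus 11 consecutive pairs.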

To generate a word, we query the view using the last word we output. For example, to get candidates for a word to follow “the”, we use the query parameters startkey=["the"]&endkey=["the",{}]&group_level=2&reduce=true
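When these parameters are sent over HTTP, the key values are JSON and have to be URL-encoded. A small sketch of how such a query string could be assembled is shown here; `nextWordQuery` is a hypothetical helper, and the host, design document, and view name that the string would be appended to are not given in this post.

```javascript
// Builds the query string for the candidates that follow `word`.
// View keys are JSON values, so each one is JSON-encoded and then URL-encoded.
function nextWordQuery(word) {
  const params = {
    startkey: JSON.stringify([word]),
    // An empty object sorts after any string in view collation,
    // so this range covers every pair whose first element is `word`.
    endkey: JSON.stringify([word, {}]),
    group_level: "2",
    reduce: "true",
  };
  return Object.keys(params)
    .map((name) => name + "=" + encodeURIComponent(params[name]))
    .join("&");
}

console.log(nextWordQuery("the"));
// startkey=%5B%22the%22%5D&endkey=%5B%22the%22%2C%7B%7D%5D&group_level=2&reduce=true
```

Passing `null` instead of a word produces the range `[null]` to `[null,{}]`, which selects the pairs that mark the start of an input text.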


This returns all the emitted word pairs that start with “the”, groups identical pairs together, and runs the view’s reduce function on each group. Marky uses the built-in _sum reduce function, which adds together the values it is given. Running this query on the database backing dkatz_ebooks yields:


Key                         Value
["the","#1"]                1
["the","100"]               1
["the","2"]                 1
["the","ability"]           3
["the","absolute"]          1
["the","answer"]            1
["the","app"]               1
["the","application"]       1
["the","area,"]             1
["the","background."]       1
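The grouping and summing behind a result like this can be imitated in memory. The sketch below is only an illustration of what group_level=2 combined with _sum does to the emitted rows; `groupAndSum` is a hypothetical name, the sample rows are made up, and unlike a real view the result is not sorted by key.

```javascript
// In-memory imitation of group_level=2 with the _sum reduce:
// rows with an identical [first, second] key are merged and their values added.
function groupAndSum(rows) {
  const totals = new Map();
  for (const { key, value } of rows) {
    const id = JSON.stringify(key); // arrays need a string form to serve as Map keys
    totals.set(id, (totals.get(id) || 0) + value);
  }
  return Array.from(totals, ([id, value]) => ({ key: JSON.parse(id), value }));
}

// Made-up sample of emitted rows: "the ability" appears three times, "the app" once.
const emitted = [
  { key: ["the", "ability"], value: 1 },
  { key: ["the", "app"], value: 1 },
  { key: ["the", "ability"], value: 1 },
  { key: ["the", "ability"], value: 1 },
];
console.log(groupAndSum(emitted));
```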

To pick the word to output after “the”, we choose at random among the words that follow it, weighting the choice by how often each word pair appears in the input. Taking the ten rows above as the full candidate list, the counts sum to 12, so “ability” has a 3/12 (25%) chance of being chosen, while each of the other words has a 1/12 (about 8.3%) chance.
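One common way to make such a weighted choice is to draw a random point in the range from 0 to the total count and walk the rows until the running total passes it. This is a sketch of that general technique, not necessarily how Marky's Clojure code does it; `pickWeighted` is a hypothetical name, it assumes a non-empty list of rows, and the random source is injectable so the result can be made repeatable.

```javascript
// Picks the second word of one row at random, with probability
// proportional to the row's value (its count). Assumes rows is non-empty.
function pickWeighted(rows, random = Math.random) {
  const total = rows.reduce((sum, row) => sum + row.value, 0);
  let threshold = random() * total; // a point somewhere in [0, total)
  for (const row of rows) {
    threshold -= row.value;
    if (threshold < 0) {
      return row.key[1];
    }
  }
  return rows[rows.length - 1].key[1]; // guard against floating-point edge cases
}

// The ten rows from the query result above; the counts sum to 12.
const candidates = [
  ["#1", 1], ["100", 1], ["2", 1], ["ability", 3], ["absolute", 1],
  ["answer", 1], ["app", 1], ["application", 1], ["area,", 1], ["background.", 1],
].map(([word, count]) => ({ key: ["the", word], value: count }));

console.log(pickWeighted(candidates));
```

With these rows, random draws between 3/12 and 6/12 land on “ability”, which is where its 25% share comes from.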

Because the map function pairs the first word of each input text with null (for example, [null, “In”] in the earlier example), we can run the same query with null to begin a new output and get words that are likely to start a thought, a tweet, or whatever our input was. We also need to do this if we get unlucky and the view query returns no candidate words. This can happen if the last word we output has only ever appeared at the end of the input texts we processed.
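Put together, the generation loop with its fallback to null might look like the sketch below. It is an illustration under stated assumptions rather than Marky's actual code: `generate` and `lookup` are hypothetical names, `lookup(word)` stands in for the view query and returns grouped rows, and the tiny word table is made up.

```javascript
// `lookup(word)` stands in for the view query: it returns grouped rows such as
// { key: [word, next], value: count }, or an empty array when nothing follows.
function generate(lookup, maxWords, random = Math.random) {
  const output = [];
  let last = null; // null asks for words that started an input text
  while (output.length < maxWords) {
    const rows = lookup(last);
    if (rows.length === 0) {
      if (last === null) break; // no starting words at all: nothing to generate
      last = null; // dead end: begin a new thought instead
      continue;
    }
    // Weighted pick: walk the rows until the running total passes a random point.
    const total = rows.reduce((sum, row) => sum + row.value, 0);
    let threshold = random() * total;
    let next = rows[rows.length - 1].key[1];
    for (const row of rows) {
      threshold -= row.value;
      if (threshold < 0) {
        next = row.key[1];
        break;
      }
    }
    output.push(next);
    last = next;
  }
  return output.join(" ");
}

// Made-up table: "chains" only ever ends a text, so it has no followers.
const table = new Map([
  [null, [{ key: [null, "fun"], value: 1 }]],
  ["fun", [{ key: ["fun", "with"], value: 1 }]],
  ["with", [{ key: ["with", "chains"], value: 1 }]],
]);
const lookup = (word) => table.get(word) || [];

console.log(generate(lookup, 5));
```

In this table every word has a single follower, so the output is always “fun with chains fun with”: after “chains” the loop finds no candidates and restarts from null.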


Marky Application

Marky uses a simple Clojure wrapper built by the community. To set up Marky, create a marky-config.clj file and point it at your Couchbase Server cluster and Twitter account. Add some seed data (Twitter user accounts or Atom feeds) and you're ready to launch the app.

{:bucket "default"
 :pass ""
 :cburl "http://localhost:8091/"
 :twitter {:app-key "XXXXXXXXX"
           :app-secret "XXXXXXXXXX"
           :user-token "XXXXXXXX"
           :user-secret "XXXXXXXX"}
 :jobs
[; :period, :after are in seconds, :ttl is in days.
 {:type :twitter :user "user-handle1" :period 3600 :ttl 60}
 {:type :twitter :user "user-handle2" :period 3600 :ttl 60}
 {:type :send-tweet :period 3600 :after 600}
 {:type :atom :url "http://some-domain/rssfeed.php" :period 86400 :ttl 60}]}

Here are some fun Marky tweets -

Want To Get Marky?

You can download the Marky source code here
You can also contribute to the Clojure wrapper project here

Have Fun!

----

Thanks to Aaron for putting together the code in Clojure.

Published at DZone with permission of its author, Baxter Denney. (source)
