Dmitriy Setrakyan manages daily operations of GridGain Systems and brings over 12 years of experience to GridGain Systems which spans all areas of application software development from design and architecture to team management and quality assurance. His experience includes architecture and leadership in development of distributed middleware platforms, financial trading systems, CRM applications, and more. Dmitriy is a DZone MVB and is not an employee of DZone and has posted 57 posts at DZone. You can read more from them at their website. View Full User Profile

Collocation: The First Rule of Distributed Programming

  • submit to reddit

I am pleased to announce that we just recently released GridGain 3.0.4. The last couple of releases have been focused, among other things, around convenient and effective collocation of computations and data, and also grouping of data that is usually accessed together on the same nodes. Sending computations exactly to the nodes where the accessed data is residing is one of the key components in achieving better scalability. Without collocation, nodes fetch various data from other nodes for brief periods of time, just to perform often a quick computation and discard it almost immediately thereafter. This creates unnecessary data traffic, a.k.a. data noise, and can at times bring a server to its knees.

In my previous blog post I showed how to collocate computations and data using direct API via GridCache.mapKeyToNode(..) method. We have also added analogous methods on Grid API to provide capability of finding data affinity on the nodes that do not cache any data themselves. In our latest 3.0.4 release we have also added a very convenient way to provide collocation via @GridCacheAffinityMapped annotation.

Say you have 2 types of objects, Person and Company. Multiple persons can work for the same company. This means that you generally may wish to access Person objects together with the Company for which they work. To do that in a scalable fashion, you may wish to ensure that all people working for the same company are cached on the same node. This way you can send computations to that node and access multiple people from the same company locally. Here is how it can be done in GridGain.

    public class PersonKey {  
// Person ID used to identify a person.
private String personId;

// Company ID which will be used for data affinity.
private String companyId;
// Instantiate person keys with same company ID.
Object personKey1 = new PersonKey("myPersonId1", "myCompanyId");
Object personKey2 = new PersonKey("myPersonId2", "myCompanyId");

// Both, the company and the person objects will be cached on the same node.
cache.put("myCompanyId", new Company(..));
cache.put(personKey1, new Person(..));
cache.put(personKey2, new Person(..));
Now, if you want to perform a computation which involves multiple people working for the same company, all you have to do is send a grid job to the node where those people are cached. Here is how you would send a computation to the node which caches all people for the company with ID "myCompanyId".

G.grid().run(GridClosureCallMode.BALANCE, new Runnable() {
// This annotation specifies that computation should be routed
// precisely to the node where all objects with affinity key
// of 'myCompanyId' are cached.
private String companyId = "myCompanyId";

@Override public void run() {
// Some computation logic here.
Now, when you properly collocate all your data within your data grid and then route your computations to the nodes where your data is cached, all cache operations become LOCAL, hence achieving best performance and scalability without any data noise. Kind of goes inline with the first rule of distributed programming, which is DO NOT DISTRIBUTE.



Published at DZone with permission of Dmitriy Setrakyan, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)