Baruch Sadogursky recently joined JFrog as the Developers Advocate following years of working alongside JFrog’s founding team. Prior to joining JFrog, Baruch was an innovations expert with the BMC Software Incubator team, after 6 years with AlphaCSP as a senior Java consultant, architect and training division manager. Baruch has been hacking around Java technologies and Continuous-Integration tools since 2001, including module development for open source projects like Gradle & Spring. Baruch is also active in community development around Artifactory, participating in the development of its plugin ecosystem and enriching its functionality with open-source user plugins. As JFrog’s Developers Advocate, Baruch contributes to the strong collaboration with leading open-source projects such as SpringSource, Grails and Gradle by providing them with the Artifactory Cloud platform, and fuels the Continuous-Integration ecosystem with open-source plugins for leading tools such as Jenkins, TeamCity & Bamboo. Baruch blogs at http://blogs.jfrog.org & blog.sadogursky.com and tweets as @jbaruch. Baruch is a DZone MVB, is not an employee of DZone, and has posted 17 posts at DZone.

Replication! What and How.

06.15.2012

Working in distributed teams isn’t easy. There are time zone differences, language and cultural differences, and… data distribution. When the data you need is far away, you are miserable. So, let’s fix it.

Bring your data home

Let’s take a binary repository, for example. Two types of binaries are stored there: your build outputs, and third-party libraries proxied from public servers. Now let’s say you are working in a multi-site environment, where teams in different locations depend on each other’s binaries. Both teams access each other’s repositories remotely, and both download their third-party libraries from public repositories (repo1, java.net, SpringSource, etc.). It looks like the other team’s repository is no different from any other remote server. But is that so? Let’s compare:

| Feature / Repository Type | Any public repo | Other team’s repo |
|---|---|---|
| Access Frequency | Low (only on adding new or updating libraries) | High (Intra-project snapshots) |
| Unneeded artifacts | Many | Almost none |

 So, bringing one artifact at a time, on demand, is fine for a public repository, but wrong for inter-project dependencies. What’s the cure? Replication.

The idea is to bring the artifacts from a remote repository before they are actually needed, assuming the time saved by ahead-of-time downloads is bigger than the cost of bringing in a few unneeded artifacts. Now each team has its local copy of the content. And look, it feels like backup (restore from the other server) and like high availability (if one is down, use the other), too! Once you do the replication, you get backup and HA for free!

There are two ways of doing so: either push the artifacts to a remote server when the repository content is changed (by event), or transfer all the recent changes as a batch once in a while. Let’s take a deeper look at both approaches.

Event-driven replication

This seems to be the natural way, doesn’t it? Once something happens, you just send the event to the other server, and voila, both servers are up-to-date almost in real time and you always transfer the bare minimum: only what was changed. But here we go with the limitations:

  • If you have an existing server, what do you do? Which events will sync all the existing data from your server to the new instance?
  • If the event transfer failed without feedback, then what? You know, the network is unreliable. How can you be sure you aren’t missing events?
  • What happens when there is a network split? For how long do events need to be queued until the other side reappears? Many snapshot artifacts may no longer be valid by this time.
  • Event-driven replication always requires pushing data to the other server. And, as you know, pushing and firewalls don’t get along so nicely.

 

Quite a bunch of problems to solve. You can do that, or look at the other type of replication – a scheduled one.
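To make the network-split problem concrete, here is a minimal Python sketch of an event queue that holds change events while the other side is unreachable. Everything in it (`ChangeEvent`, `EventQueue`, the `max_age_seconds` cutoff) is a hypothetical illustration, not how any particular product queues events. It only shows that, once the link comes back, some queued events may already be too old to be worth delivering.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    path: str          # artifact path that changed
    created_at: float  # timestamp in seconds, on the same clock as `now`


class EventQueue:
    """Hypothetical buffer for change events while the remote is unreachable."""

    def __init__(self, max_age_seconds: float):
        self.max_age_seconds = max_age_seconds
        self._pending = deque()

    def publish(self, path: str, now: float) -> None:
        """Record that `path` changed at time `now`."""
        self._pending.append(ChangeEvent(path, now))

    def drain(self, now: float):
        """Empty the queue and split events into (deliverable, expired)."""
        deliverable, expired = [], []
        while self._pending:
            event = self._pending.popleft()
            if now - event.created_at > self.max_age_seconds:
                expired.append(event)      # too old, e.g. a superseded snapshot
            else:
                deliverable.append(event)
        return deliverable, expired
```

Choosing `max_age_seconds` is exactly the open question from the list above: too short and changes are lost, too long and the queue fills up with stale snapshots.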

Scheduled replication

This one is simple. Just set up a schedule and one server will push data to the other. Can’t push? No problem. Pull then. This plain setup eliminates the problems of event-driven replication. Here’s how:

  • Adding a new server – the replication is triggered manually.
  • Transfer failed without feedback – just retry the replication.
  • Pushing and firewalls aren’t getting along so nicely – since the replication can be started on any end, pull the data from the server instead of pushing it.
 

You can think of pull replication as your regular remote proxying repository, on steroids. Instead of patiently waiting for the user to request an artifact and only then going to the remote server to find it, your server takes action and fetches the artifacts in advance to pre-populate the cache. As long as the assumptions from the table above hold, it makes sense.
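As a rough illustration of the idea, the Python sketch below runs one scheduled pull: it asks the remote side for its content, copies whatever the local cache lacks, and retries the whole run if the connection fails. The repositories are plain dictionaries and the names `pull_replicate`, `replicate_with_retry` and `fetch_listing` are assumptions made for this example. It compares artifacts by path only, to stay short.

```python
def pull_replicate(remote: dict, local_cache: dict) -> list:
    """Copy every artifact the local cache lacks; return the fetched paths."""
    fetched = []
    for path, content in remote.items():
        if path not in local_cache:
            local_cache[path] = content
            fetched.append(path)
    return fetched


def replicate_with_retry(fetch_listing, local_cache: dict, attempts: int = 3) -> list:
    """Run one scheduled replication, retrying the whole run on connection errors.

    `fetch_listing` is a hypothetical callable that returns the remote
    repository content as a dict of path -> bytes, or raises ConnectionError.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return pull_replicate(fetch_listing(), local_cache)
        except ConnectionError as error:
            # Retrying is safe: artifacts already in the cache are skipped.
            last_error = error
    raise last_error
```

Because a run only adds what is missing, repeating it after a failure costs nothing extra, which is what makes "just retry the replication" a workable answer.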

Overall, scheduled replication looks impressive. Where’s the catch? Well, there are two major flaws: 
  • The first is timing. With event-driven replication, the servers are synced almost immediately after a change. With scheduled replication, a change has to wait for the next scheduled run.
  • The second (and bigger) problem is determining what to transfer. Should we keep some kind of log to determine the deltas? Should we calculate the deltas during the replication? How? By change log? By file content? By file name? All those approaches have clear disadvantages. The log might be incompatible with changes on the other server, deltas by content won’t work on huge binaries, and deltas by file name won’t take copies into consideration. But here’s a nice trick: once you know the files’ checksums, delta calculation isn’t a problem anymore. You can decide if the file needs to be transferred to another server in no time!
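The checksum trick can be sketched in a few lines of Python. The example assumes each server can produce a map of path to checksum for its content. SHA-1 and the names `sha1_of` and `plan_replication` are choices made for this illustration only. Note how a copy is detected too: if the target already holds the same bytes under a different path, nothing has to cross the wire.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Hex SHA-1 checksum of a file's content."""
    return hashlib.sha1(content).hexdigest()


def plan_replication(source: dict, target: dict):
    """Decide what to send, given path -> checksum maps for both servers.

    Returns (to_transfer, to_copy_locally), where to_copy_locally holds
    (existing_target_path, new_path) pairs that need no network transfer.
    """
    target_by_checksum = {}
    for path, checksum in target.items():
        target_by_checksum.setdefault(checksum, path)

    to_transfer, to_copy_locally = [], []
    for path, checksum in sorted(source.items()):
        if target.get(path) == checksum:
            continue  # identical file is already in place
        if checksum in target_by_checksum:
            # Same bytes already exist (or will exist) on the target.
            to_copy_locally.append((target_by_checksum[checksum], path))
        else:
            to_transfer.append(path)
            # Later files with the same content can reuse this transfer.
            target_by_checksum[checksum] = path
    return to_transfer, to_copy_locally
```

The comparison touches only checksums, never file content, so its cost does not depend on how large the binaries are. Transfers should be applied before local copies, since a copy may refer to a file that arrives in the same run.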

So, to sum up: event-driven replication is (almost) immediate and the content to replicate is obvious. Scheduled replication supports both push and pull modes, is event-independent and firewall-friendly, but the content calculation might be a problem (unless you have checksums).

Replication in action (how we did it in Artifactory)

Fortunately, we at JFrog selected the right type of binary storage from the start. Artifactory features checksum storage. This means we know the checksums for all our files at all times, and use this data in many ways, for example to prevent local file duplications. It was only natural for us to go with scheduled replication, enjoying all its benefits without suffering from its major flaw.
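To picture how keeping checksums can prevent duplicate files, here is a small in-memory Python sketch of a store that keeps each distinct content once and lets many paths point at it. This is an illustration of the general idea under assumed names (`ChecksumStore`, `put`, `get`), not a description of Artifactory's actual implementation.

```python
import hashlib


class ChecksumStore:
    """Hypothetical store: content is kept once per checksum, paths point to it."""

    def __init__(self):
        self._blobs = {}  # checksum -> bytes, one entry per distinct content
        self._paths = {}  # path -> checksum

    def put(self, path: str, content: bytes) -> str:
        """Store `content` under `path` and return its checksum."""
        checksum = hashlib.sha1(content).hexdigest()
        self._blobs.setdefault(checksum, content)  # no second copy if known
        self._paths[path] = checksum
        return checksum

    def get(self, path: str) -> bytes:
        return self._blobs[self._paths[path]]

    def checksum_of(self, path: str) -> str:
        """Checksum lookup without reading the content again."""
        return self._paths[path]

    def blob_count(self) -> int:
        return len(self._blobs)
```

In a store shaped like this, the checksum of every path is available without rereading any file, which is the property the replication described here relies on.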

And, indeed, Artifactory now offers great replication support and includes the following features:

  • Checksum-based content transfer. Only files whose checksums don’t match are sent over the wire. It saves time and money.
  • Support for both push and pull modes. To provide a firewall-friendly connection, use pull. To enforce the artifacts on another server, use push.
  • Replication of metadata, not only artifacts. It includes all kinds of metadata (e.g. Maven, user-created properties, etc.), and it also uses the checksum logic.
  • Streaming transfer for superb performance.
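As a generic illustration of what streaming buys you (again, not Artifactory's code), the Python sketch below copies an artifact in fixed-size chunks and computes its checksum on the fly. The function name `stream_copy` and the 64 KB chunk size are assumptions made for this example.

```python
import hashlib


def stream_copy(source, destination, chunk_size: int = 64 * 1024) -> str:
    """Copy a binary stream chunk by chunk and return the SHA-1 of what was sent.

    `source` needs a read(n) method and `destination` a write(data) method,
    as files, sockets wrapped with makefile(), and io.BytesIO all provide.
    """
    digest = hashlib.sha1()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)      # checksum grows with each chunk
        destination.write(chunk)  # memory use stays at one chunk
    return digest.hexdigest()
```

The whole file is never held in memory, and the receiving side can compare the returned checksum with the expected one to confirm the transfer arrived intact.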

You can start using the replication today. Here's a simple and intuitive user guide.

What about the replication timing, you might ask? Well, once the heavy lifting of bare-bones replication is done, adding event triggers on top of it isn't a problem. We are working on it right now. Checksum-based storage makes it easy (again).

As you can see, checksum-based storage lets you enjoy the best of both worlds, scheduled replication and event-driven replication, without sacrificing bandwidth, latency, storage space, or computing power. Want to see for yourself? Download a trial of Artifactory Pro (actually, it takes two for replication), and take it for a ride.

Enjoy your build!
Published at DZone with permission of Baruch Sadogursky, author and DZone MVB.
