
Takeaways from the Kafka Talk at AirBnB: the Power of Structured Data and the Myth of “Exactly Once”


Last night, I attended Jay Kreps’s talk on Apache Kafka at AirBnB. Jay is a Principal Engineer at LinkedIn and is one of the original authors of Kafka.

The talk was packed. With almost twice as many attendees as there were seats, it was obvious Kafka is gaining serious traction among Bay Area start-ups. Two topics from the talk were especially illuminating from my perspective.

Structure Your Data

In the talk, Jay mentioned that LinkedIn’s data pipeline used to be pretty brittle: minor format changes in application code would propagate throughout the data pipeline and break the Hadoop backend. Since then, they have adopted Avro to keep all of their data structured and well-typed. Today, any code that adds data to their data pipeline goes through a schema check-in followed by a thorough code review.

Like Jay, we strongly believe in always keeping data structured (see our blog entry). Sure, JSON does not have Avro’s schematic rigor, but the similarities are much greater than the differences. Whether it is Avro, JSON, MessagePack or Protobuf, maintaining structure throughout is essential for creating a robust data pipeline.
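To make the idea of a schema check-in concrete, here is a minimal sketch in Python. This is not LinkedIn’s actual tooling, and the schema and field names are made up for illustration; the point is simply that every event is validated against a declared schema before it is allowed into the pipeline, so a format change in application code fails fast instead of breaking the backend later.

```python
# Hypothetical schema for a page-view event: field name -> expected type.
SCHEMA = {
    "user_id": int,
    "page": str,
    "ts": float,
}

def validate(event, schema):
    """Return True only if the event has exactly the declared fields,
    each with the declared type."""
    if set(event) != set(schema):
        return False          # missing or unexpected field
    return all(isinstance(event[name], typ) for name, typ in schema.items())

good = {"user_id": 42, "page": "/home", "ts": 1346630400.0}
bad = {"user_id": "42", "page": "/home", "ts": 1346630400.0}  # wrong type

assert validate(good, SCHEMA)
assert not validate(bad, SCHEMA)
```

A real deployment would of course use Avro (or JSON Schema, Protobuf, etc.) rather than a hand-rolled check, but the gatekeeping pattern is the same: reject malformed data at the producer, not in the Hadoop backend.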

The Myth of “Exactly Once”

The holy grail of messaging systems is “exactly once”, meaning that every message is always delivered (“at least once”) and never duplicated (“at most once”). And like any other holy grail, it is pretty unrealistic to achieve without major drawbacks.

While I cannot remember the exact line, Jay remarked that most systems claiming an “exactly once” guarantee come with a dubious footnote that goes something like “it is exactly once as long as consumers do not go down”. He went on to say that while exactly once semantics is not impossible (for example, with two-phase commits), it is often not worth it because it reduces performance and availability.
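The practical alternative most systems settle on is “at least once” delivery plus consumer-side deduplication. Below is a hedged, self-contained sketch of that pattern (the broker simulation and names are illustrative, not any real Kafka API): the broker may redeliver a message after a lost acknowledgment, and the consumer skips message IDs it has already seen. The footnote Jay mentioned lives in the `seen` set: unless that dedup state and the results are persisted atomically, a consumer crash between the two can still produce duplicates or losses.

```python
def deliver_at_least_once(messages):
    """Simulate a broker under 'at least once' semantics: every message
    arrives at least once, and some arrive twice (lost acks)."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id % 2 == 0:           # pretend even-numbered acks were lost,
            yield msg_id, payload     # so the broker redelivers the message

def consume(stream):
    """Idempotent consumer: drop any message ID we have already processed.
    In a real system, `seen` and `results` must be updated atomically."""
    seen = set()
    results = []
    for msg_id, payload in stream:
        if msg_id in seen:
            continue                  # duplicate delivery: skip
        seen.add(msg_id)
        results.append(payload)
    return results

msgs = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
assert consume(deliver_at_least_once(msgs)) == ["a", "b", "c", "d"]
```

This reframes “exactly once” as “at least once, processed idempotently”, which is usually a far cheaper contract than a two-phase commit across broker and consumer.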

It was refreshing to hear a leading expert in distributed systems implementation clarify the myth around exactly once semantics. As the original author of the distributed log collector Fluentd, Treasure Data also bears the responsibility of educating people about what’s feasible and realistic in the current state of distributed systems.


Published at DZone with permission of Sadayuki Furuhashi, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Mark Unknown replied on Mon, 2012/09/03 - 1:44pm

"The Myth of 'Exactly Once'" or "exactly once semantics is not impossible" - which is it? It seems what you have to do is choose between extreme speed and guaranteed single delivery. What you can live with at LinkedIn is not the same as in, let's say, banking.

Rickard Oberg replied on Mon, 2012/09/03 - 6:05pm

I thought exactly once delivery was pretty easy to implement. Just use Atom feeds. What am I missing?
