Takeaways from the Kafka Talk at AirBnB: the Power of Structured Data and the Myth of “Exactly Once”

Takeaways from the Kafka Talk at AirBnB: the Power of Structured Data and the Myth of “Exactly Once”
Last modified: May 9, 2022

Takeaways from the Kafka Talk at AirBnB: the Power of Structured Data and the Myth of “Exactly Once”

The talk was packed. With almost twice as many attendees as there were seats, it was obvious Kafka is gaining serious traction among Bay Area start-ups. Two topics from the talk were especially illuminating from my perspective.

Structure Your Data
In the talk, Jay mentioned LinkedIn’s data pipeline used to be pretty brittle, minor format changes in application code propagating throughout the data pipeline and breaking the Hadoop backend. Since then, they have adopted Avro to keep all of their data structured and well-typed. Today, any code that adds data to their data pipeline goes through a schema check-in followed by a thorough code review.

Like Jay, we strongly believe in always keeping data structured (see our blog entry). Sure, JSON does not have Avro’s schematic rigor, but similarities are much greater than differences. Whether it is Avro, JSON, MessagePack or Protobuf, maintaining structure throughout is essential for creating a robust data pipeline.

Last night, I attended Jay Kreps‘s talk on Apache Kafka at AirBnB. Jay is a Principal Engineer at LinkedIn and is one of the original authors of Kafka.

The Myth of “Exactly Once”
The holy grail of messaging systems is “exactly once”, meaning that every message is always delivered (“at least once”) and never duplicated (“at most once”). And just like any other thing “holy grail”, it’s pretty unrealistic without major drawbacks.

While I cannot remember the exact line, Jay remarked how most systems that boast to have an “exactly once” guarantee come with a dubious footnote that goes something like “it is exactly once as long as consumers do not go down”. He went on to say that while exactly once semantics is not impossible (for example, with two-phase commits), it is not often worth it because it results in reduced performance and availability.

It was refreshing to hear a leading expert in implementation of distributed systems clarify the myth around exactly once semantics. As the original author of the distributed log collector Fluentd, Treasure Data also bears the responsibility of educating people what’s feasible and realistic in the current state of distributed systems.

Get Treasure Data blogs, news, use cases, and platform capabilities.

Thank you for subscribing to our blog!

Kiyoto Tamura

Kiyoto began his career in quantitative finance before making a transition into the startup world. A math nerd turned software engineer turned developer marketer, he enjoys postmodern literature, statistics, and a good cup of coffee.