Takeaways from the Kafka Talk at AirBnB: the Power of Structured Data and the Myth of “Exactly Once”
The talk was packed. With almost twice as many attendees as there were seats, it was obvious Kafka is gaining serious traction among Bay Area start-ups. Two topics from the talk were especially illuminating from my perspective.
Structure Your Data
In the talk, Jay mentioned LinkedIn’s data pipeline used to be pretty brittle, minor format changes in application code propagating throughout the data pipeline and breaking the Hadoop backend. Since then, they have adopted Avro to keep all of their data structured and well-typed. Today, any code that adds data to their data pipeline goes through a schema check-in followed by a thorough code review.
Like Jay, we strongly believe in always keeping data structured (see our blog entry). Sure, JSON does not have Avro’s schematic rigor, but similarities are much greater than differences. Whether it is Avro, JSON, MessagePack or Protobuf, maintaining structure throughout is essential for creating a robust data pipeline.
The Myth of “Exactly Once”
The holy grail of messaging systems is “exactly once”, meaning that every message is always delivered (“at least once”) and never duplicated (“at most once”). And just like any other thing “holy grail”, it’s pretty unrealistic without major drawbacks.
While I cannot remember the exact line, Jay remarked how most systems that boast to have an “exactly once” guarantee come with a dubious footnote that goes something like “it is exactly once as long as consumers do not go down”. He went on to say that while exactly once semantics is not impossible (for example, with two-phase commits), it is not often worth it because it results in reduced performance and availability.
It was refreshing to hear a leading expert in implementation of distributed systems clarify the myth around exactly once semantics. As the original author of the distributed log collector Fluentd, Treasure Data also bears the responsibility of educating people what’s feasible and realistic in the current state of distributed systems.