The talk was packed. With almost twice as many attendees as there were seats, it was obvious Kafka is gaining serious traction among Bay Area start-ups. Two topics from the talk were especially illuminating from my perspective.
Structure Your Data
In the talk, Jay mentioned that LinkedIn’s data pipeline used to be pretty brittle: minor format changes in application code would propagate throughout the pipeline and break the Hadoop backend. Since then, they have adopted Avro to keep all of their data structured and well-typed. Today, any code that adds data to their data pipeline goes through a schema check-in followed by a thorough code review.
Like Jay, we strongly believe in always keeping data structured (see our blog entry). Sure, JSON does not have Avro’s schema rigor, but the similarities far outweigh the differences. Whether it is Avro, JSON, MessagePack or Protobuf, maintaining structure throughout is essential for creating a robust data pipeline.
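To make "maintaining structure" a bit more concrete, here is a minimal sketch of a schema check applied to JSON records before they enter a pipeline. It uses only the Python standard library. The schema format, the field names, and the `validate` helper are all hypothetical, invented for illustration; this is not Avro's API and not how LinkedIn's check-in process works.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
PAGE_VIEW_SCHEMA = {"user_id": int, "url": str, "timestamp": float}


def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = json.loads('{"user_id": 42, "url": "/home", "timestamp": 1371000000.0}')
bad = json.loads('{"user_id": "42", "url": "/home"}')

print(validate(good, PAGE_VIEW_SCHEMA))
print(validate(bad, PAGE_VIEW_SCHEMA))
```

The second record is the kind of "minor format change" that tends to break things downstream: an integer quietly became a string and a field went missing. Rejecting it at the point of entry is much cheaper than discovering it in a batch job later.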
The Myth of “Exactly Once”
The holy grail of messaging systems is “exactly once”, meaning that every message is always delivered (“at least once”) and never duplicated (“at most once”). And just like any other holy grail, it’s pretty unrealistic to attain without major drawbacks.
While I cannot remember the exact line, Jay remarked that most systems that claim an “exactly once” guarantee come with a dubious footnote that goes something like “it is exactly once as long as consumers do not go down”. He went on to say that while exactly once semantics is not impossible (for example, with two-phase commits), it is often not worth it because it results in reduced performance and availability.
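The difference between these guarantees is easier to see in a toy simulation. The sketch below is my own illustration, not something shown in the talk: a sender that redelivers whenever an acknowledgement is lost (at least once), paired with a consumer that skips message IDs it has already seen. The class and function names, and the idea of simulating lost acks with a set, are assumptions made for the example.

```python
class IdempotentConsumer:
    """Applies each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self):
        # Held in memory only: if this process goes down, the set is lost.
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        if message_id in self.seen_ids:
            return False  # duplicate delivery, skip it
        self.seen_ids.add(message_id)
        self.total += amount
        return True


def deliver_at_least_once(messages, consumer, lost_acks):
    """Send every message, and resend once if its ack was (simulated as) lost."""
    deliveries = 0
    for message_id, amount in messages:
        consumer.handle(message_id, amount)
        deliveries += 1
        if message_id in lost_acks:
            # The sender never saw the ack, so it sends the message again.
            consumer.handle(message_id, amount)
            deliveries += 1
    return deliveries


messages = [("m1", 10), ("m2", 20), ("m3", 30)]
consumer = IdempotentConsumer()
deliveries = deliver_at_least_once(messages, consumer, lost_acks={"m2"})

print(deliveries)      # 4 deliveries for 3 messages
print(consumer.total)  # 60, the duplicate of m2 was ignored
```

Note that this sketch carries exactly the kind of footnote Jay was talking about. The deduplication state lives in memory, so the result is only "exactly once" as long as the consumer does not go down. Making that state durable and updating it atomically with the side effect is where the performance and availability costs come in.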
It was refreshing to hear a leading expert in the implementation of distributed systems dispel the myth around exactly once semantics. As the original author of the distributed log collector Fluentd, Treasure Data also bears the responsibility of educating people about what’s feasible and realistic in the current state of distributed systems.