It’s not about the size of your data, but since we’re all counting…

Maybe you’re new to the data world. In which case: tl;dr data size doesn’t matter but we still love to compare.

If you’re still chuckling, then perhaps you’ve noticed the proliferating number of #humblebrags on Hacker News about the size of someone’s data pipe. The headline usually goes something like this:

“How I ingested a $#*% ton of data and lived to blog about it”

Examples include Yelp’s latest article on a their 1 billion messages pipeline. Gold star Yelp, thanks for sharing! This seems like a pretty fat pipe, until you remember that a message is essentially just another kind of event — albeit with a slightly larger payload — which the good trolls of H/N were quick to point out. There are plenty of IoT networks that generate that amount of traffic in a single hour! (Ask us how we know.)

Segment recently put up a job posting for a Infrastructure Engineering position, and in it they linked to the following diagram saying “we ingest somewhere near the total number of Snapchats and Instagrams combined.”

Cool story Segment! That clocks in at about 1.2B events per day. At Treasure Data, we have customers doing that volume every hour! In fact, a few weeks ago we just passed 100 billion events per day, all ingested into the single largest multi-tenant Hadoop cluster in the world. Since then, we’ve grown even further: blowing past 1.5 million events per second last Friday. And that’s not even counting the volume handled by Fluentd — our open source pipeline technology deployed at companies like Atlassian, Microsoft, and thousands of companies running on AWS, Google Cloud Platform and Microsoft Azure.

I guess you could say we’re pretty serious about our data volumes too. But since the goal of all this data is to glean insights, what can we take away here?

  1. Everyone else’s data is smaller than you think

    In the title, I alluded to the size of your data not mattering. Something something about “it’s the way you use it” and such. This seems to be even more true than we expect. When Gartner loudly dropped the term “Big Data” from its hype curve last summer, it would seem they were, for once, ahead of a trend. You don’t need a large amount of data. You just need the right data, in the right places at the right time.

    If Segment, the darling of the point-to-point integration world, is processing just 1.2 billion events per day across hundreds of accounts, it would seem that people are finding a lot of value from some fairly small data sets — as long as they show up consistently in the right tools without engineering work.

    (Or it could be that Segment gets expensive at volume — $10,000 for every 1M monthly users — prompting fast growing startups to scramble for open source alternatives.)

  2. You don’t need big data to get big results

    There are applications for Big Data, and there will be many more in the always-kind-of-around-the-corner IoT future. But you don’t need big data volumes to get big insights. In fact, some of your most useful data is probably so small it’s hidden in plain sight. It’s the thousands of contacts in your marketing automation system, which, if you could only join against your Mixpanel product usage data, might actually tell you where your best users are coming from.

    Or it could be locked in your CRM silo, begging to be matched against 3rd party data to build look-a-like models and dramatically improve your PPC ROI. Or maybe it’s just pulling ERP data together with product analytics so you can see your margins broken down account by account. In all these examples, the utility lies not in size of the data, but in the value that’s unlocked when you bring multiple systems of record together on one analytics platform.

So fear not, ye of considerably smaller data! If you can bring it all together with the right tools, you too can unlock game changing insights for your business. You don’t need big data expertise, or a fancy engineering degree. You just need a tool that’s built from the ground up for business analytics, with friendly connectors into popular SaaS tools and powerful processing engines to run all of your cross-silo analytics. So stop enviously lurking on H/N and get your own data into the action today.

Request a demo
Paul Lacey
Paul Lacey
Paul Lacey is equal parts entrepreneur and engineer. He manages Product Marketing at Treasure Data, after several years at a social media startup that he co-founded himself. He loves big data, ruby-on-rails, and any blogging platform other than WordPress.
In Case You Missed It