4 Ways AdTech Uses Treasure Data’s Open Source Software
Like many cloud services, Treasure Data has been a huge beneficiary of open source software. For example, we have adopted and modified Hive and Presto for our distributed processing back end and made several key contributions back to Presto. We run dozens of Chef recipes every day, and Rails and Angular.js power our console.
At the same time, we make a point of open-sourcing our software where it makes sense. Because of Treasure Data’s scalability requirements (we ingest approximately 40 billion events for our customers every day), our open source projects are designed for performance and have been taken up by industries where performance is crucial.
One such area is adtech and web services, where advertising drives revenue. To showcase how and where our open source projects are used, I want to share four of our projects along with their use cases in adtech.
Fluentd at Intent Media
Fluentd is a data collector that helps data engineers build a unified logging layer and improve their data pipelines. Open sourced in 2011, it has been adopted on Google Cloud Platform and is used at companies like Nintendo and Change.org.
Recently, Intent Media blogged about how they use Fluentd to unify in-stream metric calculation, log search, and archiving for their nginx access logs.
MessagePack at Pinterest
You might not think of Pinterest as an adtech company, but their “promoted pins” are a powerful vehicle for intent-based advertising. And guess what, Pinterest has been a long-time user of our open source project MessagePack. MessagePack is like JSON, but it is fast and has a small resource footprint. Originally conceived by Treasure Data co-founder and lead engineer Sada Furuhashi, MessagePack is used widely as a data exchange format in performance-sensitive scenarios.
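To make the "like JSON, but smaller" point concrete, here is a minimal sketch that hand-encodes a tiny subset of the MessagePack format in plain Python and compares the result with compact JSON. The `pack` function is a hypothetical helper written for this illustration, not the official MessagePack library, and it only handles nil, booleans, small non-negative integers, short strings, and short lists and maps.

```python
import json
import struct


def pack(obj):
    """Encode a small subset of Python values as MessagePack bytes.

    Illustrative only: covers nil, bool, unsigned ints below 2**32,
    strings up to 255 UTF-8 bytes, and lists/dicts of up to 15 items.
    """
    if obj is None:
        return b"\xc0"
    # bool must be checked before int because bool is a subclass of int.
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                      # positive fixint
        if 0 <= obj <= 0xFF:
            return b"\xcc" + bytes([obj])            # uint 8
        if 0 <= obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16
        if 0 <= obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)  # uint 32
        raise ValueError("integer out of range for this sketch")
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw    # fixstr
        if len(raw) <= 0xFF:
            return b"\xd9" + bytes([len(raw)]) + raw  # str 8
        raise ValueError("string too long for this sketch")
    if isinstance(obj, list) and len(obj) <= 15:
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)  # fixarray
    if isinstance(obj, dict) and len(obj) <= 15:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body       # fixmap
    raise TypeError("unsupported value for this sketch: %r" % (obj,))


event = {"compact": True, "schema": 0}
as_json = json.dumps(event, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(event)
print(len(as_json), "bytes as JSON")
print(len(as_msgpack), "bytes as MessagePack")
```

For this small map, the compact JSON encoding takes 27 bytes and the MessagePack encoding takes 18, because keys and values carry one-byte type-and-length prefixes instead of quotes, colons, and braces. In practice you would use one of the maintained MessagePack libraries rather than a hand-rolled encoder like this one.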
Hivemall
Hivemall is a suite of scalable machine learning algorithms implemented as Hive UDAFs/UDTFs. It was conceived by our research engineer Makoto Yui to make ML algorithms more accessible to SQL practitioners while taking advantage of MapReduce’s scalability.
Hivemall has a small (for now!) but passionate fan base of people in adtech who use it to optimize their bidding algorithms. We wrote a blog post about how to use Hivemall to optimize click-through rates (CTR) with real-time data.
Embulk
Embulk is our newest open source project. It makes it easy to move massive amounts of data across systems (such as S3, FTP servers, MySQL, and PostgreSQL). In adtech, it’s often necessary to acquire data from third-party data providers to enrich your own data. This data is usually made accessible via clunky APIs or CSVs on FTP servers, and Embulk is an effective tool to acquire and parse such data and load it into your storage systems.
Treasure Data and AdTech
Thanks to our open source projects, several adtech companies, including MobFox and JustPremium, have become Treasure Data customers. If you are looking to quickly build a high-performance data pipeline, give us a shout.