Treasure Data’s Plazma: Columnar Cloud Storage

Treasure Data was built by Hadoop experts. We understand Hadoop, and in many ways it is part of our core. As we built out the platform, we found that the storage layer needed to be multi-tenant, elastic, and easy to manage without giving up scalability or efficiency. This led us to create Plazma, our own distributed columnar storage system, in place of HDFS. We wanted to take full advantage of the “store everything now, analyze later” model of our schema-less architecture while improving both storage efficiency and query performance.

By separating Hadoop’s MapReduce processing engine from the storage layer, we were able to optimize the elasticity, efficiency, and reliability of the system. Making the system more modular also let us store data in a columnar format, so a query reads only the columns it needs instead of scanning the whole dataset. With Plazma we process queries faster, manage databases more simply, and make better use of our schema-less architecture.
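To make the columnar idea concrete, here is a minimal Python sketch of pivoting schema-less JSON rows into columns. It is an illustration only, not Plazma’s actual format or code: the function name `rows_to_columns`, the sample records, and the use of `None` for missing fields are all assumptions made for this example.

```python
import json
from collections import defaultdict


def rows_to_columns(json_lines):
    """Pivot schema-less JSON records into one list per field name.

    Hypothetical helper for illustration; fields missing from a
    record are stored as None so every column has the same length.
    """
    records = [json.loads(line) for line in json_lines]
    columns = defaultdict(lambda: [None] * len(records))
    for i, record in enumerate(records):
        for key, value in record.items():
            columns[key][i] = value
    return dict(columns)


# Made-up sample events; the second record carries an extra field.
rows = [
    '{"user": "a", "path": "/home", "ms": 120}',
    '{"user": "b", "path": "/cart", "ms": 340, "ref": "ad"}',
    '{"user": "a", "path": "/pay", "ms": 95}',
]

columns = rows_to_columns(rows)

# An aggregate over one field touches only that column's values;
# the "user", "path", and "ref" columns are never read.
total_ms = sum(columns["ms"])
```

In a real columnar store each column would live in its own compressed region on disk, so skipping a column also means skipping its IO. The in-memory lists here only show the layout.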

We achieved our technical goals by architecting Plazma in the following ways:

  • JSON processing: automatically converts row-based JSON objects into a columnar format
  • Columnar storage: uses a columnar file storage format which significantly reduces disk IO for analytical queries
  • IO optimizations: implements various IO optimizations such as parallel pre-fetch and background decompression
  • Scalability and ease of management: builds on object-based storage, which is easier to scale and maintain
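The IO optimizations in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Plazma’s implementation: `OBJECT_STORE` is a hypothetical in-memory stand-in for an object store, the chunk names are made up, and zlib stands in for whatever compression the real system uses.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an object store: key -> compressed bytes.
OBJECT_STORE = {
    f"chunk-{i}": zlib.compress(f"column data {i}".encode() * 100)
    for i in range(4)
}


def fetch(key):
    """Simulate fetching one compressed column chunk by key."""
    return OBJECT_STORE[key]


def fetch_and_decompress(key):
    """Fetch a chunk and decompress it inside the worker thread."""
    return zlib.decompress(fetch(key))


def read_chunks(keys, workers=4):
    """Fetch and decompress chunks concurrently, preserving key order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits every key up front, so later chunks are fetched
        # and decompressed in the background while earlier ones finish.
        return list(pool.map(fetch_and_decompress, keys))


chunks = read_chunks(sorted(OBJECT_STORE))
```

Against a real object store, the threads would spend most of their time waiting on network reads, which is where overlapping the requests pays off.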

These are some of the key innovations in Plazma that optimize query processing and storage and give us a more efficient distributed storage system. Some companies argue that building on HDFS lets their business take advantage of open source innovation, which they consider preferable to on-premise solutions. For our purposes, however, Plazma is much more efficient at query processing, and separating the processing and storage layers lets us tune each for performance and manageability.

While this technology is currently proprietary to Treasure Data, we have discussed open sourcing it to provide developers with the tools they need for efficient distributed storage systems meant for big data analytics processing.

What do you think? Would you find this kind of technology useful and would you be interested in using it? Leave your thoughts in the comments.
