12 Open Source Software Innovations from Treasure Data Engineers
Treasure Data is proud to have some of the best technical minds in the world working on our unique managed service. When they’re not working on the Treasure Data Service or supporting our customers, many of our engineers continue to support technological innovation by contributing to open source projects.
I’ve been an open source committer since the age of 18, and I learned programming from the open source community. Quite naturally, I’m a big believer in open source software. Our entire engineering team loves it, too, and is responsible for many open source software advances. I want to introduce you to 12 open source software inventions by Treasure Data engineers.
1) Fluentd
Sada developed Fluentd, an open source data collector for a unified logging layer. Fluentd is now used by a variety of companies, including Nintendo, LINE and PPTV. Treasure Data sponsors this project, and Masa is now maintaining the software.
2) MessagePack (Ruby, Java, C/C++, D)
Sada also developed MessagePack, an efficient binary serialization format. Like JSON, it lets you exchange data among multiple languages, but it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves. Companies such as Pinterest use it. Muga and Mitsu contributed quite a bit to MessagePack-Java. Masa developed MessagePack-D, and eventually joined Treasure Data as a backend engineer.
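To make the size claim concrete, here is a minimal hand-rolled sketch of just those two cases, based on the "positive fixint" and "fixstr" forms in the MessagePack format. The function names `pack_small_int` and `pack_short_str` are made up for this illustration and are not part of any MessagePack library; a real application would use one of the official implementations instead.

```python
import json


def pack_small_int(n: int) -> bytes:
    """Positive fixint: an integer in 0..127 is stored as that single byte."""
    if not 0 <= n <= 0x7F:
        raise ValueError("only 0..127 fit the single-byte positive fixint form")
    return bytes([n])


def pack_short_str(s: str) -> bytes:
    """fixstr: one header byte (0xA0 | length) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    if len(data) > 31:
        raise ValueError("only strings of up to 31 bytes fit the fixstr form")
    return bytes([0xA0 | len(data)]) + data


if __name__ == "__main__":
    # Compare the encoded size against the JSON text for the same value.
    samples = [(100, pack_small_int(100)), ("fluentd", pack_short_str("fluentd"))]
    for value, packed in samples:
        as_json = json.dumps(value).encode("utf-8")
        print(f"{value!r}: MessagePack {len(packed)} bytes, JSON {len(as_json)} bytes")
```

Larger integers and longer strings use other, wider forms that this sketch deliberately leaves out.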
3) snappy-java
Leo developed snappy-java, a Java library that wraps Snappy, Google’s fast compressor/decompressor written in C++. This popular library is downloaded more than 100,000 times a month from Maven Central. Recently, Apache Spark adopted snappy-java to optimize the performance and memory usage of its shuffle stages.
4) httpclient
Nahi developed a Ruby library called httpclient. His presentation “Ruby HTTP clients comparison” at RubyConf 2012 describes the differences between multiple HTTP client libraries. Treasure Data’s Ruby client library, of course, uses his library :). He’s also a contributor to the Ruby and JRuby projects.
5) GNU COBOL (OpenCOBOL)
COBOL?!? Yes, COBOL, the programming language. Keisuke designed and led the development of the open-source COBOL compiler. This code base has now become the basis of the GNU COBOL project. The history is described on Wikipedia.
6) GNU Guile 2.0’s Virtual Machine
Not only did he design OpenCOBOL, Keisuke also contributed a lot to Scheme. He wrote the first version of the virtual machine (VM) for GNU Guile 2.0, GNU’s Scheme implementation. His code base was revived some 8 to 10 years later, and was finally merged with help from committers.
7) JeroMQ
Min developed JeroMQ, a pure Java port of the ZeroMQ library, which is written in C and C++. JeroMQ is faster than jzmq, which is a JNI-based implementation.
8) angular-server-form
Cesar and Jake developed Treasure Data’s internal library for server-side form validation with AngularJS, and published it as ‘angular-server-form’. It provides a directive and a service that simplify server-side validation and automatically propagate server-side errors to your forms.
9) Huahin Framework
Ryu developed Huahin Framework, a framework to simplify the development and management of Hadoop MapReduce jobs. Huahin Manager provides a REST API proxy for interacting with the MapReduce system.
10) pyenv
Yuu developed pyenv, which lets you easily switch between multiple versions of Python. It’s simple, unobtrusive and follows the UNIX tradition of single-purpose tools that do one thing well.
11) sbt-pack
Leo also developed sbt-pack, an sbt plugin for creating a distributable package of a Scala project. It collects all of the dependent libraries (jar files) and generates launch scripts that can run your Scala programs with ease. Here is a comment from Stack Overflow: ‘sbt-pack is the cleanest way to pack the project and dependencies’.
12) unicorn-worker-killer
I maintain the Ruby gem called unicorn-worker-killer. Unicorn is a widely used HTTP server for Ruby Rack applications. One thing we thought Unicorn lacked was the ability to kill Unicorn workers based on the number of requests and consumed memory. This gem was originally inspired by hotchpotch’s work on safely killing Unicorn workers by using a mechanism similar to Unicorn’s out-of-band GC. After fixing some problems and running the code in production for a while, I published it as a gem.
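The idea of retiring a worker by request count or memory can be sketched in a few lines. This is a language-neutral illustration written in Python, not the gem’s Ruby code or API: the class `WorkerLimits`, its parameters and the randomized request limit are assumptions made for the example, and the memory reading is simply passed in by the caller.

```python
import random


class WorkerLimits:
    """Hypothetical per-worker limits; illustrative only, not the gem's API."""

    def __init__(self, max_requests_min, max_requests_max, memory_limit_bytes, rng=None):
        rng = rng or random.Random()
        # Assumption of this sketch: picking the limit at random within a range
        # keeps all workers from hitting it, and restarting, at the same moment.
        self.max_requests = rng.randint(max_requests_min, max_requests_max)
        self.memory_limit_bytes = memory_limit_bytes
        self.handled = 0

    def should_restart(self, rss_bytes):
        """Call once after each request; True means the worker should exit gracefully."""
        self.handled += 1
        too_many_requests = self.handled >= self.max_requests
        too_much_memory = rss_bytes > self.memory_limit_bytes
        return too_many_requests or too_much_memory
```

Checking only after a request has finished means a worker is never stopped in the middle of serving a response, and the master process can then start a fresh worker in its place.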
In addition, we’re contributing to other projects, including Hadoop, Presto, Hive, JBoss Javassist, MongoDB, Ruby and JRuby. If you’re interested in working with us, please let us know!