Fluentd: The Missing Log Collector Software

We’d like to introduce you to Fluentd, an open-source log collector developed at Treasure Data, Inc.

The Problems

The fundamental problem with logs is that they are usually stored in files, even though they are best represented as streams (as Adam Wiggins, CTO at Heroku, has argued). Traditionally, they have been dumped into text files and collected hourly or daily with rsync. With today’s web and mobile applications, this creates two problems.

Problem 1: The need for Ad Hoc Parsing

Text-based logs each have their own format, so an analytics engineer must write a dedicated parser for every format. That is probably not the best use of their time: they should be analyzing data to make better business decisions, not writing one parser after another.

Problem 2: Lack of Freshness

The logs lag: hourly or daily collection means analysis always runs behind reality. Real-time analysis of user behavior makes feature iteration much faster, and more nimble A/B testing helps you differentiate your service from competitors.

This is where Fluentd comes in. We believe Fluentd solves the problems of scalable log collection by getting rid of files and turning logs into true semi-structured data streams.

What is Fluentd?

The best way to describe Fluentd is ‘syslogd that understands JSON’. Its notable features are:

  • Easy installation via rpm/deb/gem
  • Small footprint: about 3,000 lines of Ruby
  • Semi-structured data logging
  • Easy start with a small configuration (see the sketch after this list)
  • Fully pluggable architecture, with plugins distributed as Ruby gems
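
As a taste of how small that configuration can be, here is a minimal sketch of a fluent.conf; the port, tag, and plugin choices are illustrative, using the classic type parameter syntax:

    # Accept records posted as JSON over HTTP.
    <source>
      type http
      port 8888
    </source>

    # Print every event whose tag starts with 'myapp.' to standard output.
    <match myapp.**>
      type stdout
    </match>

With this running, posting json={"action":"login"} to http://localhost:8888/myapp.access emits the event to stdout.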

Other similar systems are Facebook’s Scribe and Cloudera’s Flume; the differences among Scribe, Flume, and Fluentd can be summarized as follows. (Note: I don’t know much about the next-generation Flume NG branch, but a big movement is happening around Flume!)

Of course, there are pros and cons. Fluentd chooses maximum extensibility and flexibility through Ruby’s ecosystem, while Scribe chooses raw performance (although Fluentd is quite fast too: it can handle 18,000 msgs/sec per core). Flume is powered by Java and therefore integrates natively with many enterprise systems. The following sections describe the basic concepts of Fluentd in more detail.

LogEntry = time + tag + record

Unlike a traditional raw-text log, a Fluentd log entry consists of three parts: time, tag, and record.

  • The time is the UNIX timestamp at which the log is posted.
  • The tag is used to route the message during log forwarding, which is described later.
  • The record is represented as JSON, not raw text.

The record is intentionally represented as JSON: Fluentd is designed to collect semi-structured data, not unstructured data. This means no parsing is required later in the analysis pipeline, and the data is both easier to handle and faster to process than ad-hoc regexps. The trade-off is that the application needs to use a logging library for Fluentd.
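
For example, here is a minimal sketch using the fluent-logger Ruby gem; the host, port, tag, and record fields are illustrative:

    # gem install fluent-logger
    require 'fluent-logger'

    # Connect to a local fluentd instance (24224 is the default forward port).
    Fluent::Logger::FluentLogger.open(nil, :host => 'localhost', :port => 24224)

    # Post one event: the tag routes it, the hash becomes the JSON record,
    # and the time is attached automatically.
    Fluent::Logger.post('myapp.access', {'agent' => 'Mozilla/5.0', 'status' => 200})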

Internal Architecture: Input -> Buffer -> Output

Fluentd consists of three basic components: Input, Buffer, and Output. The basic behavior is to 1) receive logs through an Input, 2) buffer them, and 3) forward them to an Output.

Input

Input is where logs come in, and users can extend it to feed events from various sources. Officially supported Input examples include HTTP+JSON, file tailing (an Apache log parser is included), and syslog. Of course, you can add your own Input by writing a Ruby plugin.
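
For instance, a sketch of a tail Input that follows an Apache access log; the path and tag are illustrative:

    <source>
      type tail
      path /var/log/apache2/access.log   # file to follow
      format apache                      # built-in Apache log parser
      tag apache.access                  # tag attached to each parsed event
    </source>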

Buffer

Buffer exists for reliability: when the Output fails, events are kept in the Buffer and delivery is automatically retried. Memory and file Buffers are currently supported.

Output

Buffer creates chunks of logs and passes them to the Output, which stores or forwards them. The Buffer waits from several seconds up to a minute to build each chunk, which is very efficient when writing to storage that supports batch-style importing.
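
As a sketch, buffering is tuned with parameters on the output side; the names below follow the classic buffered-output options, and the values are illustrative:

    <match myapp.**>
      type file
      path /var/log/fluent/myapp

      buffer_type file                   # keep pending chunks on disk
      buffer_path /var/log/fluent/myapp.*.buffer
      flush_interval 60s                 # wait up to a minute per chunk
    </match>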

Many Input/Output plugins are under heavy development in the community: MongoDB, Redis, CouchDB, Amazon S3, Amazon SQS, Scribe, 0MQ, AMQP, Delayed, Growl, etc.

Log Forwarding

Fluentd works well on a single node, but it also supports multi-node configurations.

Each application server runs a local Fluentd, which forwards its logs to another Fluentd that aggregates all the logs in one place. The tag is used to determine the destination Fluentd (configured statically in the config files).
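
A sketch of that two-tier setup (the hostname is illustrative):

    # On each application server: send all local events to the aggregator.
    <match **>
      type forward
      <server>
        host log-aggregator.example.com
        port 24224
      </server>
    </match>

    # On the aggregator: accept forwarded events on the standard port.
    <source>
      type forward
      port 24224
    </source>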

Conclusion

Fluentd makes real-time log collection dead simple. Among the possible solutions, we believe Fluentd is the easiest to install, configure, and extend, while still performing well.

Of course, it’s an early-stage product compared to Scribe and Flume, but we already have users aggregating tens of millions of log entries per day with Fluentd. The number of committers and plugins is increasing every day.

If you want to learn more, visit the Fluentd website at fluentd.org.
