High Performance SQL: AWS Graviton2 Benchmarks with Presto and Treasure Data CDP

Last modified: March 4, 2022


In December, AWS announced new Amazon EC2 M6g, C6g, and R6g instance types powered by Arm-based AWS Graviton2 processors. Graviton2 is the second Arm-based processor designed by AWS, following the first AWS Graviton processor introduced in 2018. According to AWS, Graviton2-based M6g instances deliver up to 40 percent better price/performance than the current generation of M5 instance types. In this post, we’ll explore the question: can we achieve that price/performance for real-world applications?

How Could the Graviton2 M6g Change the Performance of Data Applications?

Treasure Data operates a customer data platform (CDP) running on top of AWS. Its data-heavy workloads run every day, supported by open-source middleware such as Presto. To evaluate the new M6g instances, we need a workload environment close to our real-world use case, so we evaluated M6g with a CPU-intensive workload by running Presto in Docker containers.

The same experiment was already performed for the first generation AWS Graviton processor by the Presto community. With the announcement of AWS Graviton2, we’re excited to revisit the experiment using M6g to see what it can do.

How We Tested: Enabling Presto Docker Image Support for aarch64

Since Presto is a Java application, it is not difficult to run it on multiple platforms, including the aarch64 architecture. It is also easy to build an image that supports Arm using the Docker buildx toolkit. We used an image containing Presto 327-SNAPSHOT with a patch that removes Presto's system requirement verification (documented here). The Docker images used in this benchmark experiment are available on Docker Hub here:

lewuathe/presto-coordinator:327-SNAPSHOT-aarch64
lewuathe/presto-worker:327-SNAPSHOT-aarch64
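As a rough illustration of the buildx step mentioned above, an Arm image can be built along the following lines. This is only a sketch: the helper name `build_arm64_image`, the tag, and the build context directory are assumptions for illustration, and the exact command used to build the images listed above is not shown in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original setup): build a linux/arm64 image
# from the Dockerfile in DIR and load it into the local Docker image store.
# Building on an x86 host additionally requires QEMU emulation to be set up;
# on an Arm host such as M6g the build runs natively.
build_arm64_image() {
  local tag=$1 dir=${2:-.}
  docker buildx build --platform linux/arm64 --tag "$tag" --load "$dir"
}
```

For example, `build_arm64_image example/presto-worker:aarch64 ./presto-worker` would build the Dockerfile in a (hypothetical) `./presto-worker` directory.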

For simplicity, we use docker-compose to launch a multi-node Presto cluster on a single instance. This is sufficient to compare a CPU-intensive workload across instance types. We set up the docker-compose environment on the instance as follows.

# Install Docker
$ sudo yum update -y
$ sudo amazon-linux-extras install docker -y
$ sudo service docker start
$ sudo usermod -a -G docker ec2-user


# Install docker-compose
$ sudo yum install python2-pip gcc libffi-devel openssl-devel -y
$ sudo pip install -U docker-compose

Benchmark Specifications

We use the TPC-H connector of Presto to measure performance. Here are the specifications needed to reproduce the experiment:

One coordinator plus two worker processes run by docker-compose on a single instance.
We compare AWS Graviton2 performance using m6g.4xlarge and the equivalently sized current generation m5.4xlarge.
We run q01, q10, q18, and q20 on the TPCH connector. Since the Presto TPCH connector does not access external storage, we can measure pure CPU performance without worrying about network variance. Here are the characteristics of each query, to help readers interpret the performance results:
- q01: Simple aggregation using SUM, AVG and COUNT
- q10: Aggregation plus Sort
- q18: SemiJoin with IN operator
- q20: Nested Subquery
We choose tiny and sf1 as the scaling factors of the TPCH connector.
In each case, our experiment performs five warm-up queries and then measures the average runtime of five further queries to determine the performance.
We use OpenJDK 11 provided by arm64v8/openjdk:11.
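The warm-up-then-measure procedure in the list above can be sketched as a small shell helper. This is a minimal sketch, not the harness used for the published numbers: `bench` and `now_ms` are hypothetical names, and it measures wall-clock time of the whole command (including any client start-up), which may differ from how the original runtimes were collected.

```shell
#!/usr/bin/env bash
# Sketch of the warm-up-then-measure procedure.
# `bench` and `now_ms` are hypothetical helpers, not part of Presto.

# Print the current time in milliseconds.
# Relies on GNU date; %N is not supported by BSD/macOS date.
now_ms() {
  echo $(( $(date +%s%N) / 1000000 ))
}

# Usage: bench WARMUPS RUNS COMMAND [ARGS...]
# Runs COMMAND WARMUPS times untimed, then RUNS times timed,
# and prints the average wall-clock runtime in milliseconds.
bench() {
  local warmups=$1 runs=$2
  shift 2
  local i start end total=0

  # Warm-up runs: results and timings are discarded.
  for (( i = 0; i < warmups; i++ )); do
    "$@" > /dev/null
  done

  # Measured runs: accumulate elapsed milliseconds.
  for (( i = 0; i < runs; i++ )); do
    start=$(now_ms)
    "$@" > /dev/null
    end=$(now_ms)
    total=$(( total + end - start ))
  done

  # Integer average over the measured runs.
  echo $(( total / runs ))
}
```

Assuming the Presto CLI is installed as `presto`, the coordinator listens on `localhost:8080`, and the query text is saved in a file named `q01.sql` (all three are assumptions), a run for q01 at sf1 might look like `bench 5 5 presto --server localhost:8080 --catalog tpch --schema sf1 --file q01.sql`.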

Here is the result of the benchmark across five different instance types and two data set sizes. The y-axis shows the runtime of each query type in milliseconds. Smaller is better.

Figure 1 shows notable results for the m6g instances, because the benchmark we use is a CPU-intensive workload. The performance difference is most significant on the larger data set (sf1), where m6g.4xlarge is up to 30 percent faster than m5.4xlarge with the same vCPU and memory configuration.

Promising Results: 30 Percent Faster, 20 Percent Cheaper, Up to 50 Percent Better ROI

As we have shown, m6g.4xlarge is up to 30 percent faster than the current generation m5.4xlarge instance type. Considering that m6g.4xlarge also costs 20 percent less than m5.4xlarge, we can achieve up to 50 percent better ROI in total. That is a very promising result for users running similar compute-heavy workloads on AWS, and we look forward to AWS Graviton2 becoming commercially available in the near future!
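The post does not spell out how the speed and price figures are combined, so the following is only one possible reading. If "30 percent faster" is taken to mean 30 percent less runtime (an assumption), the two ratios can be multiplied to estimate the relative cost of running one query. The helper name `price_perf` and its parameters are hypothetical.

```shell
#!/usr/bin/env bash
# price_perf RUNTIME_RATIO PRICE_RATIO
#   RUNTIME_RATIO: m6g runtime / m5 runtime
#                  (0.70 if "30 percent faster" means 30 percent less runtime; an assumption)
#   PRICE_RATIO:   m6g hourly price / m5 hourly price (0.80 for 20 percent cheaper)
# Prints the relative cost per query and the resulting queries per dollar.
price_perf() {
  awk -v r="$1" -v p="$2" 'BEGIN {
    cost = r * p
    printf "relative cost per query: %.2f\n", cost
    printf "queries per dollar: %.2fx\n", 1 / cost
  }'
}

price_perf 0.70 0.80
# relative cost per query: 0.56
# queries per dollar: 1.79x
```

Under this reading, each query costs about 44 percent less on m6g.4xlarge, which is in the same range as the combined figure quoted above. Reading "30 percent faster" as 1.3 times the throughput instead would correspond to a runtime ratio of about 0.77.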

Kai Sasaki

Kai Sasaki is a software engineer at Treasure Data, which provides an award-winning enterprise customer data platform (CDP). He maintains several distributed systems that support business-critical data analysis and decisions by marketers in real time. The platform is powered by AWS and open-source software such as Presto.