Amazon Redshift Performance – Bigger Clusters, or Bigger Nodes?

Last week, I looked into options for increasing the performance of an Amazon Redshift cluster that was currently using 10 large dense compute nodes (dc1.large). While investigating, I noticed that a cluster of 32 dc1.large nodes (the maximum for that node type) had the same number of CPUs, the same amount of storage, and a comparable amount of RAM to a cluster of 2 dc1.8xlarge nodes (the minimum for that node type), while offering significant cost savings for anything less than a 3-year term. This got me wondering… if you want a cluster that’s around that size, which is a better bet for performance?

We’ll get the answer out of the way early for the tl;dr crowd.

The results of testing out three diverse queries on various-sized Redshift clusters.

The cluster of 32 dc1.large nodes provided substantially (13%, on average) faster performance for all but the shortest query in my test batch. Given the superior performance and lower cost, it was the obvious choice for a cluster of this size. Whenever more performance is called for, it will make sense to switch to a cluster of 3 dc1.8xlarge nodes…

For those interested in how I conducted the tests, I basically created a new Redshift cluster (so that there would be nobody but me using it), restored a snapshot of the database in question, and ran my test queries (three times each, so I could get a nice average of time to execute). Then I resized the cluster, ran a vacuum and analyze just to be safe, and tried out the queries again. Wash, rinse, repeat.

While my suite of three test queries is by no means comprehensive, I did try to incorporate some variety, while also sticking to real-world queries we were using around the business.

  • Query 1 was a relatively long operation that involved a number of joins between large tables and a whole bunch of grouping and floating-point math.

  • Query 2 was a quicker operation, but one that still involved a bunch of joins, grouping, and floating-point math.

  • Query 3 was long-running again, but with very few joins - mainly just crunching a bunch of data on one massive table.

Of course, the one thing that’s noticeably absent from this list is a truly long-running query - the kind that runs for 30 minutes… or a couple of hours. As my readers will no doubt understand, I didn’t want to sit around and run hour-long queries three times over for the better part of a couple days.

As much as I’d love to share the underlying queries with you, they’re proprietary, so you’ll have to take my word on their contents and performance. With that said, all told, the entire process of carrying out these tests cost us somewhere in the vicinity of $60, so it’s definitely something that would be easy to do if you want to try it out with your own data and queries.

Ultimately, this information was enough to demonstrate that a cluster of 32 dc1.large nodes was a better option in this instance, but your mileage may vary. I’d be interested to hear if anybody else tries this out… a bit of Googling didn’t seem to turn up too many people who had wrestled with this issue…

As a side note, it’s interesting that Amazon Redshift performance doesn’t scale linearly with cluster size. Going from 10 nodes to 30 nodes will triple your costs, but only roughly double your query performance…