This post originally appeared on Medium. Pavel Kerbel is a Software Development Team Leader at ironSource.
This is the second part of my AWS Redshift series. If you missed Part 1, which is about fixing errors and investigating performance issues, you can read it here. Part 2 will focus on costs and how we can keep them to a minimum.
At ironSource, we use Redshift as our raw data storage. Here is a diagram of our data warehouse layer.
We use ironSource Atom to stream events in near real time into the cluster. Our BI analysts and data scientists then run an ETL on top of it and save the aggregated data in RDS MySQL, where our clients can enjoy their near-real time dashboards. When the size of our data started to grow, the number of nodes and the price we paid grew accordingly, and as you probably know, Redshift is not a cheap solution.
Here is what we did to cut costs and still keep our data accessible.
Redshift Optimizer Pro
Let’s start with the essentials:
If you create a table without proper encoding to your columns, raw data will be saved without any compression type and might take up much more space that it should.
During our experience with our friends at 365Scores, we managed to reduce a table size of 2.62TB with 46 uncompressed columns to 0.86TB after compressing all the columns. That’s more than 66% of freed storage!
Just by optimizing their tables, they were able to reduce their cluster size from 23 dc1.xlarge nodes with a cost of ~$4,500 to 8 dc1.xlarge nodes that cost ~$1,400. That’s over a $3,000 monthly cost reduction!
If you only take one thing away from this blog post, it should be this: “Always encode your columns”.
Building an automatic tool that scans our client’s tables and alerts them when it finds un-encoded columns was such an important step in reducing costs. You don’t have to decide about the encoding types when you create the table. You can always do it later after you load some data into it.
Here is the procedure to perform a full table compression using a deep copy logic:
- Ask Redshift what the recommended compression is for your data.
analyze compression schema.table;
Column | Encoding ------------+----------- column_a | lzo column_b | bytedict column_c | mostly16 column_d | delta32k
(Read here to understand more about the encoding types).
- Create a new table with the same structure as the original table but with the proper encoding recommendations. Remember, do not encode your sort key. You should leave it raw for Redshift that uses it for sorting your data inside the nodes.
- Copy all the data from the original table to the encoded one.
- Rename the table’s names.
- Drop the uncompressed table.
This procedure might become challenging when the size of the table is big or when you have near real-time loads to your tables. We solved it in ironSource Atom by enabling a feature that we call “pause” on your table data load, and running a deep copy in small steps and not a bulk insert of all the data.
I recommend that you take a look at this utility for encoding columns:
Every Redshift user must be familiar with this process. If you want to:
- Keep your data distributed across the nodes.
- Sort your data inside the nodes.
- Free disk space after deleting data.
You should run Vacuum from time to time. (Check out these docs). Here is a screenshot of freed up disk space. You’ll love it :)
After we compressed and vacuumed our tables, let’s dig into more price reduction practices.
What Node types to choose
For this article’s sake, let’s assume that these are our business needs:
- Raw events size is ~50GB a day.
- Clients should see aggregated data in dashboards of the last 90 days.
- In some cases, we will need data from the last 365 days.
When we start our first Redshift cluster, we would probably choose 1 dc1.xlarge node to enjoy it’s high performance SSD disks, then resize to the amount of nodes we need for production.
After a couple of resizes, both stability and cost considerations will be needed to take into account.
- Stability: AWS Redshift architecture is implemented in a way that if 1 node crashes, the entire cluster will be unavailable.That means that the chance of a cluster downtime goes up with the amount of nodes. Best practice would be shifting to dc1.8xlarge nodes if you need more than 32 dc1.xlarge ones.
- Cost: This will be the numbers if we will try to support the needs listed above with dc1 node types:
A year of storage will be: 365*50GB = ~18TB. A dc1.xlarge node can contain 0.16 TB, it means we will need ~113 nodes of this type. That’s way over the 32 best practice, so we will actually need 7 dc1.8xlarge. Another Best practice is to keep 40% disk space free, so adding it up, will sum to 10 dc1.8xlarge. This leaves us with a total of ~$37,000 monthly bill.
HOT/COLD clusters — introducing “Data Life cycle”
We started to examine our data usage and we noticed that 80% of our queries are running on the last 90 days, and only 20% of them run on older data. This is when we decided to add another Redshift cluster with ds2.xlarge nodes HDD disks. By doing so, we “compromised” the performance of these 20% queries but saved over 17K.
From now on, I will refer to the clusters as “HOT” and “COLD”.
Here’s the math:
- Keeping only 90 days (4.5TB) in the “hot” cluster will allow us to reduce it’s size to 3 dc1.8xlarge nodes. Total cost of~$11,000 monthly.
- We still want to query data of a full year, so we will keep the 18TB of data in 13 ds2.xlarge nodes. These will cost us ~$9,000 monthly bill.
By adding this hot/cold clusters logic we managed to reduce our monthly bill to $20,000. “Data Life Cycle” is how we call the process that deletes and moves data from one cluster to another. In ironSource Atom we manage the data life cycle for our clients out of the box, as we understand that it gives our clients the benefits that it gave internally.
Here is the basic flow of the data life cycle that can be triggered every day:
- Unload yesterday’s data from the hot cluster to S3:
- Copy it to the cold cluster
- Delete old data from the hot cluster:
- Vacuum — we will get back to it later
We also need to free space from the cold cluster, but we don’t want to delete data without backing it up first.
In this case we will usually UNLOAD a week/month of data to S3, for backup and then DELETE it from the cluster.
Saving data in S3 might also become expensive as the data grows, so if it is not in use, it is a good practice to move it to Glacier or to change its policy to: Standard — Infrequent Access, based on your use case.
Handle crises like a boss — Auto incremented snapshots
We have come across use cases in the past when we needed to query a full year of data as soon and as fast as possible. We had all the data in our cold cluster, but the queries were just not fast enough for us.
Loading a whole year of data from S3 to a dc1 nodes type Redshift may take days. We needed a faster recovery solution.
Auto incremented snapshot
We decided to run a weekly job that will restore a snapshot, insert the last week to it, and create a new snapshot with the new incremented data. This way we are able to create a new SSD disks cluster in minutes with all our data ready to be queried.
The snapshot is stored in S3 in a queryable format so you don’t need to wait for the restore process to reach 100%, you can query the data as soon as your cluster becomes available.
- Always encode your columns but leave the sort key out of it.
- Vacuum, vacuum, vacuum.
- If you need more than 32 dc1.xlarge nodes, you would probably prefer using dc1.8xlarge nodes.
- Use ds2 nodes for data that is not performance dependent.
- Use data lifecycle logic to save costs.
- Incremental snapshots — query while restoring.