This post originally appeared on Medium. Pavel Kerbel is a Software Development Team Leader at ironSource.
If you are reading this, then common SQL/No-SQL database solutions probably aren’t robust enough for your needs.
That was the case with ironSource 3 years ago.
After weighing different options, we chose AWS Redshift to be our data warehouse. But to really benefit from its incredible power, we needed to implement some creative solutions that we developed along the way.
Eventually, friends in the industry became interested in our solution, and we decided to offer our big data pipeline solution to the public. It became known as ironSource Atom.
In this series of blog posts, I’ll share insider tips and show you the quickest way to supercharge Redshift’s performance, improve cost efficiency and user/access management.
Part 1 will focus on slow performance and errors handling, and if you reach the end, a little bonus, too.
Lets dive in.
Handling slow performance
Even AWS Redshift has a point when it starts to perform slowly. Usually it’s because we either think it can handle any amount of load or we are not careful enough with our queries.
Here is how your running queries graph should NOT look like:
See running queries:
status | starttime | duration | user_name | db_name | query | pid -------+-----------+----------+-----------+---------+-------+------ Running| 2016-09...| 28156645 | user_name | dbname |select.| 12275
Now we can start killing irrelevant queries by running:
Sometimes the source of the problem is a connection leak from one of the servers. In this case, you want to know from which IP the queries are running:
recordtime | username | dbname | remotehost | remoteport -----------+----------+--------+--------------+----------- 2016-09-...| user_name| dbname | 10.49.42.138 | 1234
Another issue might be a lock on a table.
Here is how you find out:
table_id | last_update | lock_owner | lock_owner_pid ---------+-------------+------------+---------------- 100004 | 2016-09-....| 1043 | 5656
Then you can kill a locking sessions by running:
Usually these queries will be enough to solve your current performance problem. But in order to prevent these issues in the future, I recommend looking up some best practices. You can start with digging into amazon-redshift-utils in github.
Check out this blog post by Michael Loewenstein to learn how to “Squeeze The Juice Out Of Redshift” and continuously improve your queries’ run time.
Fixing insert errors (stl_load_errors)
The recommended way to load massive amounts of data to Redshift is using a COPY command. Using this mechanism you can tell Redshift to read data from files stored in an S3 bucket.
Unfortunately, the values in your S3 files will not always fit the table’s structure, meaning that rows might not be inserted.
When the COPY command finishes, you might see this response:
INFO: Load into table ‘events’ completed, 1000 record(s) could not be loaded. Check ‘stl_load_errors’ system table for details.
Sometimes these 1000 (or more over time) records might be valuable, and you will not want to lose them.
At ironSource Atom we know that every row is valuable, so we decided to present these errors in our dashboards:
A more detailed view will display more information:
We even offer the ability to automatically fix the error and reinsert the row (soon to come).
Here’s how we know which columns caused the errors:
day | colname | err_reason | amount ---------+---------+---------------+---------- 2016-09..| col_1 | String leng...| 10000 2016-09..| col_2 | Invalid dig...| 4000
Now we have two approaches for inserting the problematic rows:
Super delicate surgical operation:
Set up a whole “fix” programmatic operation by running the following procedure:
1. Get all the S3 file names that contain bad rows:
day | colname | err_reason | filename | line_number ---------+---------+---------------+----------+------------- 2016-09..| col_1 | String leng...| s3://a..| 433 2016-09..| col_2 | Invalid dig...| s3://b..| 800
2. Iterate over all the S3 files that appear in the “filename” field.
3. Download it and find the specific problematic row using the one that appears in the “line_number” field.
4. Find the column from the “column field” and fix its problematic value according to the “err_reason” field:
“String length exceeds DDL length” — truncate the length to fit the column in Redshift.
“Missing data for not-null field” — put some default value.
“Invalid digit, Value ‘.’, Pos 0, Type: Integer” — usually it is a float value that should be an int.
5. Write a new file with the fixed rows to S3 and COPY it to Redshift.
Using Redshift’s speed and some SQL:
In this approach we will create a temp table that will hold all the columns of the original table but with a small change in the problematic column.
For example, if this is our table:
CREATE TABLE schema.table ( id bigint, ts timestamp, name varchar(256) );
and the errors that we see in stl_load_errors are in the “name” column, the procedure will be as follows:
- Create a new table:
CREATE TABLE schema.table_temp ( id bigint, ts timestamp, name varchar(2560) );
2. COPY the data from the S3 again. This time, all the data that wasn’t inserted into the original table due “String length exceeds DDL length”, will be inserted.
3. Now all we need to do is to delete a range of data from the original table and to insert a fixed data point from the temp table:
DELETE FROM schema.table WHERE ts BETWEEN 'start_date' AND 'end_date';
4. and insert:
INSERT INTO schema.table ( id, ts, name ) SELECT id, ts, SUBSTRING(name,0,64) FROM schema.table_temp WHERE ts BETWEEN 'start_date' AND 'end_date';
5. Don’t forget to drop the temp table.
If you handle errors differently, I would love to hear your ideas in the comments.
It is worth mentioning here that in order to minimize the errors in the copies, you should try to send the data in the proper structure. But from our experience, there are always surprises and changes in your events structures, which is why, at ironSource Atom, we validate and fix our data during the transformation layer, before it is written to S3 and copied to Redshift.
Bonus: Hanging queries
Yes, sometimes your query might hang and not really be executed. Here’s why it might happen: Redshift Leader node runs with MTU 1500.
Maximum transmission unit (MTU) is the largest size packet or frame that can be sent in a packet- or frame-based network.
EC2 instances with advanced networking are set up with “MTU 9000” (e.g c4.large) meaning they support TCP/IP “jumbo frame” and can enhance throughput in the network. However Redshift does not support jumbo frames, so the network of the instances that is running the queries needs to adjust its MTU to 1500.
To see your instance MTU status, you can use the ‘ifconfig’ or the ‘ip’ commands.
ip link show eth0
Then you should change it to 1500:
sudo ip link set dev eth0 mtu 1500
You should no longer suffer from hanging queries after this fix.
Stay tuned for more good stuff.