
The Hitchhiker’s Guide to AWS Redshift (Pt. 1, With Great Power Comes Great Performance Issues)

This post originally appeared on Medium. Pavel Kerbel is a Software Development Team Leader at ironSource. 

If you are reading this, then common SQL/No-SQL database solutions probably aren’t robust enough for your needs.

That was the case with ironSource 3 years ago.

After weighing different options, we chose AWS Redshift to be our data warehouse. But to really benefit from its incredible power, we needed to implement some creative solutions that we developed along the way.

Eventually, friends in the industry became interested in what we had built, and we decided to offer our big data pipeline solution to the public. It became known as ironSource Atom.

In this series of blog posts, I’ll share insider tips and show you the quickest way to supercharge Redshift’s performance, improve cost efficiency and user/access management.

Part 1 will focus on slow performance and error handling, and if you reach the end, there's a little bonus, too.

Let's dive in.

Handling slow performance

Even AWS Redshift reaches a point where it starts to perform slowly. Usually it's because we either assume it can handle any amount of load or we aren't careful enough with our queries.

Here is how your running queries graph should NOT look:

Screenshot from AWS Redshift console — under “Performance”

Let’s debug it and find out who we should blame for the bad load.

See running queries:
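
The query itself can go against the stv_recents system table. A minimal sketch (pick whichever columns you need):

-- Sketch: list the queries that are currently running
select status, starttime, duration, user_name, db_name, query, pid
from stv_recents
where status = 'Running';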

Result:

status | starttime | duration | user_name | db_name | query | pid
-------+-----------+----------+-----------+---------+-------+------
Running| 2016-09...| 28156645 | user_name | dbname  |select.| 12275

Now we can start killing irrelevant queries by running:

cancel 12275;

Sometimes the source of the problem is a connection leak from one of the servers. In this case, you want to know from which IP the queries are running:
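
A sketch against the stl_connection_log system table (you may want to filter it down to the PIDs you saw above, but the basic idea is):

-- Sketch: find the remote host and port each session connected from
select recordtime, username, dbname, remotehost, remoteport
from stl_connection_log
where event = 'initiating session'
order by recordtime desc;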

Result:

recordtime | username | dbname |  remotehost  | remoteport
-----------+----------+--------+--------------+-----------
2016-09-...| user_name| dbname | 10.49.42.138 |   1234

Another issue might be a lock on a table.
Here is how you find out:
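
A sketch against the stv_locks system table:

-- Sketch: list current table locks and the sessions holding them
select table_id, last_update, lock_owner, lock_owner_pid
from stv_locks;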

Result:

table_id | last_update | lock_owner | lock_owner_pid
---------+-------------+------------+----------------
 100004  | 2016-09-....|    1043    |      5656

Then you can kill the locking session by running:

select pg_terminate_backend(5656);

Usually these queries will be enough to solve your current performance problem. But to prevent these issues in the future, I recommend reading up on some best practices. You can start by digging into amazon-redshift-utils on GitHub.

Check out this blog post by Michael Loewenstein to learn how to “Squeeze The Juice Out Of Redshift” and continuously improve your queries’ run time.


Fixing insert errors (stl_load_errors)

The recommended way to load massive amounts of data to Redshift is using a COPY command. Using this mechanism you can tell Redshift to read data from files stored in an S3 bucket.
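
A minimal COPY sketch (the bucket, prefix, IAM role and format options below are placeholders; adjust them to your files):

-- Sketch: load gzipped, pipe-delimited files from S3 into the events table
COPY events
FROM 's3://your-bucket/your-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
DELIMITER '|'
GZIP;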

Unfortunately, the values in your S3 files will not always fit the table’s structure, meaning that rows might not be inserted.

When the COPY command finishes, you might see this response:

INFO: Load into table ‘events’ completed, 1000 record(s) could not be loaded. Check ‘stl_load_errors’ system table for details.

Sometimes these 1000 (or more over time) records might be valuable, and you will not want to lose them.

At ironSource Atom we know that every row is valuable, so we decided to present these errors in our dashboards.

A more detailed view displays more information about each error.

We even offer the ability to automatically fix the error and reinsert the row (soon to come).

Here’s how we know which columns caused the errors:
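
The information comes from the stl_load_errors system table. A sketch of such a query (tweak the time range and grouping to your needs):

-- Sketch: count load errors per day, column and reason
select trunc(starttime) as day,
       trim(colname) as colname,
       trim(err_reason) as err_reason,
       count(*) as amount
from stl_load_errors
group by 1, 2, 3
order by 1 desc, 4 desc;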

Result:

   day   | colname |   err_reason  |  amount
---------+---------+---------------+----------
2016-09..|  col_1  | String leng...|  10000
2016-09..|  col_2  | Invalid dig...|  4000

Now we have two approaches for inserting the problematic rows:

Super delicate surgical operation:

Set up a programmatic “fix” operation by following this procedure:

1. Get all the S3 file names that contain bad rows:
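
Again from stl_load_errors, this time pulling the file name and line number as well (a sketch):

-- Sketch: locate the exact file and line of every bad row
select trunc(starttime) as day,
       trim(colname) as colname,
       trim(err_reason) as err_reason,
       trim(filename) as filename,
       line_number
from stl_load_errors
order by starttime desc;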

Result:

   day   | colname |   err_reason  | filename | line_number 
---------+---------+---------------+----------+-------------
2016-09..|  col_1  | String leng...|  s3://a..|     433
2016-09..|  col_2  | Invalid dig...|  s3://b..|     800

2. Iterate over all the S3 files that appear in the “filename” field.

3. Download each file and locate the problematic row using the value in the “line_number” field.

4. Find the column from the “colname” field and fix its problematic value according to the “err_reason” field:

“String length exceeds DDL length” — truncate the value to fit the column length defined in Redshift.

“Missing data for not-null field” — put in some default value.

“Invalid digit, Value ‘.’, Pos 0, Type: Integer” — usually it is a float value that should be an int.

5. Write a new file with the fixed rows to S3 and COPY it to Redshift.

Using Redshift’s speed and some SQL:

In this approach we will create a temp table that will hold all the columns of the original table but with a small change in the problematic column.

For example, if this is our table:

CREATE TABLE schema.table
(
   id    bigint,
   ts    timestamp,
   name  varchar(256)
);

and the errors that we see in stl_load_errors are in the “name” column, the procedure will be as follows:

1. Create a new table:

CREATE TABLE schema.table_temp
(
   id    bigint,
   ts    timestamp,
   name  varchar(2560)
);

2. COPY the data from S3 again. This time, all the rows that weren’t inserted into the original table due to “String length exceeds DDL length” will be inserted.

3. Now all we need to do is delete the affected range of data from the original table and insert the fixed rows from the temp table:

DELETE
FROM schema.table
WHERE ts BETWEEN 'start_date' AND 'end_date';

4. and insert:

INSERT INTO schema.table
(
  id,
  ts,
  name
)
SELECT id,
       ts,
       SUBSTRING(name, 1, 256)
FROM schema.table_temp
WHERE ts BETWEEN 'start_date' AND 'end_date';

5. Don’t forget to drop the temp table.
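
With the temp table from this example, that’s simply:

DROP TABLE schema.table_temp;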

If you handle errors differently, I would love to hear your ideas in the comments.

It is worth mentioning that in order to minimize COPY errors, you should try to send the data in the proper structure in the first place. But from our experience, there are always surprises and changes in your event structures, which is why, at ironSource Atom, we validate and fix our data in the transformation layer, before it is written to S3 and copied to Redshift.


Bonus: Hanging queries

Yes, sometimes your query might hang and never really execute. Here’s why it might happen: the Redshift leader node runs with an MTU of 1500.

Maximum transmission unit (MTU) is the largest size packet or frame that can be sent in a packet- or frame-based network.

EC2 instances with enhanced networking (e.g. c4.large) are set up with an MTU of 9000, meaning they support TCP/IP “jumbo frames” and can achieve higher network throughput. However, Redshift does not support jumbo frames, so the network interface of the instance running the queries needs its MTU adjusted to 1500.

To see your instance’s current MTU, you can use the ‘ifconfig’ or the ‘ip’ commands.
For example:

ip link show eth0

Then you should change it to 1500:

sudo ip link set dev eth0 mtu 1500

You should no longer suffer from hanging queries after this fix.

Stay tuned for more good stuff.
