Scheduling SQL queries on your Amazon Redshift data warehouse


Amazon Redshift is the most popular cloud data warehouse today, with tens of thousands of customers collectively processing over 2 exabytes of data on Amazon Redshift daily. Amazon Redshift is fully managed, scalable, secure, and integrates seamlessly with your data lake. In this post, we discuss how to set up and use the new query scheduling feature on Amazon Redshift.

Amazon Redshift users often need to run SQL queries or routine maintenance tasks at a regular schedule. You can now schedule statements directly from the Amazon Redshift console or by using the AWS Command Line Interface (AWS CLI) without having to use scripting and a scheduler like cron. You can schedule and run the SQL statement using Amazon EventBridge and the Amazon Redshift Data API. The results are available for 24 hours after running the SQL statement. This helps you in a variety of scenarios, such as when you need to do the following:

  • Run SQL queries during non-business hours
  • Load data using COPY statements every night
  • Unload data using UNLOAD nightly or at regular intervals throughout the day
  • Delete older data from tables as per regulatory or established data retention policies
  • Refresh materialized views manually at a regular frequency
  • Back up system tables every night

EventBridge is a serverless event bus service that makes it easy to connect your applications with data from a variety of sources. EventBridge delivers a stream of real-time data from your own applications, software as a service (SaaS) applications, and AWS services and routes that data to targets including Amazon Redshift clusters. Amazon Redshift integrates with EventBridge to allow you to schedule your SQL statements on recurring schedules and enables you to build event-driven applications. Besides scheduling SQL statements, you can also invoke the Amazon Redshift Data API in response to any other EventBridge event.

When creating a schedule using the Amazon Redshift console, you create an EventBridge rule with the specified schedule and attach a target (with the Amazon Redshift cluster information, login details, and SQL command run) to the rule. This rule then runs as per the schedule using EventBridge.

In this post, we describe how you can schedule SQL statements on Amazon Redshift using EventBridge, and also go over the required security privileges and the steps to schedule a SQL statement using both the AWS Management Console and the AWS CLI.

Prerequisites

To set this up, we need to make sure that the AWS Identity and Access Management (IAM) user (which we use to create the schedule and access the schedule history), the IAM role, and the AWS secret (which stores the database user name and password) are configured correctly.

  1. Make sure that the IAM user who will create the schedule has the AmazonEventBridgeFullAccess IAM policy attached. The IAM user should also have appropriate access to Amazon Redshift as per their role. We use the placeholder {USER_NAME} to refer to this IAM user in this post.
  2. Store the database credentials to be used for running the scheduled SQL statement securely using AWS Secrets Manager. If you already have a secret created, you can use that. If not, create a new secret to store the Amazon Redshift database user name and password.
  3. Create an IAM role for the Amazon Redshift service with the “Redshift Customizable” option and attach the AmazonRedshiftDataFullAccess AWS managed policy to it.
    1. Ensure that the role has the following trust relationship added to it:
      {
            "Sid": "S1",
            "Effect": "Allow",
            "Principal": {
              "Service": "events.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }

    2. Add the following AssumeRole permissions so the user can see the schedule history list. Replace the {ACCOUNT_ID} with your AWS account ID and {USER_NAME} with the IAM user you set up:
      {
            "Sid": "S2",
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::{ACCOUNT_ID}:user/{USER_NAME}"
            },
            "Action": "sts:AssumeRole"
      }

    We use the placeholder {ROLE_NAME} to refer to this role in this post.

  4. In addition to the preceding AssumeRole permissions, the IAM user has to be granted AssumeRole permissions on the IAM role you created in order to view the schedule history:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "S3",
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": "arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}"
            }
        ]
    }
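Putting the preceding pieces together, the role's full trust policy and the user's AssumeRole policy can also be generated programmatically instead of hand-edited. The following is a minimal sketch using only Python's standard library; the account ID and user name passed in are illustrative placeholders:

```python
import json

def build_trust_policy(account_id: str, user_name: str) -> str:
    """Combine the EventBridge and IAM-user trust statements from this post
    into a single trust policy document for the scheduler role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # lets EventBridge assume the role to run the scheduled SQL
                "Sid": "S1",
                "Effect": "Allow",
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sts:AssumeRole",
            },
            {   # lets the IAM user assume the role to view schedule history
                "Sid": "S2",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:user/{user_name}"},
                "Action": "sts:AssumeRole",
            },
        ],
    }
    return json.dumps(policy, indent=2)

print(build_trust_policy("123456789012", "scheduler-user"))
```

The resulting document can be pasted into the role's trust relationship editor, or passed to `aws iam update-assume-role-policy`.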

Now that we have completed the security setup required to schedule SQL statements and view their history, let’s schedule a SQL statement to run at a regular frequency. We can do this using the console or the AWS CLI. We discuss both approaches in this post.

Scheduling the SQL statement using the console

Let’s take the example of a fairly common use case where data from a table has to be extracted daily (or at another regular frequency) into Amazon Simple Storage Service (Amazon S3) for analysis by data scientists or data analysts. You can accomplish this by scheduling an UNLOAD command to run daily to export data from the table to the data lake on Amazon S3. See the following code:

unload ('select * from edw.trxns')
to 's3://mybucket/'
iam_role 'arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME_2}'
PARQUET
PARTITION BY (trxn_dt);

{ROLE_NAME_2} in the preceding code is not the same as {ROLE_NAME}. {ROLE_NAME_2} should be an IAM role that has permissions to run the UNLOAD command successfully.

We can schedule this UNLOAD statement to run every day at 4:00 AM UTC using the following steps:

  1. Sign in to the console. Make sure the IAM user has been granted the necessary permissions.
  2. On the Amazon Redshift console, open the query editor.
  3. Choose Schedule.

  4. In the Scheduler permissions section, for IAM role, choose the role you created earlier.

If you don’t have IAM read permissions, you may not see the IAM role in the drop-down menu. In that case, you can enter the Amazon Resource Name (ARN) of the IAM role that you created.

  5. For Authentication, select AWS Secrets Manager.
  6. For Secret, choose the secret you configured earlier, which contains the database user name and password.

You can also use Temporary credentials for authentication as explained in the AWS documentation.

  7. For Database name, enter the name of the database to connect to.


  8. In the Query Information section, for Scheduled query name, enter a name for the query.
  9. For SQL query, enter the SQL code.

You also have the ability to upload a SQL query from a file.


  10. In the Scheduling options section, for Schedule query by, select Run frequency.
  11. For Repeat by, choose Day.
  12. For Repeat every, enter 1.
  13. For Repeat at time (UTC), enter 04:00.


  14. In the Monitoring section, you can choose to enable notifications via email or text using Amazon Simple Notification Service (Amazon SNS).

If you decide to enable notifications, make sure that the Publish action has been granted to events.amazonaws.com on the SNS topic. You can do this by adding the following snippet to the access policy of the SNS topic. Replace {ACCOUNT_ID} and {SNS_TOPIC_NAME} with appropriate values. If you choose to create a new SNS topic during the schedule creation process, then the following access is granted automatically.

{
      "Sid": "Allow_Publish_Events",
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:us-east-1:{ACCOUNT_ID}:{SNS_TOPIC_NAME}"
  }
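If the topic already has an access policy, the statement above has to be appended to the existing policy's Statement array rather than overwrite it. The following is a minimal sketch of that merge using only the standard library; the topic ARN shown is an illustrative placeholder:

```python
import json

def add_publish_statement(policy_json: str, topic_arn: str) -> str:
    """Append the EventBridge sns:Publish statement to an existing
    SNS topic access policy, keeping any statements already present."""
    policy = json.loads(policy_json)
    policy.setdefault("Statement", []).append({
        "Sid": "Allow_Publish_Events",
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sns:Publish",
        "Resource": topic_arn,
    })
    return json.dumps(policy, indent=2)

existing = '{"Version": "2012-10-17", "Statement": []}'
print(add_publish_statement(existing, "arn:aws:sns:us-east-1:123456789012:my-topic"))
```

The merged document can then be set on the topic, for example with `aws sns set-topic-attributes --attribute-name Policy`.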

Scheduling the SQL statement using the AWS CLI

You can also schedule SQL statements via the AWS CLI using EventBridge and the Amazon Redshift Data API. For more information about the Amazon Redshift Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters.

In the following example, we set up a schedule to refresh a materialized view (called mv_cust_trans_hist) on Amazon Redshift daily at 2:00 AM UTC.

  1. Create an event rule. The following command creates a rule named scheduled-refresh-mv-cust-trans-hist and schedules it to run daily at 2:00 AM UTC. The schedule expression can be provided using cron or rate expressions.
    aws events put-rule \
    --name scheduled-refresh-mv-cust-trans-hist \
    --schedule-expression "cron(0 2 * * ? *)"

  2. Create a JSON object that contains the Amazon Redshift Data API parameters:
    1. For ARN, enter the ARN of your Amazon Redshift cluster.
    2. For the RoleArn and the SecretManagerArn fields, use the ARNs for the role and secret you created earlier.
    3. Enter the database name and the SQL statement to be scheduled for the database and SQL fields.
    4. You can enter any name for the StatementName field.
    5. If you want an event to be sent after the SQL statement has been run, you can set the WithEvent parameter to true. You can then use that event to trigger other SQL statements if needed.
      {
        "Rule": "scheduled-refresh-mv-cust-trans-hist",
        "EventBusName": "default",
        "Targets": [
          {
            "Id": "scheduled-refresh-mv-cust-trans-hist",
            "Arn": "arn:aws:redshift:us-east-1:{ACCOUNT_ID}:cluster:{REDSHIFT_CLUSTER_IDENTIFIER}",
            "RoleArn": "arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}",
            "RedshiftDataParameters": {
              "SecretManagerArn": "arn:aws:secretsmanager:us-east-1:{ACCOUNT_ID}:secret:{SECRET_NAME-xxxxxx}",
              "Database": "dev",
              "Sql": "REFRESH MATERIALIZED VIEW mv_cust_trans_hist;",
              "StatementName": "refresh-mv-cust-trans-hist",
              "WithEvent": true
            }
          }
        ]
      }

  3. Create an event target using the JSON file created in the previous step:
    aws events put-targets --cli-input-json file://data.json
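The data.json payload above can also be generated programmatically, which is handy when you schedule many statements. The following is a minimal sketch; the account ID, cluster identifier, role name, and secret name used in the example call are illustrative placeholders:

```python
import json

def build_target_input(account_id, cluster_id, role_name, secret_name, rule_name, sql):
    """Build the cli-input-json payload consumed by `aws events put-targets`."""
    return {
        "Rule": rule_name,
        "EventBusName": "default",
        "Targets": [{
            "Id": rule_name,
            "Arn": f"arn:aws:redshift:us-east-1:{account_id}:cluster:{cluster_id}",
            "RoleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
            "RedshiftDataParameters": {
                "SecretManagerArn": f"arn:aws:secretsmanager:us-east-1:{account_id}:secret:{secret_name}",
                "Database": "dev",
                "Sql": sql,
                "StatementName": "refresh-mv-cust-trans-hist",
                "WithEvent": True,
            },
        }],
    }

payload = build_target_input(
    "123456789012", "my-cluster", "my-scheduler-role", "my-secret-abc123",
    "scheduled-refresh-mv-cust-trans-hist",
    "REFRESH MATERIALIZED VIEW mv_cust_trans_hist;",
)
# Write the file that `aws events put-targets --cli-input-json file://data.json` reads
with open("data.json", "w") as f:
    json.dump(payload, f, indent=2)
```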

Additional commands

We have now set up the schedule to refresh the materialized view using the AWS CLI. Let’s quickly go over a few more CLI commands that are useful while scheduling tasks:

  • To list all targets for a particular rule, use:
    aws events list-targets-by-rule --rule <rule-name> 

  • To list all rules, use:
    aws events list-rules

  • To remove a target from a specific rule, use:
    aws events remove-targets --rule <rule-name> --ids 2 

  • To delete a rule, use:
    aws events delete-rule --name <rule-name>

  • To view the schedule history for a particular scheduled SQL statement, use:
    aws redshift-data list-statements --status ALL --statement-name <statement-name>
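The list-statements output can also be filtered programmatically, for example to pick the most recent run of a scheduled statement. The following is a minimal sketch against a response of the documented shape; the sample IDs and timestamps are illustrative:

```python
import json

def latest_statement_id(response_json: str) -> str:
    """Return the Id of the most recently created statement in an
    `aws redshift-data list-statements` response."""
    statements = json.loads(response_json)["Statements"]
    # ISO-8601 timestamps sort correctly as strings
    newest = max(statements, key=lambda s: s["CreatedAt"])
    return newest["Id"]

sample = json.dumps({
    "Statements": [
        {"Id": "aaa-111", "StatementName": "refresh-mv-cust-trans-hist",
         "Status": "FINISHED", "CreatedAt": "2020-10-08T02:00:05"},
        {"Id": "bbb-222", "StatementName": "refresh-mv-cust-trans-hist",
         "Status": "FINISHED", "CreatedAt": "2020-10-09T02:00:04"},
    ]
})
print(latest_statement_id(sample))  # bbb-222
```

The returned ID is what you pass to get-statement-result, as shown later in this post.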

Retrieving the SQL status and results

After the schedule is set up, scheduled queries are listed in the following places:

  • On the Schedules tab of the details page of your cluster.
  • On the scheduled queries list that you can reach from the navigation pane. To see the list, in the navigation pane, choose Queries and Schedule query list.
  • On the Scheduled queries tab of the query editor (see the following screenshot).


  1. Choose the schedule name to see more details about the scheduled query, including details about any previous runs of the schedule.
  2. In the Schedule history section, you can see the ID (which can be used to retrieve SQL statement results), start time, end time, status, and elapsed time for each previous run of the scheduled SQL statement.


To retrieve the SQL results, you need to use the AWS CLI (or an AWS SDK), and you must first assume the role you created earlier.

  1. To assume this role, run the following command on the command line using the IAM user you configured:
    aws sts assume-role --role-arn "arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}" --role-session-name AWSCLI-Session

The command returns a result set similar to the following code:

{
    "Credentials": {
        "AccessKeyId": "XXXXXXXXXXXX",
        "SecretAccessKey": "XXXXXXXXXXXXXXXXX",
        "SessionToken": "xxxxxxxxxxxxxxxxxxxxx",
        "Expiration": "2020-10-09T17:13:23+00:00"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "XXXXXXXXXXXXXXXXX:AWSCLI-Session",
        "Arn": "arn:aws:sts::{ACCOUNT_ID}:assumed-role/{ROLE_NAME}/AWSCLI-Session"
    }
    }
}
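Rather than copying each value by hand, the credentials in this output can be turned into the export commands shown in the next step. The following is a minimal sketch using only the standard library; the sample credential values are illustrative:

```python
import json

def export_lines(assume_role_output: str) -> str:
    """Turn `aws sts assume-role` JSON output into the three export
    commands needed to use the temporary credentials."""
    creds = json.loads(assume_role_output)["Credentials"]
    return "\n".join([
        f"export AWS_ACCESS_KEY_ID={creds['AccessKeyId']}",
        f"export AWS_SECRET_ACCESS_KEY={creds['SecretAccessKey']}",
        f"export AWS_SESSION_TOKEN={creds['SessionToken']}",
    ])

sample = json.dumps({"Credentials": {
    "AccessKeyId": "AKIDEXAMPLE",
    "SecretAccessKey": "secretEXAMPLE",
    "SessionToken": "tokenEXAMPLE",
    "Expiration": "2020-10-09T17:13:23+00:00",
}})
print(export_lines(sample))
```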

  2. Set three environment variables with the temporary credentials by running the following commands. Use the access key, secret access key, and session token from the preceding results.
    export AWS_ACCESS_KEY_ID=RoleAccessKeyID
    export AWS_SECRET_ACCESS_KEY=RoleSecretKey
    export AWS_SESSION_TOKEN=RoleSessionToken

  3. Run the following command to retrieve the results of the SQL statement. Replace the ID with the ID from the schedule history on the console.
    aws redshift-data get-statement-result --id xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx --region us-east-1 
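The get-statement-result response returns each row as a list of typed values alongside column metadata. The following is a minimal sketch that flattens such a response into plain Python rows; the sample column names and values are illustrative:

```python
import json

def rows_from_result(result_json: str):
    """Flatten an `aws redshift-data get-statement-result` response into
    a list of dicts keyed by column name."""
    result = json.loads(result_json)
    names = [col["name"] for col in result["ColumnMetadata"]]
    rows = []
    for record in result["Records"]:
        # each field is a single-key dict such as {"stringValue": ...}
        rows.append({name: next(iter(field.values()))
                     for name, field in zip(names, record)})
    return rows

sample = json.dumps({
    "ColumnMetadata": [{"name": "cust_id"}, {"name": "trxn_amt"}],
    "Records": [
        [{"longValue": 101}, {"stringValue": "49.99"}],
        [{"longValue": 102}, {"stringValue": "12.50"}],
    ],
    "TotalNumRows": 2,
})
print(rows_from_result(sample))
```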

Conclusion

The ability to schedule SQL statements using the Amazon Redshift Data API and EventBridge simplifies running routine tasks that previously required scripting. You can configure schedules and manage them either via the console or the AWS CLI. You can also see the previous runs of any scheduled SQL statement directly from the console and choose to be notified when it runs.


About the Authors

Sain Das is an Analytics Specialist Solutions Architect at AWS and helps customers build scalable cloud solutions that help turn data into actionable insights.

Vijetha Parampally Vijayakumar is a full stack software development engineer with Amazon Redshift. She is specialized in building applications for Big data, Databases and Analytics.

André Dias is a Systems Development Engineer working in the Amazon Redshift team. André’s passionate about learning and building new AWS Services and has worked in the Redshift Data API. He specializes in building highly available and cost-effective infrastructure using AWS.
