Data has become an essential part of every business, and its volume, velocity, and variety continue to increase. This has resulted in more complex ETL jobs with interdependencies between each other. There is also a critical need to be agile and automate the workflow—from changing the ETL jobs due to business requirements to deploying it into production. Failure to do so increases the time to value and cost of operations.
In this post, we show you how to automate the deployment of Amazon Redshift ETL jobs using AWS Batch and AWS CodeBuild. AWS Batch allows you to run your data transformation jobs without having to install and manage batch computing software or server clusters. CodeBuild is a fully managed continuous integration service that builds your data transformation project into a Docker image run in AWS Batch. This deployment automation can help you shorten the time to value. These two services are also fully managed and incur fees only when run, which optimizes costs.
We also introduce a third-party tool for the ETL jobs: DBT, which enables data analysts and engineers to write data transformation queries in a modular manner without having to maintain the execution order manually. It compiles all code into raw SQL queries that run against your Amazon Redshift cluster to use existing computing resources. It also understands dependencies within your queries and runs them in the correct order. DBT code is a combination of SQL and Jinja (a templating language); therefore, you can express logic such as if statements, loops, filters, and macros in your queries. For more information, see DBT Documentation.
The following illustration shows the architecture of the solution:
The steps in this workflow are as follows:
- A data analyst pushes their DBT project into a GitHub repo.
- CodeBuild is triggered and builds a Docker image from the DBT project. It reads Amazon Redshift and GitHub credentials from AWS Secrets Manager. The image is stored in Amazon Elastic Container Registry (Amazon ECR).
- Amazon CloudWatch Events submits an AWS Batch job on a scheduled basis to run the Docker image located in Amazon ECR.
- The AWS Batch job runs the DBT project against the Amazon Redshift cluster. When the job is finished, AWS Batch automatically terminates your Amazon Elastic Compute Cloud (Amazon EC2) resources so there is no further charge.
- If the DBT job fails, Amazon Simple Notification Service (Amazon SNS) notifies the data analyst via email.
Consequent commits pushed to the GitHub repo trigger CodeBuild, and the new version of the code is used the next time the ETL job is scheduled to run.
All code used in this post is available in the GitHub repo.
You need the following items to complete the steps in this post:
- An AWS account with permissions to manage these services.
- You need to use Region
- You need to use Region
- An empty GitHub repo. You use this to store the DBT project later.
Setting up the Amazon Redshift cluster
Your first step is to set up an Amazon Redshift cluster for the ETL jobs. The AWS CloudFormation template provided in this post deploys an Amazon Redshift cluster and creates the tables with the required data. The tables are created in the
You can use Amazon Redshift Query Editor to verify that the tables have been created in the public schema. The Amazon Redshift credentials are as follows:
- Username –
- Password – Available in Secrets Manager with the name
- Database name –
Deploying the automation pipeline
Now that we have the Amazon Redshift cluster, we’re ready to deploy the automation pipeline. The following stack builds the rest of our architecture and schedules the DBT job.
You must provide the following parameters for the CloudFormation template:
- GithubRepoUrl – The GitHub repo URL of the DBT project; for example,
- GithubToken – Your GitHub personal token. If you don’t have a token, you can create one. Make sure the token scope contains both
- JobFrequency – The frequency of the DBT job in cron format. Time is in UTC time zone. For example,
0 4 * * ? *runs the job every day at 4:00 UTC.
- MonitoringEmail – The email address that receives monitoring alerts.
- RedshiftStackName – The name of the CloudFormation stack where we deployed the Amazon Redshift cluster.
- RedshiftSchema – The name of the schema with your data inside Amazon Redshift cluster.
- GithubType – Whether you’re using public GitHub (github.com) or GitHub Enterprise from your company.
This template creates a CodeBuild project with a webhook configured to trigger a build whenever there is a change in the GitHub repo. The template also creates an AWS Batch job queue, job definition, compute environment, and CloudWatch event that is scheduled according to the
JobFrequency parameter. The AWS Batch job runs the command
dbt run –profiles-dir . to start the DBT jobs (for more information, see dbt run). Finally, the SNS topic for data job failures is also set up.
After the template is complete, you receive an email from email@example.com to confirm your subscription for the error topic. Choose Confirm subscription.
Setting up your DBT project in GitHub
You can download the DBT project repository that we provided for this post.
Push this project to your GitHub repo that you specified when deploying the CloudFormation stack. Ensure that your repository has the same folder structure as the sample; it’s required for CodeBuild and AWS Batch to run correctly.
This repository contains the generic DBT project (inside
/src/dbt-project) with some wrappers for the automation. In this section, we further examine the files in the provided repository.
/src/ folder, Dockerfile configures how the Docker image is built. Here, it installs the DBT library.
The DBT project is located in
profiles.yml file contains the connection details to the Amazon Redshift cluster, and CodeBuild configures the placeholder values.
models/example/ folder, you can see two queries that run against the tables we created earlier:
top_nations.sql. In DBT, each .sql file corresponds to a DBT model. We specify in both models that the query results are materialized as new tables. See the following screenshot.
You can also see that the
top_nations model uses the result of the
top_customers model using the ref function on line 9. This is sufficient for DBT to understand the dependencies and to run the
top_customers model prior to
/config folder, the
buildspec.yml file specifies the configuration the CodeBuild project uses. On line 3, we take the Amazon Redshift credentials from Secrets Manager (
$REDSHIFT_PASSWORD_SECRET resolve to
redshift-creds:password, respectively). It also replaces the placeholders in
profiles.yml with the value from Secrets Manager and the CloudFormation stack. Finally, it runs
docker build, which uses the Dockerfile in the
Testing the ETL Job
To test the pipeline, we push a Git commit to trigger the build.
- In the DBT project repository, we modify the
top_nations.sqlfile. We also want to select the
n_commentcolumn from nation table and include it in the
top_nations.sql looks like the following screenshot.
- Commit and push the change to the repository with the following commands:
git commit -am “update top_nations”
- On the CodeBuild console, choose the DBT build project.
You should see that a build is currently running. When the build is complete, we trigger the AWS Batch job using a CloudWatch event.
- On the CloudWatch console, under Events, choose Rules.
- Search for the rule that contains
- Select that rule and from the Actions drop-down menu, choose Edit.
- For testing purposes, you can change the cron expression temporarily so that the job is triggered 2–3 minutes from now. For example, if the current time is 15:04 GMT, you can put
6 15 * * ? *so that it runs at 15:06 GMT.
- Choose Configure details.
- Choose Update rule.
- On the AWS Batch console, choose Jobs.
You can see the progress of your job; it has the status
- When the job is successful, go to the Amazon Redshift Query Editor.
In the public schema, you should now see two new tables:
top_nations. These are the results of the DBT queries.
To avoid additional charges, delete the resources used in this post.
- On the Amazon ECR console, choose the repository named
dbt-batch-processing-job-repositoryand delete the images within it.
- On the AWS CloudFormation console, delete the two stacks we created.
- In GitHub, delete the personal token that was created for this project. To find your tokens, complete the following:
- Choose your profile photo.
- In Settings, choose Developer settings.
- Choose Personal access tokens.
This post demonstrates how to implement a pipeline to automate deploying and running your ETL jobs. This pipeline allows you to reduce time to market because it automates the integration of changes in your ETL jobs. Thanks to the managed services available in AWS, you can also implement this pipeline without having to maintain servers, thus reducing operational costs. You can set up the process using CloudFormation templates, which allows you to replicate the pipeline in multiple environments consistently.
We also introduce you to a third-party data transformation tool, DBT, which helps implement ETL jobs by managing dependencies between queries. Because DBT code combines SQL and Jinja, you can develop your queries in plain SQL or combine with Jinja to express more complex logic, such as if statements and loops.
If you have any questions or suggestions, leave your feedback in the comment section. If you need any further assistance to optimize your Amazon Redshift implementation, contact your AWS account team or a trusted AWS partner.
About the Authors
Megah Fadhillah is a big data consultant at AWS. She helps customers develop big data and analytics solutions to accelerate their business outcomes. Outside of work, Megah enjoys watching movies and baking.
Amine El Mallem is a DevOps consultant in Professional Services at Amazon Web Services. He works with customers to design, automate, and build solutions on AWS for their business needs.
Credit: Source link