Simplify Snowflake data loading and processing with AWS Glue DataBrew

Historically, loading and retrieving data within a single database platform has been straightforward compared to performing the same operations across multiple platforms. To simplify bringing in data from multiple database platforms, AWS Glue DataBrew supports connecting to data sources via the AWS Glue Data Catalog. However, this requires a Data Catalog database and table, which you may have to create by crawling your data sources.

With the launch of DataBrew database support, you can securely connect to and bring in your data from JDBC databases and data warehouses natively, without requiring a Data Catalog or AWS Glue crawler. This feature also offers a direct connection to the popular warehousing provider Snowflake with a highly simplified user experience. DataBrew is a visual data preparation service that makes it easy to clean and normalize data by providing over 250 pre-built transforms to automate data preparation tasks, all without writing any code.

In this post, we walk you through how to securely connect to your Snowflake database on the DataBrew console.

Solution overview

The following diagram illustrates our solution architecture.

The architecture includes the following components:

  1. We use an AWS Glue Data Catalog connection of CUSTOM type in DataBrew to store the connection information:
    1. JDBC connection URL including parameters db, warehouse, role, and schema.
    2. An AWS Secrets Manager secret ID containing your Snowflake credentials, which must include user and password keys.
    3. Optionally, Amazon Virtual Private Cloud (Amazon VPC) information in case your Snowflake account is AWS PrivateLink enabled.
    4. The Amazon Simple Storage Service (Amazon S3) URI location to the Snowflake JDBC driver and Spark connector JAR files.
    5. The AWS Glue connection name and the Snowflake table name, which are stored as part of the DataBrew dataset.
  2. We use the DataBrew dataset to create a DataBrew project in which DataBrew gets the secret value if the secret ID was provided when creating the connection.
  3. DataBrew queries sample data from Snowflake using the connection and credentials information including the table name. You can apply any of the over 250 built-in transforms on the sample data and build a recipe. You use the recipe to run a data transformation job on the full data.
  4. The output of the recipe job is written to your S3 bucket.

Let’s dive into how to securely bring data from Snowflake into DataBrew. Our solution contains the following steps:

  1. Create a secret (optional).
  2. Download Snowflake Spark and JDBC drivers.
  3. Create a Snowflake connection.
  4. Create your datasets.
  5. Create a DataBrew project using the datasets.
  6. Open a Snowflake project and build a transformation recipe.
  7. Run the DataBrew recipe job on the full data.

We also briefly discuss how to bring data from Amazon S3 back into Snowflake.

Prerequisites

To complete this solution, you should have the following prerequisites:

Create a secret

To store the database access credentials securely, DataBrew supports using AWS Secrets Manager with Snowflake connections. In this step, we store the credentials as key-value pairs in a new secret. If you already have a secret, or would like to store credentials as part of the AWS Glue Data Catalog connection instead, you can skip this step.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Select Secret key/value.
  5. Enter the key as user and enter the Snowflake user name.
  6. Choose Add row.
  7. Enter the key as password and enter the password for the Snowflake user.
  8. Select the encryption key and choose Next.
  9. Enter a name for your secret and a description.
  10. Choose Next.
  11. Configure the automatic rotation option as per your requirements and choose Next.
  12. Choose Store.

You can now view your secret on the Secrets page.
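The console steps above can also be scripted. The following sketch (secret name and credential values are hypothetical placeholders) builds the request for the Secrets Manager CreateSecret API; the actual boto3 call is commented out because it requires AWS credentials:

```python
import json

# Hypothetical names for illustration; replace with your own values.
secret_name = "my-snowflake-credentials"
snowflake_user = "databrew_user"
snowflake_password = "example-password"  # never hard-code real credentials

# DataBrew expects the secret to contain "user" and "password" keys.
secret_string = json.dumps({"user": snowflake_user, "password": snowflake_password})

create_secret_kwargs = {
    "Name": secret_name,
    "Description": "Snowflake credentials for AWS Glue DataBrew",
    "SecretString": secret_string,
}

# With AWS credentials configured, the secret would be created like this:
# import boto3
# boto3.client("secretsmanager").create_secret(**create_secret_kwargs)
```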

Download Snowflake Spark and JDBC drivers

DataBrew supports connecting to Snowflake via the Spark-Snowflake connector in combination with the Snowflake JDBC driver. Using the two drivers together improves performance significantly by reducing the overall time taken for running large jobs. For more information, see Overview of the Spark Connector.

DataBrew interactive sessions and jobs run on Spark v2.4.3 and Scala v2.11, so the Spark-Snowflake connector version you choose must be compatible with those versions.

  1. Find a compatible Spark connector version on the Spark-snowflake GitHub releases page and download the JAR file from the Maven Central Repository.

For this post, we use version 2.8.3-spark_2.4 of the Spark connector.

  2. Based on the Spark connector version, find the compatible JDBC driver version and download its JAR file from the Maven Central Repository page for the Snowflake JDBC driver.

For this post, we use version 3.12.16 of the JDBC driver. Keep both drivers accessible on your local hard drive.
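If you prefer to script the downloads, Maven Central lays artifacts out under a predictable URL structure. The following sketch builds the download URLs for the two JAR versions used in this post (URL layout per Maven Central conventions; verify the exact coordinates against the releases page):

```python
# Maven Central lays out artifacts as <group path>/<artifact>/<version>/<artifact>-<version>.jar.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

def maven_jar_url(group: str, artifact: str, version: str) -> str:
    """Build the download URL for a JAR hosted on Maven Central."""
    group_path = group.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"

# Versions used in this post.
spark_connector_url = maven_jar_url("net.snowflake", "spark-snowflake_2.11", "2.8.3-spark_2.4")
jdbc_driver_url = maven_jar_url("net.snowflake", "snowflake-jdbc", "3.12.16")

print(spark_connector_url)
print(jdbc_driver_url)
```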

Create a Snowflake connection

To create your Snowflake connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, my-new-snowflake-connection).
  4. Select External Custom connectors.
  5. For JDBC URL, enter the JDBC URL for your database. For example:
    jdbc:snowflake://<account_name>.snowflakecomputing.com/?db=<database_name>&warehouse=<warehouse_name>&role=<role_name>

    Only the db parameter is required; the schema, warehouse, and role parameters are optional. Parameters must be separated by &.

  6. For Database access credentials, select Connect with Secrets Manager.
  7. For Secrets, enter the secret you created earlier.
    You can also enter your user name and password directly, but this is a less secure option because it stores the credentials in plaintext.

Next, you set up the required connector and driver JAR files.

  8. For Snowflake Spark connector, select Upload JAR file.
  9. Choose the Spark connector JAR file from your local hard drive.

Alternatively, you can enter the S3 location of the JAR file.

  10. For Custom JDBC driver, select Upload JAR file.
  11. Upload the JDBC driver JAR file from your local hard drive.

Alternatively, you can enter the S3 location of the JAR file.

  12. For Class name, enter the class name of the supported connector version (for this post, net.snowflake.spark.snowflake).
  13. Under Network options, if your Snowflake account is AWS PrivateLink enabled, choose the Amazon VPC with which your Snowflake instance is associated.
  14. Choose Create connection.

You also can bring your existing connections to Snowflake. For information about verifying that your connection contains all the required information, see Database dataset developer guide.
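The JDBC URL format used when creating the connection can also be assembled programmatically. A minimal sketch, using hypothetical account, database, warehouse, and role names:

```python
from urllib.parse import urlencode

def snowflake_jdbc_url(account: str, db: str, warehouse: str = None,
                       role: str = None, schema: str = None) -> str:
    """Assemble a Snowflake JDBC URL; only the db parameter is required."""
    params = {"db": db}
    for key, value in (("warehouse", warehouse), ("role", role), ("schema", schema)):
        if value is not None:
            params[key] = value  # optional parameters, separated by &
    return f"jdbc:snowflake://{account}.snowflakecomputing.com/?{urlencode(params)}"

# Hypothetical values for illustration:
url = snowflake_jdbc_url("myaccount", db="SALES_DB", warehouse="COMPUTE_WH", role="ANALYST")
print(url)
```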

Create datasets

We have two tables in Snowflake: ORDER_DATA and CUSTOMER_PRIMARY. To use these tables, we first need to create a dataset for each table.

The ORDER_DATA table has attributes like order_id, item_id, customer_id, product_id, amount, currency, and time_stamp (the transaction date and time).

The CUSTOMER_PRIMARY table has customer_id, first_name, and last_name attributes.

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, order-dataset).
  3. Choose Snowflake as the database connection.
  4. Enter the connection you created (AwsGlueDatabrew-my-new-snowflake-connection).
  5. Verify the JDBC URL shown on the console.
  6. For Table name, enter the name in schema.table format (enter only table if you have already added &schema= in the JDBC URL parameters).
  7. Choose Create dataset.
  8. Repeat these steps for the CUSTOMER_PRIMARY table and name the new dataset customer-primary-data.

You can see both datasets are available to use on the Datasets page.
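Dataset creation can also be automated. The following sketch builds request parameters shaped for the DataBrew CreateDataset API (as exposed by the boto3 "databrew" client); the PUBLIC schema is an assumption for illustration, and the boto3 call is commented out because it requires AWS credentials:

```python
# Build CreateDataset request parameters for a Snowflake-backed DataBrew dataset.
def dataset_kwargs(name: str, connection: str, table: str) -> dict:
    return {
        "Name": name,
        "Input": {
            "DatabaseInputDefinition": {
                "GlueConnectionName": connection,
                "DatabaseTableName": table,  # schema.table format
            }
        },
    }

# Connection and dataset names from this post; "PUBLIC" schema is assumed.
order_dataset = dataset_kwargs("order-dataset",
                               "AwsGlueDatabrew-my-new-snowflake-connection",
                               "PUBLIC.ORDER_DATA")
customer_dataset = dataset_kwargs("customer-primary-data",
                                  "AwsGlueDatabrew-my-new-snowflake-connection",
                                  "PUBLIC.CUSTOMER_PRIMARY")

# With AWS credentials configured:
# import boto3
# boto3.client("databrew").create_dataset(**order_dataset)
# boto3.client("databrew").create_dataset(**customer_dataset)
```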

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Create project.
  2. For Project name, enter my-snowflake-proj.
  3. For Attached recipe, choose Create new recipe.
    The recipe name is populated automatically.
  4. For Select a dataset, select My datasets.
  5. For Dataset name, select order-dataset.
  6. For Role name, choose the IAM role to be used with DataBrew.
  7. Choose Create project.

You can see a success message along with a 500-row sample of our Snowflake ORDER_DATA table.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Open a Snowflake project and build a transformation recipe

In a DataBrew interactive session, you can cleanse and normalize your data using over 250 built-in transforms. In this post, we use DataBrew to identify premium customers by performing a few transforms and finding customers with order amounts greater than $3,000.

To do this, we perform the following steps:

    1. Batch delete the unneeded columns: ORDER_ID, ITEM_ID, TIME_STAMP, and TRANSACTION_DATE.
    2. Because the AMOUNT column is of data type string, we convert it to float.
    3. We filter the rows based on an AMOUNT value greater than or equal to $3,000 and add the condition as a recipe step.

Now we have all customer orders with amounts greater than $3,000, which are our premium customers. But we don’t know their names. We join the customer-primary-data dataset to find out the names based on the common column CUSTOMER_ID.

  1. Choose Join.
  2. For Select dataset, choose customer-primary-data.
  3. Choose Next.
  4. For Select join type, select Left join.
  5. For Join keys, choose CUSTOMER_ID for Table A and deselect CUSTOMER_ID for Table B.
  6. Choose Finish.
  7. Choose Schema and move the customer name columns FIRST_NAME and LAST_NAME to the top columns.

The final data shows customer names with orders greater than $3,000.
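The recipe logic above is equivalent to the following pandas sketch (with made-up sample rows), which mirrors the column delete, type cast, filter, and left join steps:

```python
import pandas as pd

# Tiny stand-ins for the Snowflake tables; sample values are made up.
# Step 1 (column delete) is implicit: the unneeded columns are simply omitted here.
orders = pd.DataFrame({
    "CUSTOMER_ID": [1, 2, 3],
    "PRODUCT_ID": [10, 11, 12],
    "AMOUNT": ["5000.00", "1200.50", "3400.75"],  # string type, as in the source table
    "CURRENCY": ["USD", "USD", "USD"],
})
customers = pd.DataFrame({
    "CUSTOMER_ID": [1, 2, 3],
    "FIRST_NAME": ["Ana", "Ben", "Cleo"],
    "LAST_NAME": ["Silva", "Klein", "Jones"],
})

# Step 2: convert AMOUNT from string to float.
orders["AMOUNT"] = orders["AMOUNT"].astype(float)
# Step 3: keep only orders of $3,000 or more.
premium = orders[orders["AMOUNT"] >= 3000]
# Left join on the common CUSTOMER_ID column to look up customer names.
result = premium.merge(customers, on="CUSTOMER_ID", how="left")
# Move the customer name columns to the front.
name_cols = ["FIRST_NAME", "LAST_NAME"]
result = result[name_cols + [c for c in result.columns if c not in name_cols]]

print(result)
```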

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter premium-customer-orders.

    DataBrew supports writing job outputs in all file formats supported by Snowflake. For this post, we use Parquet as the output format.

  3. For File type, choose PARQUET.
  4. For Role name, choose an existing role or create a new one.
  5. Choose Create and run job.
  6. Navigate to the Jobs page and wait for the premium-customer-orders job to complete.
  7. Choose the Destination link to navigate to Amazon S3 to access the job output.
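The job can also be created and started programmatically. This sketch builds request parameters shaped for the DataBrew CreateRecipeJob API (as exposed by the boto3 "databrew" client); the bucket name and role ARN are hypothetical, and the boto3 calls are commented out because they require AWS credentials:

```python
# Build CreateRecipeJob request parameters; bucket and role ARN are hypothetical.
job_kwargs = {
    "Name": "premium-customer-orders",
    "ProjectName": "my-snowflake-proj",
    "RoleArn": "arn:aws:iam::123456789012:role/DataBrewRole",  # placeholder ARN
    "Outputs": [
        {
            "Location": {"Bucket": "my-databrew-output-bucket", "Key": "premium-orders/"},
            "Format": "PARQUET",  # the output format chosen in this post
        }
    ],
}

# With AWS credentials configured:
# import boto3
# boto3.client("databrew").create_recipe_job(**job_kwargs)
# boto3.client("databrew").start_job_run(Name=job_kwargs["Name"])
```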

Bring data from Amazon S3 back into Snowflake

Snowflake offers several methods to bring data from an S3 data lake back into Snowflake, including ways to automate the process or incorporate it into your existing data pipelines. For more information, see Bulk Loading from Amazon S3 and Automating Snowpipe from Amazon S3.

Clean up

Delete the following resources that might accrue cost over time:

  • The recipe job premium-customer-orders
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project my-snowflake-proj and its associated recipe my-snowflake-proj-recipe
  • The DataBrew datasets and associated AWS Glue connections
  • The uploaded JAR files from your S3 bucket

Conclusion

In this post, we saw how to store credentials securely and create an AWS Glue connection for a Snowflake database. We learned how to use this connection to create a DataBrew dataset for each table, and how to reuse this connection multiple times. We also saw how easily we can bring data from Snowflake into DataBrew and seamlessly apply transformations and run recipe jobs.

DataBrew database support is generally available in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (N. California)
  • US West (Oregon)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • Canada (Central)
  • Europe (Frankfurt)
  • Europe (Ireland)
  • Europe (London)
  • Europe (Paris)
  • Europe (Stockholm)
  • South America (São Paulo)

For more information about JDBC database support in DataBrew, see Database dataset developer guide. For more information about running and automating DataBrew jobs, see Creating, running, and scheduling AWS Glue DataBrew jobs.


About the Authors

Nitya Sheth is a Software Engineer working on AWS Glue DataBrew. He has also worked on AWS Synthetics as well as on user experience implementations for Database, Analytics, and AI AWS consoles. He divides his free time between exploring new hiking places and new books.

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.