Before you can create visuals and dashboards that convey useful information, you need to transform and prepare the underlying data. The range and complexity of data transformation steps required depends on the visuals you would like in your dashboard. Often, the data transformation process is time-consuming and highly iterative, especially when you are working with large datasets.
In this post, we show you how to apply data transformations to COVID-related tweets (with sentiment scores) using AWS Glue DataBrew. DataBrew’s visual, interactive interface lets us apply transformations without writing any code. Examples of the transformations we apply include changing date formats, transforming text strings, and performing table pivots. We then use Amazon QuickSight to visualize the transformed data in a dashboard.
See the AWS architecture diagram below for an overview of the complete and serverless data pipeline.
There are seven main steps in the data pipeline:
- Upload CSV file with tweets to S3
- Trigger AWS Lambda to partition the dataset if the number of rows exceeds a threshold, and store the output in S3. The threshold prevents an out-of-memory failure when computing VADER sentiment scores.
- Use Amazon SQS and AWS Lambda to compute VADER sentiment scores for tweets and store the VADER output as parquet files in S3
- Crawl the VADER output and produce a Data Catalog using AWS Glue, so that the data can be queried easily using Amazon Athena
- Use the Data Catalog connector in AWS Glue DataBrew to create a recipe and perform transformations
- Set up a recipe job to perform the transformation and send the output to an S3 location; repeat step 4 to produce a Data Catalog of the transformed data
- Query and submit the data to Amazon QuickSight to create visuals for the dashboard
Each section below corresponds to one (or more) of the seven steps described above.
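To make step 2 concrete, here is a minimal sketch of the partitioning logic the first Lambda function performs. The function name and the row threshold are assumptions for illustration; the deployed Lambda code in the CloudFormation template may differ in detail.

```python
import csv
import io

# Hypothetical threshold; the actual value is set in the deployed Lambda code.
ROW_THRESHOLD = 10_000

def chunk_csv(csv_text, threshold=ROW_THRESHOLD):
    """Split a CSV (with header) into chunks of at most `threshold` rows,
    so each chunk stays small enough for VADER scoring in Lambda."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    chunks = []
    for start in range(0, len(rows), threshold):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)          # repeat the header in every chunk
        writer.writerows(rows[start:start + threshold])
        chunks.append(buf.getvalue())
    return chunks
```

In the pipeline, each chunk would then be written back to S3, which in turn enqueues an Amazon SQS message for the VADER scoring function.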
Steps 1 to 3: Setting up and deploying the AWS Lambda pipeline
To test the solution we can use the AWS CloudFormation template found here. The AWS CloudFormation template automatically creates the following for you: the S3 bucket to store the CSV and parquet files, the Lambda function to partition the files, the SQS queue, and the VADER Lambda function.
Deploying the CloudFormation Template
- Create an S3 bucket with the name of your choice and note the bucket name. Please refer to this guide for detailed steps on how to create an S3 bucket.
- Upload all four individual zip files to the S3 bucket created in step 1. To do so, download the zip file from here, unzip it, and place each of the four individual zip files into the bucket.
- Open the AWS Management Console in the AWS Region you want to deploy the solution to, and on the Services menu, choose CloudFormation.
- Choose Create Stack, choose Upload a template to Amazon S3, and then choose the file databrew-cloudformation.yaml included in the solution you downloaded earlier (if you don’t have it, use the link provided in this step to download it).
- Set Stack name to databrew-stack. Specify the Amazon S3 bucket that contains the compressed version of AWS Lambda function code and layers uploaded in step 2.
- For Options, you can specify tags for your stack and an optional IAM role to be used by AWS CloudFormation to create resources. If the role isn’t specified, a new role is created. You can also perform additional configuration for rollback settings and notification options.
- The review section shows a recap of the information. Be sure to select the three AWS CloudFormation acknowledgements to allow AWS CloudFormation to create resources with custom names on your behalf. Also, create a change set, because the AWS CloudFormation template includes the AWS::Serverless-2016-10-31 transform.
- Click Execute.
- The Outputs for the stack lists all the resources created.
Downloading the dataset
- Download the dataset from here.
Testing the solution
- Navigate to the S3 console.
- Click on the bucket name (created by CloudFormation template and listed in the Outputs section of the stack).
- Create folder named input in the S3 bucket.
- Upload the downloaded dataset to the folder created in step 3.
- Uploading the dataset triggers a Lambda function that chunks the file into smaller files; each chunk then triggers, via an SQS event, a second Lambda function that performs the following preprocessing steps: removing https links, computing VADER sentiment scores for the four categories (compound, positive, negative, and neutral), and saving the result as a parquet file in the S3 path called processedv4.
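A minimal sketch of the link-removal preprocessing step is below. The regular expression and function name are assumptions for illustration; the VADER scoring itself is shown only in comments, since it depends on the third-party vaderSentiment package used by the Lambda function.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def clean_tweet(text):
    """Remove http/https links before sentiment scoring, mirroring the
    preprocessing step described above."""
    return URL_PATTERN.sub("", text).strip()

# Scoring would then use the VADER analyzer, e.g. (requires the
# third-party vaderSentiment package):
#
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   analyzer = SentimentIntensityAnalyzer()
#   scores = analyzer.polarity_scores(clean_tweet(tweet))
#   # scores is a dict with "neg", "neu", "pos", and "compound" keys
```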
Step 4: Setup AWS Glue Data Catalog
An AWS Glue Data Catalog allows us to easily import data into AWS Glue DataBrew. Follow these steps to create a Glue crawler that crawls the raw data with VADER output (partitioned parquet files in S3) and determines the schema:
- Choose a crawler name.
- Use the default options for Crawler source type.
- Provide the S3 location of the parquet files.
- Choose/Create an IAM role that has read/write permissions to S3 and the AWSGlueServiceRole policy attached.
- Set the frequency as Run on demand.
- Choose a database name for the crawler’s output.
- Leave all remaining options as default.
Once the crawler has been created, select and run the new crawler. Upon completion, the table schema is generated and visible within Tables under the Databases section in the AWS Glue console.
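The console steps above can also be expressed programmatically. The sketch below builds the parameter dictionary you would pass to the Glue CreateCrawler API; the function name is an assumption, and the actual boto3 calls are shown only in comments since they require AWS credentials.

```python
def build_crawler_config(name, s3_path, role_arn, database):
    """Return CreateCrawler parameters matching the console steps above.
    Omitting Schedule leaves the crawler as run-on-demand."""
    return {
        "Name": name,
        "Role": role_arn,        # IAM role with S3 read/write permissions
                                 # and the AWSGlueServiceRole policy attached
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

# With boto3 available, you would then run:
#   import boto3
#   glue = boto3.client("glue")
#   cfg = build_crawler_config("tweets-crawler",
#                              "s3://my-bucket/processedv4/",
#                              "arn:aws:iam::123456789012:role/GlueRole",
#                              "tweets_db")
#   glue.create_crawler(**cfg)
#   glue.start_crawler(Name=cfg["Name"])
```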
Step 5: Creating a DataBrew project
To get started with AWS Glue DataBrew, complete the following steps:
- On the DataBrew console, choose Projects.
- Choose Create project.
- For Project Name, enter covid19-tweets.
- For Attached recipe, choose Create new recipe.
- For Recipe name, enter covid19-tweets-recipe.
- For Select a dataset, select New dataset.
- For Dataset name, enter covid19-tweets-data.
- Import the AWS Glue table <YOUR-GLUE-TABLE-NAME> from the AWS Glue database <YOUR-GLUE-DATABASE-NAME>.
- Create a new AWS Identity and Access Management (IAM) policy and IAM role by following the steps on the AWS Glue DataBrew console, which provides DataBrew the necessary permissions to access Amazon S3, Amazon Athena and AWS Glue.
- Select Create Project.
Step 6: Data transformations, creating an AWS Glue DataBrew recipe and recipe job
Exploratory data analysis (EDA) is an essential component of the data transformation workflow. EDA allows us to gain an intuitive understanding of the dataset by summarizing its main characteristics such as the distribution of data across columns, the corresponding data types and summary statistics. For more information on EDA using AWS Glue DataBrew, see the Exploratory Data Analysis section of the post Data preprocessing for machine learning on Amazon EMR made easy with AWS Glue DataBrew.
The set of data transformation steps listed below are based on the EDA and the attributes we would like to visualize in the Amazon QuickSight dashboard.
- Delete unneeded columns such as use_name, because they are not required for the Amazon QuickSight dashboard.
Based on the schema view of the dataset, we can observe that the date column is represented as a string. AWS Glue DataBrew provides various date-time transformations to convert such columns into date types.
- From the dropdown, select Date functions, then DATEFORMAT.
- Create a new column based on the values in the source column.
- For source column, select date.
- For date format, select mm/dd/yy HH:MM.
- Delete the date column and rename the date_DATEFORMAT column to date.
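The DATEFORMAT transformation above amounts to parsing the string date and re-emitting it in the target format. A plain-Python sketch is below; the input format string is an assumption and should be adjusted to match the dataset's actual date strings.

```python
from datetime import datetime

def reformat_date(value):
    """Convert a raw string date into the mm/dd/yy HH:MM shape produced
    by the DATEFORMAT transformation. The source format below is an
    assumption for illustration."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%m/%d/%y %H:%M")
```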
A visual representation of the positive, negative, neutral, and compound scores generated by the VADER AWS Lambda function is helpful for understanding the sentiment of the tweets over time. Individual scores can vary widely between rows, leading to large fluctuations on a line graph. We can apply the rolling window transformation over observations in a column to create a smoother, less fluctuating line graph. To apply the rolling window function, first create a less granular version of the date column with the hours and minutes truncated.
- From the dropdown, select Date functions, then DATEFORMAT.
- Create a new column based on the values in the source column
- For source column, select date
- For date format, select mm/dd/yy to transform the observations to a date-level granularity
- For destination column, enter date_level.
- From the dropdown, select Window functions, then ROLLING AVERAGE.
- For source column, select compound
- Select 3 rows before and after. This is the size of the rolling window that slides over the column and affects the smoothness of the average score from one row to the next; larger window sizes produce smoother rolling average scores, and vice versa.
- Order the observations by the date column, because its granularity is at the hour and minute level
- Group the observations by the date_level column, which is only represented in date without the time
- For the destination column name, enter rolling_compound.
- Repeat the rolling average steps above for each of the source columns pos, neu, and neg to create rolling_pos, rolling_neu, and rolling_neg.
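The ROLLING AVERAGE window function can be sketched in plain Python as follows. This is an illustration of the centered-window logic, not DataBrew's implementation; here the window simply shrinks at the edges where fewer rows exist.

```python
def rolling_average(values, before=3, after=3):
    """Centered rolling average using `before` rows before and `after`
    rows after each observation, mirroring the window configured above."""
    result = []
    for i in range(len(values)):
        window = values[max(0, i - before): i + after + 1]
        result.append(sum(window) / len(window))
    return result
```

With the 3-before/3-after window from the steps above, each output row averages up to seven compound scores, which smooths out row-to-row swings in the line graph.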
To support downstream visuals in Amazon QuickSight, apply the unpivot transformation to convert the selected columns into row values.
- Select Pivot transformation
- For pivot type, select Unpivot: Columns to rows
- In the unpivot columns dropdown, select the columns rolling_compound, rolling_pos, rolling_neu, and rolling_neg
- For column name, enter Rolling_category, and for column value, enter VADER_score
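The unpivot transformation can be sketched in plain Python as below. The function name is an assumption; the logic mirrors Unpivot: Columns to rows, turning each selected column of a row into its own (category, value) row.

```python
def unpivot(rows, value_columns,
            name_col="Rolling_category", value_col="VADER_score"):
    """Convert the selected columns of each row into separate rows,
    mirroring the Unpivot: Columns to rows transformation above."""
    out = []
    for row in rows:
        # keep all non-unpivoted columns on every output row
        kept = {k: v for k, v in row.items() if k not in value_columns}
        for col in value_columns:
            out.append({**kept, name_col: col, value_col: row[col]})
    return out
```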
The hashtags column contains an array of hashtags. To visualize the hashtags as a word cloud in Amazon QuickSight, create a new row for each hashtag from the array by applying the unnest transformation.
- Select unnest
- For source column, select hashtags
- For the destination column name, enter unnest_hashtags
Normalize the data in the unnest_hashtags column
- Format the data in unnest_hashtags column to lowercase
- Remove all special characters by selecting the clean transformation
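The unnest-and-normalize steps above can be sketched together in plain Python. The function name and the exact definition of "special characters" are assumptions for illustration; here anything outside lowercase letters and digits is stripped.

```python
import re

NON_ALNUM = re.compile(r"[^a-z0-9]")

def unnest_hashtags(rows):
    """Create one row per hashtag (unnest), then normalize it:
    lowercase and strip special characters, as in the steps above."""
    out = []
    for row in rows:
        for tag in row.get("hashtags", []):
            clean = NON_ALNUM.sub("", tag.lower())
            if clean:  # drop tags that were only special characters
                out.append({**row, "unnest_hashtags": clean})
    return out
```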
The data preparation phase is now complete, and the set of 20 transformations that consist of date formatting, rolling window functions and normalized columns are combined into a recipe.
The transformations above were applied on a sample of the first 500 rows of the dataset. AWS Glue DataBrew recipe job provides the ability to scale the set of transformation steps from a sample of data to the entire dataset.
To create a recipe job, complete the following steps:
- On the AWS Glue DataBrew console, choose Jobs.
- Choose Create recipe job.
- For Job name, enter a name.
- Create a new folder in Amazon S3 for the recipe job output, select the file type as GLUEPARQUET and compression as Snappy.
If your dataset is updated on a regular basis, AWS Glue DataBrew provides an option to schedule jobs.
To query the newly transformed data from S3 into Amazon QuickSight, create another new crawler/table in AWS Glue, similar to the steps provided earlier (refer to the section Step 4: Setup AWS Glue Data Catalog).
Use Amazon QuickSight to visualize transformed data
Before proceeding, make sure your Amazon QuickSight account has IAM permissions to access Amazon Athena and S3. Add a new dataset by choosing Datasets in the left panel and then New dataset. Select Athena from the list of data sources and provide a Data source name. When prompted, select the database and table that contain the data produced by the AWS Glue DataBrew transformation, choose Use custom SQL, change the New custom SQL name to TableOrderedData, and paste the following SQL query before selecting Edit/Preview data:
Replace <db> with your database and <table_name> with your table name. Leave the quotes in place.
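The exact query from the original post is not reproduced here; a plausible shape, built as a string with the placeholders above, might look like the following. The ORDER BY column is an assumption.

```python
def build_custom_sql(db, table_name):
    """Build a TableOrderedData-style query string. This is an assumed
    query shape, not the post's exact SQL; keep the double quotes
    around the database and table names."""
    return f'SELECT * FROM "{db}"."{table_name}" ORDER BY date'
```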
Click Apply to make some further modifications.
- For the date column, change the data type from String to Date and provide the format of the date as it appears in the column (i.e., MM/dd/yy HH:mm). Similarly, change the date_level column to Date.
- Click Save and visualize to proceed to the next step: analyzing the data. In the Analyses section, select Horizontal bar chart under Visual types, then select vader_score and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate.
- For this visual, select Actions and select Filter same-sheet visuals under Quick create.
- Create another panel by selecting Add, to build a line graph of the rolling scores over time. Select Line chart under Visual types and select the following in this order: vader_score, date, and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate. Likewise, change date (Day) to date (Hour). As with the previous panel, create a quick action item.
- Further panels can also be created, such as a word cloud of hashtags or a pivot table of the different VADER categories with score and date. The following image is based on selecting the rolling_pos bar in the first quadrant, which filters the remaining panels down to the rolling_pos category.
- The first quadrant depicts the overall average of each category across the entire dataset. The second quadrant depicts the rolling average aggregated by hour. The third quadrant is a table depicting the rolling average aggregated by day. The word cloud panel is built from the hashtags column.
In this post, we showed you how to interactively analyze and visualize a dataset using AWS Lambda, AWS Glue DataBrew, and Amazon QuickSight. We began with a dataset of COVID-19 tweets and computed their sentiment scores using the VADER algorithm. We then cleaned and prepared the tweets and their sentiment scores in AWS Glue DataBrew, and finally visualized key characteristics of the dataset in Amazon QuickSight.
We encourage you to extend this pipeline to your own data analytics and visualization use cases; there are many more pre-built transformations in AWS Glue DataBrew and pre-built visuals in Amazon QuickSight to explore.
About the Authors
Prithiviraj Jothikumar, PhD, is a Data Scientist with AWS Professional Services, where he helps customers build solutions using machine learning. He enjoys watching movies and sports and spending time to meditate.
Kartik Kannapur is a Data Scientist with AWS Professional Services. He holds a Master’s degree in Applied Mathematics and Statistics from Stony Brook University and focuses on using machine learning to solve customer business problems.
Bala Krishnamoorthy is a Data Scientist with AWS Professional Services, where he helps customers solve problems and run machine learning workloads on AWS. He has worked with customers across diverse industries, including software, finance, and healthcare. In his free time, he enjoys spending time outdoors, running with his dog, beating his family and friends at board games and keeping up with the stock market.
Paritosh Walvekar is a Cloud Application Architect with AWS Professional Services, where he helps customer build cloud native applications. He has a Master’s degree in Computer Science from University at Buffalo. In his free time, he enjoys watching movies and is learning to play the piano.