Democratizing Data Preparation with AWS Glue DataBrew
By Federico Allocati & Jonathan Zambiazzo, Machine Learning Engineers @ Edrans
As an AWS Global Consulting Premier Partner, we at Edrans have the opportunity to attend AWS re:Invent, the annual event where AWS announces its latest product launches and service updates to its partners. In the 2020 edition, AWS launched Glue DataBrew, a new serverless, fully managed service for visual data preparation.
It seeks both to streamline data engineering tasks for advanced users and to bring these capabilities to users without a technical background.
In this post, we’ll take a look at this new service and walk through its main features and capabilities, including storing a group of transformations as a Recipe, launching Jobs from these Recipes, defining Data Quality rules to run on Data Profiling Jobs (to sanity-check the data), and more!
Introducing Glue DataBrew
Unlike other data processing services offered by AWS, DataBrew was born with the goal of making modern data engineering capabilities available to any user, without the need to write a single line of code: both for simple exploratory data analysis tasks and for most ETL needs, such as data cleaning and preparation.
With our recently achieved AWS Machine Learning Competency, we know first-hand how laborious and demanding the data preparation stage can be. Glue DataBrew gives us access to most of the commonly used transformations, such as:
- Handling of null values
- Filtering, splitting, merging, grouping and joining the data
- Data type formatting and date parsing
- Deduplication of records and outlier detection
- Easy creation of new variables via integrated functions such as aggregate & window functions, text processing, mathematical functions, and so on
As part of the AWS data analytics suite, Glue DataBrew also integrates with all major AWS data sources: a repository or data lake on S3, a data warehouse on Amazon Redshift, and databases such as Amazon Aurora and other RDS engines. It is also tightly integrated with the AWS Glue Data Catalog.
Glue DataBrew Walkthrough
The first thing that we need to do is connect to our working Dataset, give it a name, and select the data source and format. As you can see below, there are many different available options for the input sources such as:
- Straight upload of the dataset file
- Connect to S3 directly or through AWS Lake Formation
- Connect to a DBMS by using JDBC
- Connect to AWS Glue Data Catalog
- Connect to data in many other places such as AWS Data Exchange, Amazon Redshift, Amazon AppFlow, and Snowflake
In this example, we choose data from an S3 bucket by entering the input path and selecting the format of the files (JSON, CSV, Parquet, or Excel at the time of writing).
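For teams that prefer to script this setup, the same dataset registration can be done through the AWS SDK. The sketch below builds the request for boto3's `create_dataset` call; the bucket, key, and dataset names are hypothetical placeholders.

```python
# Sketch of registering the S3 dataset via the boto3 DataBrew API instead of
# the console. All names below are hypothetical placeholders.
dataset_params = {
    "Name": "customers-dataset",
    "Format": "CSV",  # also JSON, PARQUET or EXCEL
    "FormatOptions": {"Csv": {"Delimiter": ",", "HeaderRow": True}},
    "Input": {
        "S3InputDefinition": {
            "Bucket": "my-databrew-demo-bucket",
            "Key": "raw/customers.csv",
        }
    },
}

# With AWS credentials configured, the actual call would be:
#   import boto3
#   boto3.client("databrew").create_dataset(**dataset_params)
```

As noted above, this only registers a read-only pointer to the data; nothing is copied or modified at the source.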
Note that connections are always read-only: DataBrew reads the data and leaves it in place.
A project in DataBrew stores all the information necessary to visually explore, understand, combine, clean, and normalize the data. It is the workspace to hold information about your data, the transformations that you make to it, and the jobs that you schedule to transform it.
A project is linked to a Dataset, and has an attached data transformation Recipe which will define the desired data cleaning steps to perform on the data. You can create a new one, or attach an existing one.
Note that when working interactively, DataBrew operates on a sample of the dataset. The default sample size is 500 rows. Smaller samples allow for faster transformations, saving time as you develop Recipes, whereas larger samples more accurately reflect the makeup of the underlying source data, at the cost of slower session initialization and slower interactive transformations.
Also, as usual with managed AWS services, you have to create/choose the Role that will execute the connection to the data, and you can apply Tags to the AWS resources as needed.
A Data Profile is a report that examines the data to provide information about its shape, context, structure, and relationships, along with statistics and visualizations most commonly used for exploratory data analysis.
First, we can see a summary of the dataset, correlations, and boxplots for the numeric variables. Following that, we get an overview of the statistics for each individual column in the dataset, including a histogram with the distribution.
In the case of numeric values, it shows us a boxplot chart that can help us detect anomalies and get a better feel of the value distribution, and many summaries such as minimum and maximum values, quartiles, mean and median.
For string variables, the available statistics are the counts of unique and distinct values and the minimum and maximum string lengths.
To get a more detailed description of each feature, the next stop is the Column Statistics tab.
A particularly useful widget on this tab is the data insights.
To generate a data profile, you must run a Data Profiling Job on the desired dataset. For this, you configure the profiling scope (both dataset-level and column-level settings) and set the number of compute nodes, the IAM role, the job timeout and retries, and so on.
Finally, you can associate up to two schedules to run the profiling job automatically, and add tags as needed.
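For reference, the same profiling job could be sketched through boto3's `create_profile_job`. The names and role ARN below are placeholders, and the column-level scope here simply selects every column by regex.

```python
# Sketch of a Data Profiling Job definition for boto3's create_profile_job.
# Names and the role ARN are hypothetical placeholders.
profile_job_params = {
    "Name": "customers-profile-job",
    "DatasetName": "customers-dataset",
    "RoleArn": "arn:aws:iam::123456789012:role/DataBrewRole",
    "OutputLocation": {"Bucket": "my-databrew-demo-bucket", "Key": "profiles/"},
    "MaxCapacity": 5,   # number of compute nodes
    "Timeout": 2880,    # minutes
    "MaxRetries": 0,
    # Column-level scope: profile every column (selected by regex)
    "Configuration": {"ProfileColumns": [{"Regex": ".*"}]},
}

# boto3.client("databrew").create_profile_job(**profile_job_params)
```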
Data Quality Rules
In order to ensure our data meets minimum quality criteria, we can define a set of rules that will be validated before the data is ingested from the sources.
We will set a rule ensuring that our Age variable falls between a minimum and a maximum value. If the condition is not met, an alert is written to the logs.
The results of these quality rules appear in a Validation Report as JSON logs, detailing each rule and its configuration and, in case of failure, specifying which rules in the set were not satisfied by the input data.
Once the DQ Ruleset is created, it can be attached to a Profiling Job, which allows users to automatically perform the required data sanity checks each time the Profile is run.
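In the API, such a ruleset is expressed as check expressions with a substitution map. The sketch below shows what the Age-range rule might look like for boto3's `create_ruleset`; the ruleset name, target ARN, and bounds are hypothetical.

```python
# Sketch of a DQ Ruleset for boto3's create_ruleset.
# Name, target ARN and the 0-120 bounds are hypothetical placeholders.
ruleset_params = {
    "Name": "customers-dq-rules",
    "TargetArn": "arn:aws:databrew:us-east-1:123456789012:dataset/customers-dataset",
    "Rules": [
        {
            "Name": "age-in-range",
            # Check expressions use placeholders resolved via the map below
            "CheckExpression": ":col1 between :val1 and :val2",
            "SubstitutionMap": {":col1": "`Age`", ":val1": "0", ":val2": "120"},
        }
    ],
}

# boto3.client("databrew").create_ruleset(**ruleset_params)
```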
Besides data exploration and validation, Glue DataBrew excels at data transformation tasks. To do this, the first step is to open the Recipe associated with the current project, which will open a DataBrew interactive session.
When the session is loaded, we can start adding transformations that will be sequentially applied to our dataset. Every transformation we make is added as an Applied Step within the Recipe, which allows us to easily chain, reorder and manipulate them.
Below we will show a couple of the different transformations available, beginning with a simple Merge of 2 columns (FIRST and LAST) to create a new FULLNAME variable. You can optionally specify a separator string, to insert between the merged values.
A caveat of the Merge transformation is that it drops the source columns. If you want to preserve them, one option is to use the Create transform instead.
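For reference, the Merge step above corresponds roughly to the following entry in a Recipe exported as JSON (the parameter keys are an approximation of the exported DSL, not an exact schema):

```python
import json

# Approximate exported-Recipe JSON for merging FIRST and LAST into FULLNAME.
merge_step = {
    "Action": {
        "Operation": "MERGE",
        "Parameters": {
            # column lists are serialized as JSON strings in the export
            "sourceColumns": json.dumps(["FIRST", "LAST"]),
            "delimiter": " ",
            "targetColumn": "FULLNAME",
        },
    }
}
```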
Next, we want to generate age groups (Young, Adult, Senior) from the Age variable. To accomplish this, DataBrew offers the Binning transformation, which allows us to discretize numeric data (either into fixed ranges or by the percentage of values in each bin).
A nice bonus of this Transformation is that, when setting it up, it displays a histogram of the variable, which can be used to pick the proper ranges for binning.
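The fixed-range variant of this binning amounts to a set of cut-offs. A plain-Python sketch of the same logic (the 30 and 60 thresholds are illustrative choices, not DataBrew defaults):

```python
# Fixed-range binning of Age into Young/Adult/Senior.
# The 30/60 cut-offs are illustrative, not DataBrew defaults.
def age_group(age):
    if age < 30:
        return "Young"
    if age < 60:
        return "Adult"
    return "Senior"
```

Picking those thresholds is exactly where the histogram shown in the Binning setup screen helps.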
Finally, one of the most common transformation steps needed for machine learning is mapping categorical values to a range of indexes, to be used either as input features or as the target to predict. To achieve this, we will use the Categorical Mapping transformation, mapping the GENDER variable to numbers (male to 1 and female to 2) in a new column, GENDER_CLASS.
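The mapping itself is tiny; in plain Python it amounts to a lookup written to a new field, leaving the source column untouched (GENDER values are assumed to be lowercase strings here):

```python
# Equivalent of the Categorical Mapping step: male -> 1, female -> 2, written
# to a new GENDER_CLASS field so the GENDER source column is preserved.
GENDER_MAP = {"male": 1, "female": 2}

def add_gender_class(row):
    row = dict(row)  # copy, leaving the input row untouched
    row["GENDER_CLASS"] = GENDER_MAP.get(row["GENDER"])
    return row
```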
DataBrew also offers us the possibility to publish our recipes, which allows us to share our work with other collaborators, and version the Recipes for proper tracking of the data lineage.
Finally, once we have defined our transformations and published our Recipe, we can move on to the creation of a job that will execute it.
Pretty much the only thing you need to define is the output location and format. Possible destinations include Amazon S3, Amazon Redshift, a Snowflake warehouse, or a JDBC connection.
When working with S3, DataBrew offers a wide variety of formats (CSV, Parquet, Avro, ORC, XML, and JSON, among others), as well as compression options such as gzip and Snappy.
Besides that, like on any other Job, we need to configure the number of compute nodes, IAM roles, job timeouts, etc. Also, like with Profiling Jobs, schedules can be attached to the Job to automate it.
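Scripted, a recipe job with Parquet/Snappy output to S3 might be defined like this for boto3's `create_recipe_job` (names and role ARN are placeholders):

```python
# Sketch of a recipe job definition for boto3's create_recipe_job.
# Names and the role ARN are hypothetical placeholders.
recipe_job_params = {
    "Name": "customers-transform-job",
    "DatasetName": "customers-dataset",
    "RoleArn": "arn:aws:iam::123456789012:role/DataBrewRole",
    "RecipeReference": {"Name": "customers-recipe", "RecipeVersion": "1.0"},
    "Outputs": [{
        "Location": {"Bucket": "my-databrew-demo-bucket", "Key": "clean/"},
        "Format": "PARQUET",
        "CompressionFormat": "SNAPPY",
    }],
    "MaxCapacity": 5,   # compute nodes
    "Timeout": 2880,    # minutes
}

# client = boto3.client("databrew")
# client.create_recipe_job(**recipe_job_params)
# client.start_job_run(Name=recipe_job_params["Name"])
```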
It’s important to note that Glue DataBrew jobs are separate from standard Glue jobs.
After working with the tool and experimenting with it ourselves, we gathered some insights for users to keep in mind when using DataBrew:
- Profiling Jobs + DQ-Rules are a great way of getting a detailed report of the input data
- Schedule functionality allows us to periodically run jobs
- When exporting to S3, you can automatically create the schema in Glue DataCatalog
- Transformations support regular expressions
- It is super easy to reorder the different steps within a recipe, or remove any of them
- It supports AWS KMS for hashing and encryption of columns, though this incurs extra costs
- There aren’t (yet) any transformations or functions for Geospatial data
- Some transformations drop the source column (e.g. the Merge transformation); use the Create transform instead if you want to keep the source columns
- Recipes can be exported, but only in a custom DSL (as JSON or YAML), and re-implementing it on Spark or plain Python is not straightforward, which can be a major limitation when porting the transformation steps to Glue or other platforms
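To make that last point concrete, here is a minimal sketch of what replaying exported Recipe steps outside DataBrew involves: a hand-written interpreter with one branch per operation. The operation names and parameter keys below mirror the exported DSL only approximately (in particular, `categoryMap` is a hypothetical key).

```python
import json

# Minimal, approximate interpreter for two exported Recipe operations.
# Parameter keys are an approximation of the DSL, not the exact schema.
def apply_step(row, step):
    op = step["Action"]["Operation"]
    params = step["Action"]["Parameters"]
    row = dict(row)
    if op == "MERGE":
        cols = json.loads(params["sourceColumns"])
        row[params["targetColumn"]] = params["delimiter"].join(row[c] for c in cols)
    elif op == "CATEGORICAL_MAPPING":
        mapping = json.loads(params["categoryMap"])  # hypothetical key
        row[params["targetColumn"]] = mapping.get(row[params["sourceColumn"]])
    else:
        raise NotImplementedError(op)
    return row
```

Every operation a Recipe uses needs its own branch like these, which is why porting non-trivial Recipes by hand quickly becomes impractical.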
The data science and engineering technology stack grows more complex by the day and, at Edrans, we try to continuously expand our expertise to stay ahead of emerging methods and tools. AWS Glue DataBrew’s functionalities and ease of use were truly outstanding to us.
At first sight, the UI feels modern and clean, the toolbar is intuitive (everything is where you expect it to be), and the side-bar help is always concise and spot on. Props to DataBrew’s user experience team.
Since it can take less than five minutes to have everything set up and ready in an interactive session, we consider that it definitely accomplishes the goal of bringing data preparation to every kind of user, regardless of technical background, prior data engineering experience, or coding skills.
The integration with a wide variety of data sources and destinations, along with the number of ready-to-use transformations, is another highlight.
It would be great if, in the near future, DataBrew Recipes could be integrated with classic Glue, as they are currently completely independent services (beyond branding).
We consider that DataBrew is a bet by AWS in the right direction, expanding the range of users, and facilitating the creation of tedious data cleaning jobs under a friendly interface. We are definitely looking forward to seeing how the product and its adoption evolve.
For further digging, AWS provides a full tutorial on the service, with a short walkthrough covering all its main functionalities. Also, don’t forget to check the service’s pricing page.
Reach out to us if you have questions on DataBrew, or for any other data & machine learning skills you may need to empower your business. Edrans is part of a small, select group of AWS Partners worldwide with Premier status, operating primarily in the LATAM region.