Data Preparation Workflows using AWS SageMaker Data Wrangler

Abdurrahman
6 min read · Apr 18, 2023


In the ML project lifecycle, data preparation is the most essential, resource-intensive, and time-consuming task. Data preparation consists of multiple steps such as data cleaning, transformation, and feature engineering, and it typically takes about 80% of a project's time. AWS provides a service called AWS SageMaker Data Wrangler that simplifies data preparation and feature engineering and covers each step of the data preparation workflow. It contains over 300 built-in data transformations, so you can quickly transform data without writing any code, which greatly simplifies ETL jobs. Data workflows can be exported from Data Wrangler to a notebook or script so they can be automated with SageMaker Pipelines. You can read more about this service here. AWS provides 25 hours of an ml.m5.4xlarge instance in the free tier; you can read more about pricing here.

The idea here is to understand how to leverage the AWS SageMaker Data Wrangler service. I will be using the famous Titanic dataset for demo purposes, with a step-by-step explanation of how you can use Data Wrangler in practice. Log into your AWS account; if you don't have one, you can set up a free-tier account here.

Step-1 (Setting up the S3 bucket & uploading the data): First, we need to create an S3 bucket and upload our data. From the Services option at the top left of the AWS Console, search for S3 and go to it. Click 'Create Bucket', give it a custom name, and either leave the other settings as default or change them as per your requirements; for demo purposes, I will leave everything as default. Inside the bucket, upload the data (all of this setup can be seen in the gif below).
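If you prefer to script this step instead of clicking through the console, a minimal boto3 sketch looks like the following. The bucket name, region, and file path are placeholders, not values from this demo:

```python
# Optional: the same setup done programmatically with boto3 instead of the console.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

bucket_name = "my-titanic-demo-bucket"  # placeholder; must be globally unique

# In us-east-1 the CreateBucketConfiguration must be omitted;
# for any other region, pass CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=bucket_name)

# Upload the Titanic CSV so Data Wrangler can import it later.
s3.upload_file("titanic.csv", bucket_name, "titanic.csv")
```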

Step-2 (Setting up Data Wrangler): From the Services option at the top left of the AWS Console, search for SageMaker and go to it. On the left sidebar you will see 'Studio'; click on it, and it will take you to the Studio homepage where you will find 'Create a SageMaker Domain'. Go ahead and create a domain, as it is required to start Studio; you can see how to set up a SageMaker Domain in step one of my previous article here. After your domain is created, go to the dropdown and click 'Studio' to open AWS SageMaker Studio. In my case, I already have a domain set up, so I will use the same one to open Studio (as seen in the gif below). It might take a few minutes for Studio to start.

On the Studio home screen a Launcher tab is opened by default, and on the left sidebar you can find different options. In this article we will explore the 'Data -> Data Wrangler' tab. Click on the 'Data' dropdown to find Data Wrangler. It will take you to another tab; click on 'Create data wrangler flow', which might take several minutes to set up and start. The 'Create Connection' screen then lists the different sources from which you can import your data. In our case it is S3, so click on S3, find the bucket you created and the dataset inside it, and as soon as you select the data file it will detect the columns and details; click Import. You will then see three tabs: 'Data', 'Analysis', and 'Training'. The Data tab is where you see the data, and on the right-hand side you can add transformation steps. The Analysis tab is where you can do data analysis, feature scaling, and much more. On the Training tab, you can train a model and export the data and model. (The process of setting up Data Wrangler can be seen in the gif below.)
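If you want to confirm from code whether your account already has a SageMaker domain in the current region (rather than checking the Studio page), a small boto3 sketch:

```python
# Optional check: list existing SageMaker domains before creating a new one.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")  # placeholder region

domains = sm.list_domains()["Domains"]
if domains:
    for d in domains:
        print(d["DomainName"], d["Status"])
else:
    print("No domain found — create one from the SageMaker Studio page first.")
```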

Step-3 (Analysis Tab - Data Analysis): If you click on 'Data Flow' at the top left above the Data tab, you can see the visual workflow of the data preparation steps. Here you will add different analyses and transforms to create a full data preparation and analysis workflow, which can be exported as well. In the flow, click on the + sign on the Data Types node and click 'Add Analysis'. It will take you to the Analysis tab, and on the right sidebar you will find 'Analysis Type'. There are many different types of analysis, such as Data Quality reports, Bias reports, and visualizations like Histograms, Feature Importance, and Table Summary, which you can use according to your use case. For demo purposes, I will show a few analyses and how to add them to the flow (as seen in the gif below).
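Data Wrangler builds these analyses for you without any code. Purely for intuition, the Table Summary and Histogram analyses roughly correspond to the pandas/matplotlib calls below; the column names assume the standard Kaggle Titanic schema (Survived, Age, Fare, ...):

```python
# Rough code equivalents of two Data Wrangler analyses, for intuition only.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")

# "Table Summary" ≈ per-column statistics
print(df.describe(include="all"))

# "Histogram" of Age, split by the Survived label
df.hist(column="Age", by="Survived", bins=20)
plt.tight_layout()
plt.show()
```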

Step-4 (Data Transformations): If you are on the 'Analysis' tab, go to the 'Data' tab, or go back to the data flow, click on the + sign, and select 'Add transform'. On the right sidebar, click 'Add Step' and you will notice that Data Wrangler makes many transforms available to apply with a click. You can scroll through and explore all of the available transformations. For demo purposes, I will show some of the transformations and how the data flow looks once they are added (as seen in the gif below). After adding the required transformations you can export your data, train a model on it, add a destination S3 bucket to run training and save data, add the transformed data to AWS Feature Store, export your data workflow to S3 via a Jupyter notebook, or export it to SageMaker Pipelines, Python code, and more. If you click on 'Export to Amazon S3 via Jupyter Notebook', a Jupyter notebook is generated for you with all the steps, from configuring input and output sources to exporting the data flow results to an S3 bucket. Towards the end of the notebook you will also find optional code that lets you train a SageMaker model (all of the above can be seen in the gif below).
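To make the transformation steps more concrete, here is an illustrative pandas sketch of what a few typical Titanic transforms (dropping identifier columns, imputing Age, one-hot encoding categoricals) do under the hood. Data Wrangler applies its own managed versions of these through the UI, so this code is only an assumption-laden approximation, not the generated notebook:

```python
# Illustrative pandas equivalents of a few common Data Wrangler transforms.
# The exact transforms you add in the UI may differ.
import pandas as pd

df = pd.read_csv("titanic.csv")

# Drop columns that are not useful as features
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Impute missing Age values with the median (a "handle missing" style transform)
df["Age"] = df["Age"].fillna(df["Age"].median())

# One-hot encode categorical columns (an "encode categorical" style transform)
df = pd.get_dummies(df, columns=["Sex", "Embarked"])

print(df.head())
```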

You can try running the cells from the start up to 'Job Status & S3 Output Location', leaving out any optional code in between and towards the end. After you execute these cells, you will see an output location; go there and you will find the resulting transformed CSV file created for you. If you want, you can also run the optional code and train a SageMaker model, but it might incur charges if you exceed the free-tier limits.
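Once the export job finishes, you can also confirm the output from code. A small boto3 sketch; the bucket and prefix below are placeholders, so use the S3 output location printed by the notebook:

```python
# List the transformed output files written by the Data Wrangler export job.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-titanic-demo-bucket",   # placeholder bucket
                          Prefix="export-flow-output/")      # placeholder prefix
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```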

Important!! You need to delete all the instances, apps, and kernel sessions running in your Studio. If you don't, AWS will bill you for these resources. To avoid unintentional extra costs, select the running resources from the left sidebar and click the Quit icon (the deletion process can be seen below).
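If you prefer to double-check from code that nothing is left running, the sketch below lists and deletes Studio apps with boto3. The domain ID and user profile name are placeholders, so substitute your own values:

```python
# Optional cleanup check: list running Studio apps and delete them with boto3.
import boto3

sm = boto3.client("sagemaker")

domain_id = "d-xxxxxxxxxxxx"        # placeholder
user_profile = "my-user-profile"    # placeholder

apps = sm.list_apps(DomainIdEquals=domain_id,
                    UserProfileNameEquals=user_profile)["Apps"]
for app in apps:
    if app["Status"] != "Deleted":
        print("Deleting", app["AppType"], app["AppName"])
        sm.delete_app(DomainId=domain_id,
                      UserProfileName=user_profile,
                      AppType=app["AppType"],
                      AppName=app["AppName"])
```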

This should help you get started with the AWS SageMaker Data Wrangler service and create your own data preparation workflows. If you found this helpful, follow for more such content. You can also connect with me on LinkedIn to discuss Data Science and MLOps.
