AutoML using AWS SageMaker Autopilot
Automated Machine Learning (AutoML) is a fast-growing area of the industry that aims to simplify and automate the full machine-learning process while keeping results explainable and reproducible, and in doing so it has helped democratize machine learning. Cloud providers are investing heavily here, offering platforms and services that let us spend less time on low-level details and more time using ML to improve business outcomes. AWS offers an AutoML service called AWS SageMaker Autopilot, which provides automatic data pre-processing, feature engineering, automatic model selection, a model leaderboard, feature importance, a customizable AutoML journey, and automatic notebook creation. You can read more about the service here and about its pricing here.
In this article, I will show step-by-step how you can use Autopilot for the full machine-learning process, from data analysis to deploying the best model as an endpoint. I will be using a sample dataset for demo purposes; the idea is to understand how we can leverage the AutoML service provided by AWS. Log into your AWS account; if you don't have one, you can set up a free-tier account here.
Step-1 (Setting up data & S3 bucket): First, we need to create an S3 bucket. From the Services option at the top left of the AWS Console, search for S3 and open it. Click 'Create Bucket' and give it a custom name; you can leave the other settings as default or adjust them to your requirements (for demo purposes, I will leave everything as default). Inside the bucket, create two folders: 'input', into which you upload the data, and 'output', where the artifacts will be stored later. (All of this setup can be seen in the gif below.)
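If you prefer to script this step rather than click through the console, the same setup can be sketched with boto3. This is a minimal sketch assuming your AWS credentials are already configured; the bucket and file names below are placeholders for illustration only.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// path you will later paste into the Autopilot form."""
    return f"s3://{bucket}/{key}"

def setup_bucket(bucket: str, local_data_file: str) -> None:
    """Create the bucket and upload the dataset under the 'input' folder."""
    import boto3  # deferred import so the sketch can be read without AWS set up
    s3 = boto3.client("s3")
    # Default settings, as in the demo; outside us-east-1 you must also pass
    # CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
    s3.create_bucket(Bucket=bucket)
    s3.upload_file(local_data_file, bucket, f"input/{local_data_file}")

# Hypothetical names for illustration:
input_location = s3_uri("my-automl-demo-bucket", "input/data.csv")
output_location = s3_uri("my-automl-demo-bucket", "output/")
```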
Step-2 (Setting up the Autopilot experiment): From the Services option at the top left of your AWS Console, search for SageMaker and open it. On the left sidebar you will see 'Studio'; click on it, and it will take you to the Studio homepage, where you will find 'Create a SageMaker Domain'. Go ahead and create a domain, as it is required to start Studio; you can find how to set up a SageMaker Domain in step one of my previous article here. After the domain is created successfully, go to the dropdown and click 'Studio' to open AWS SageMaker Studio. In my case, I already have a domain set up, so I will use the same one to open my Studio (as seen in the gif below). It might take a few minutes to start Studio. On the Studio home screen, the Launcher tab opens by default; on the left sidebar you will find different options, and in this article we will explore the 'AutoML' tab. Click on it, and it will take you to a tab listing all the AutoML experiments. Start creating an experiment by clicking 'Create AutoML Experiment'. Give the experiment a name, then connect the data file inside your S3 bucket's input folder and provide the output folder location, either by typing the path or by using the browse-S3-bucket option. You can select the 'Auto Split' option or provide a custom split of the dataset. On the next page, select your target variable; Autopilot automatically detects the features in the data as soon as you provide its location. On the following page, you can select 'Training Algorithms' of your choice or select Auto, in which case Autopilot will automatically consider the set of algorithms best suited to the data.
For demo purposes, I will choose the Hyperparameter optimization section and select Linear Models and XGBoost; you can experiment with any set of algorithms according to your use case. On the next page, you will find a toggle for 'Auto Deploy', which automatically selects the best model and deploys it as an endpoint. You can choose whether you want that; I will switch it off to showcase how you can deploy manually later. There are other 'Advanced Settings' you can adjust to your requirements; I will set 'Max Candidates' to 5 and leave the others as default. The larger the number of candidates, the more time it takes to create the models. Finally, review all the settings and click 'Create experiment'. It may take several minutes depending on the number of candidates and the data size. Autopilot will handle pre-processing, feature engineering, model tuning, candidate definition generation, and the explainability and insights reports. (The whole process can be seen in the gif below.)
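Behind the scenes, the Studio wizard builds a request to SageMaker's CreateAutoMLJob API, so the same experiment can be launched from code. Below is a sketch of that request body matching the choices above (5 max candidates); the job name, S3 paths, target column, and IAM role ARN are placeholder values you would replace with your own.

```python
def automl_job_request(job_name, input_uri, output_uri, target, role_arn,
                       max_candidates=5):
    """Assemble a CreateAutoMLJob request mirroring the Studio wizard."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                            "S3Uri": input_uri}},
            "TargetAttributeName": target,  # the target variable from Step 2
        }],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "AutoMLJobConfig": {
            # Equivalent of 'Max Candidates' in the Advanced Settings
            "CompletionCriteria": {"MaxCandidates": max_candidates},
        },
        "RoleArn": role_arn,
    }

def launch(request):
    import boto3  # deferred import so the sketch can be read without AWS set up
    boto3.client("sagemaker").create_auto_ml_job(**request)

# Placeholder values for illustration:
request = automl_job_request(
    "demo-automl-job",
    "s3://my-automl-demo-bucket/input/data.csv",
    "s3://my-automl-demo-bucket/output/",
    "label",
    "arn:aws:iam::123456789012:role/SageMakerRole",
)
```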
Step-3 (Analyzing the generated experiment): After the training is complete, you can see all the models Autopilot has trained, and it will mark the Best Model automatically. Just above the models, it provides a metrics summary and the name of the best model. You can also find the job details by switching to the 'Job Profile' tab, where all the information is provided regarding the problem type, input and output data config, location, attributes, the best candidate model, etc. Back in the 'Trials' tab, to the right of the best model summary you will find 'View model details'; click there to open the model details tab, which has four tabs: 'Explainability', 'Performance', 'Artifacts', and 'Network'. Go ahead and explore all four. In the 'Explainability' tab, you will find feature importance, metrics, and hyperparameter details; these can be downloaded as a feature report from the top right corner. In the 'Performance' tab, you will find the metrics table and the model quality report, which can be downloaded in the same way. In the 'Artifacts' tab, you will find links to all the artifacts: input data, feature engineering code, the model, the algorithm model, training and validation splits, etc. The 'Network' tab shows network details. If you go back to the 'Trials' tab, you will find 'Open candidate generation notebook' and 'Open data exploration notebook' above the 'View model details' button; open these notebooks to find the code and explanations. In the 'Data Exploration Notebook' you will find detailed explanations of target analysis, cross-column statistics, missing values, duplicate rows, descriptive stats, etc., and in the 'Candidate Generation Notebook' the code and details about generating, selecting, and executing candidate pipelines, model selection, deployment, etc. (Experiment exploration can be seen in the gif below.)
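The leaderboard shown in the 'Trials' tab is also available programmatically through the ListCandidatesForAutoMLJob API. A small sketch follows; it assumes the job's objective metric is one to maximize (e.g. accuracy or F1), so for a minimized metric such as MSE you would use min instead, and the candidate names and values below are made up for illustration.

```python
def best_candidate(candidates):
    """Return the candidate with the highest final objective metric value."""
    return max(candidates,
               key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"])

def fetch_candidates(job_name):
    """Pull the leaderboard entries for a finished Autopilot job."""
    import boto3  # deferred import so the sketch can be read without AWS set up
    sm = boto3.client("sagemaker")
    return sm.list_candidates_for_auto_ml_job(
        AutoMLJobName=job_name)["Candidates"]

# Fake leaderboard entries, shaped like the API response, for illustration:
demo_candidates = [
    {"CandidateName": "xgboost-1",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "F1", "Value": 0.91}},
    {"CandidateName": "linear-1",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "F1", "Value": 0.84}},
]
```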
Step-4 (Deploying and invoking the endpoint for real-time prediction): In the 'Trials' tab, select the best model; at the top right corner you will find the 'Share model' and 'Deploy model' buttons. Click 'Deploy model', provide a custom endpoint name, and select the instance type; I will choose 'ml.t2.medium' with an instance count of '1' and click 'Deploy model'. Now, if you go to the AWS SageMaker service and scroll down the left sidebar to the Inference dropdown, open Endpoints and you will see that the endpoint is being created. Once it is InService, you can use it to get predictions. Back in Studio, you will see different tabs in the endpoint interface: data quality, model quality, model bias, AWS settings, etc. You can set up input data capture and model quality capture to monitor your ML model as requests arrive at the endpoint. To generate predictions, let's open a notebook: from the top left go to File -> New -> Notebook. It will open an empty notebook and prompt you to select the notebook environment; select the image 'Data Science', the kernel 'Python 3', and the instance 'ml.t3.medium'. I already have a notebook created with code set up to invoke the endpoint and generate predictions in real time using a custom input for demo purposes. You can also make batch predictions and set up data capture to assess your model performance and data and concept drift over time. (The whole process from endpoint deployment to real-time prediction can be seen in the gif below.)
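The notebook code for real-time prediction boils down to a single invoke_endpoint call through the sagemaker-runtime client. A minimal sketch, assuming the model was trained on CSV input; the endpoint name and feature row are placeholders:

```python
def to_csv_row(features):
    """Serialize one feature row into the text/csv body the endpoint expects."""
    return ",".join(str(f) for f in features)

def predict(endpoint_name, features):
    """Send one row to the deployed endpoint and return the raw prediction."""
    import boto3  # deferred import so the sketch can be read without AWS set up
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,      # the custom name you gave at deploy time
        ContentType="text/csv",
        Body=to_csv_row(features),
    )
    return response["Body"].read().decode("utf-8")

# Usage (placeholder endpoint and made-up feature values):
# prediction = predict("my-autopilot-endpoint", [34, 1, 0.5, "yes"])
```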
Important!! You need to delete the endpoint and all the instances, apps, and kernels running in your Studio; if not, AWS will bill you for these resources. To save yourself from unintentional extra costs, go to the AWS SageMaker Console -> Inference -> Endpoints, select the endpoint, go to Actions, and hit Delete. Back in Studio, from the left sidebar select the running resources and click the Quit icon. (The process can be seen below.)
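The endpoint deletion can also be scripted, which is handy for demos you tear down often. A sketch assuming a placeholder endpoint name; note that deleting the endpoint alone leaves its endpoint config behind, so this looks the config up first and removes both:

```python
def cleanup(endpoint_name):
    """Delete the endpoint and its endpoint config to stop billing."""
    import boto3  # deferred import so the sketch can be read without AWS set up
    sm = boto3.client("sagemaker")
    # The config name can differ from the endpoint name, so look it up first.
    config_name = sm.describe_endpoint(
        EndpointName=endpoint_name)["EndpointConfigName"]
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)

# Usage (placeholder name): cleanup("my-autopilot-endpoint")
```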
This should help you get started with the Autopilot service practically and leverage it for your own use case. If you found this helpful, follow for more such content. You can also connect with me on LinkedIn to discuss Data Science and MLOps.