Using AWS SageMaker Ground Truth for Data Annotation/Labeling in an NER Task
AWS provides a wide variety of cloud services, from infrastructure to machine learning, that help organizations lower capital investment and shorten time to value while keeping productivity consistent. In the data science and AI domain, AWS provides SageMaker, a managed service to easily build, train, and deploy ML models into a production-ready hosted environment.
Data labeling is an integral part of the machine learning workflow when working with text, image, or video files. It adds an informative label that gives context to unstructured data, so that an ML algorithm can learn from it. AWS SageMaker Ground Truth is the part of the AWS SageMaker service that helps with labeling data. It provides several labeling tasks, which can be seen in the image below; you can read about all of them in detail here.
In this article, I will show you how to set up Ground Truth for labeling in a Named Entity Recognition (NER) task. Other labeling tasks can be set up in a similar manner. Before getting started, make sure you have an AWS free tier account; if not, go ahead and create one here. Under the free tier, the first 500 objects labeled per month with Ground Truth are free for the first 2 months. You can find more about the pricing details for Ground Truth here.
Setting up the text labeling job for NER using AWS SageMaker Ground Truth
Step-1: We need to set up an S3 bucket and store the data that we want to have annotated. Search for ‘S3’ in the services, open S3, and click on Create bucket. Give the bucket a name of your choice and leave the other settings at their defaults. You can adjust the public access settings according to your requirement; I will be leaving everything at default. After the bucket is created, upload your file into it. You can see how to do all this in the GIF below.
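If you would rather script Step-1 than click through the console, a minimal sketch with boto3 could look like the one below. The function name, the bucket name, and the file paths are placeholders I made up for illustration. The S3 client is passed in as a parameter, so the function itself runs without boto3 being imported at the top; the real calls need AWS credentials.

```python
def upload_for_labeling(s3_client, bucket, local_path, key, region="us-east-1"):
    """Create a bucket and upload one file; return the object's S3 URI.

    `s3_client` is expected to be a boto3 S3 client (or anything exposing
    the same `create_bucket` and `upload_file` methods).
    """
    if region == "us-east-1":
        # us-east-1 is the default location and takes no location constraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        # Every other region must be named explicitly.
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    # upload_file(Filename, Bucket, Key)
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (placeholder names; needs AWS credentials and the boto3 package):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   upload_for_labeling(s3, "my-ner-labeling-bucket", "data.txt", "input/data.txt")
```

Bucket names are globally unique, so the placeholder name above will almost certainly need to be changed.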
Step-2: In the services, search for AWS SageMaker and click on it. It will take you to a page that looks something like the one below. On the left sidebar, you will see Ground Truth; click the drop-down to find Labeling jobs.
Click on Labeling jobs and it will take you to the homepage, where you will see a button named ‘Create labeling job’. Give the job a custom name and select the S3 bucket that we created above. For the data type, select Text. There are other options as well, which you can explore, but for this use case we will be using text. Create a new IAM role and, under ‘Specific S3 bucket’, type the name of the S3 bucket that you created again. After the role has been successfully created, click on ‘Complete data setup’. This step is important, so do not forget it. Then scroll down and select NER as the task type.
On the next page (Step 2 of the wizard), you will see three worker types. Select Private, give the team a custom name, and provide the emails of all the people to whom you want to assign this labeling task. Set the task timeout and expiration, and fill in the other information similarly. At the bottom of the page you will see the NER labeling tool. Here you need to provide the description of the task, which will be shown to the people working on it. You also need to add your custom labels; in this case, I will be adding four labels: Job Role, Monetary Value, Organization, and Location. Finally, click on ‘Create’. The whole process can be seen in the GIF below.
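The console stores the labels you add in a small JSON file called the label category configuration, so you normally never write it by hand. If you ever need to create the job through the API instead, a sketch of that file for the four labels used here might look like this. The helper name is my own, and the "document-version" value is the one I have seen in the AWS documentation, so treat both as assumptions and check them against the current docs.

```python
import json


def build_label_category_config(labels):
    """Return a label category configuration dict for the given label names."""
    return {
        # Version string as shown in AWS documentation examples (assumption).
        "document-version": "2018-11-28",
        "labels": [{"label": name} for name in labels],
    }


config = build_label_category_config(
    ["Job Role", "Monetary Value", "Organization", "Location"]
)
print(json.dumps(config, indent=2))
```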
Above, we selected the Private worker type and gave the emails of the people to whom we want to assign the labeling task. In my case, I gave my other email ID in order to simulate the scenario and show you how it looks from the perspective of a labeler. Once the labeling job is created, its status shows ‘in progress’. This means that an email with the login credentials has been sent to the labeler to whom we assigned the task. The email looks like this.
After you open the link in the email, log in with the provided credentials, and reset your password, you will see a screen like this.
Also, if you go to AWS SageMaker and select Labeling workforces on the left sidebar, the Private section shows the worker we added previously, along with the email ID and the status as verified.
Next, I will show you how to perform the labeling task from the perspective of a labeler. This can be seen in the GIF below.
Now, if you go to your labeling job and open it, it will tell you that labeling has been completed and display the related information.
AWS Ground Truth creates two manifest files. The ‘input manifest’ is created when you create the labeling job, and the ‘output manifest’ contains the metadata and the labels assigned by the workers. Both files are stored in the S3 bucket that you specified while creating the labeling job. You can download them and look at the information inside.
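To give an idea of what the input manifest contains, here is a small sketch that builds one by hand. It assumes the JSON Lines layout that Ground Truth uses for text data, where each line is a JSON object with the text under a "source" key. The function name and the sample sentences are made up for illustration.

```python
import json


def build_input_manifest(texts):
    """Return manifest content: one {"source": ...} JSON object per line."""
    lines = []
    for text in texts:
        text = text.strip()
        if text:  # skip blank lines, they would become empty tasks
            lines.append(json.dumps({"source": text}))
    return "".join(line + "\n" for line in lines)


manifest = build_input_manifest([
    "Amazon is hiring a Data Scientist in Seattle",
    "",
    "The role pays $150,000 per year",
])
print(manifest, end="")
```

Each non-empty sentence becomes one object to label, which is also how objects are counted for pricing.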
The output manifest file looks like this. You can see the labeled information and also the confidence, which is 0 in our case because we set up private workers and the labeling was done by a human. Ground Truth also provides a way to label data automatically using a machine learning model. In that case, it would report the confidence with which the model labeled the data.
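Once you have downloaded the output manifest, you can pull the labeled spans out of it with a few lines of Python. The sketch below assumes the record layout I have seen for NER jobs: the annotations sit under a key named after the label attribute (usually the job name), with character offsets in "startOffset" and "endOffset" (end treated as exclusive), and the confidences sit under the matching "-metadata" key. The attribute name "ner-demo" and the sample record are invented for illustration, so compare them against your own file.

```python
import json


def extract_entities(manifest_line, label_attribute):
    """Return (span_text, label, confidence) tuples for one manifest line."""
    record = json.loads(manifest_line)
    text = record["source"]
    entities = record[label_attribute]["annotations"]["entities"]
    metadata = record.get(f"{label_attribute}-metadata", {})
    confidences = [e.get("confidence") for e in metadata.get("entities", [])]

    results = []
    for i, entity in enumerate(entities):
        # Offsets are character positions into the source text.
        span = text[entity["startOffset"]:entity["endOffset"]]
        confidence = confidences[i] if i < len(confidences) else None
        results.append((span, entity["label"], confidence))
    return results


# Invented sample record, shaped like one line of an output manifest.
sample_line = json.dumps({
    "source": "Amazon is hiring a Data Scientist in Seattle",
    "ner-demo": {
        "annotations": {
            "entities": [
                {"label": "Organization", "startOffset": 0, "endOffset": 6},
                {"label": "Job Role", "startOffset": 19, "endOffset": 33},
                {"label": "Location", "startOffset": 37, "endOffset": 44},
            ]
        }
    },
    "ner-demo-metadata": {
        "human-annotated": "yes",
        "entities": [{"confidence": 0.0}, {"confidence": 0.0}, {"confidence": 0.0}],
    },
})

for span, label, confidence in extract_entities(sample_line, "ner-demo"):
    print(f"{label}: {span!r} (confidence {confidence})")
```

To process a whole file, call the function once per line of the downloaded manifest.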
You can see in your AWS SageMaker console that now the labeling job status is shown as complete.
There is also AWS Ground Truth Plus which helps to create high-quality training datasets without having to build labeling applications or manage labeling workforces on your own. You can read more about it here.
Similarly, you can go ahead and set up other labeling tasks according to your use case. This concludes the article. If you found this helpful, follow for more such content. You can also connect with me on LinkedIn to discuss more on Data Science and MLOps, or in case you have any questions.