I am going to list some exercises to practice the concepts and to prepare for the AWS Certified Machine Learning - Associate exam.
Data Preparation
Pre-Training Data Preparation
Suppose we store historical data in .csv files in Amazon S3, only some of the rows and columns in the .csv files are populated, and the columns are not labeled. What is the best way to prepare the data for training a machine learning model?
Answer: 1, Use AWS Glue to create a crawler that can automatically discover the schema of the .csv files and create a table in the AWS Glue Data Catalog. 2, Use AWS Glue DataBrew for data cleaning and feature engineering. 3, Store the cleaned and transformed data in a format suitable for training, such as Parquet or ORC, in Amazon S3.
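Step 1 above could be sketched as follows. This is a minimal sketch of the crawler request parameters; the bucket path, IAM role ARN, and database name are placeholder assumptions, and the actual API call is shown commented out.

```python
# Sketch: parameters for an AWS Glue crawler that discovers the schema of
# .csv files in S3 and registers a table in the Glue Data Catalog.
# Role ARN, bucket path, and database name below are placeholders.
crawler_config = {
    "Name": "historical-csv-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    "DatabaseName": "historical_data",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/historical/"}]},
}

# With real credentials, this would be passed to the Glue API:
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_config)
print(crawler_config["Targets"]["S3Targets"][0]["Path"])
```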
The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3.
- Which AWS service or feature can aggregate the data from the various data sources? Answer: AWS Lake Formation. We need to aggregate data from three data sources, and AWS Lake Formation is purpose-built to create and manage centralized data lakes, with native capabilities to ingest and aggregate data from heterogeneous sources, including these three. It eliminates the need to build custom ingestion workflows for each source, providing a unified repository of aggregated data that can be accessed directly for ML preprocessing, feature engineering, and model training, which addresses the core requirement in the scenario.
- Propose a solution to automatically detect anomalies in the data and to visualize the results. Answer: Use Amazon SageMaker Data Wrangler to automatically detect the anomalies and to visualize the results. SageMaker Data Wrangler is purpose-built for ML data preparation, supports all listed data sources, includes native automatic anomaly detection capabilities that can identify outliers, class imbalance, and feature interdependencies, and provides integrated visualization tools that display anomaly detection results directly in the interface without requiring separate service integration.
- The training dataset includes categorical data and numerical data. The ML engineer must prepare the training dataset to maximize the accuracy of the model; propose a solution to handle the categorical data and the numerical data. Answer: Use SageMaker Data Wrangler to transform the categorical data into numerical data using one-hot encoding or target encoding, and to scale the numerical data using standardization or normalization techniques. SageMaker Data Wrangler provides built-in transformations for both categorical and numerical data, allowing us to easily prepare the dataset for training while maximizing the accuracy of the model.
- To address imbalanced data, propose a solution to handle the class imbalance in the training dataset. Answer: Use the Amazon SageMaker Data Wrangler balance data operation to oversample the minority class. The built-in balance operation includes multiple mitigation methods, such as SMOTE (Synthetic Minority Oversampling Technique), which addresses feature interdependencies by generating synthetic minority class samples instead of duplicating existing samples, reducing the risk of overfitting.
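As a plain-Python illustration of the ideas behind the transforms in the two answers above (this is a sketch of the underlying techniques, not how Data Wrangler implements them):

```python
import math
import random

def one_hot(values):
    """One-hot encode a list of categorical values."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Scale numerical values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]

def smote_like_sample(x, neighbor, rng):
    """Generate one synthetic minority sample by interpolating between a
    real minority sample and one of its neighbors (the core idea of SMOTE)."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

print(one_hot(["red", "blue", "red"]))    # columns ordered: blue, red
print(standardize([10.0, 20.0, 30.0]))    # mean 0, unit variance
print(smote_like_sample([1.0, 2.0], [3.0, 4.0], random.Random(42)))
```

The synthetic sample always lies on the line segment between the original minority point and its neighbor, which is what lets SMOTE respect relationships between features instead of duplicating rows.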
Feature Engineering
If we use SageMaker Feature Store to create and manage features to train a model, what are the steps to create a feature group and ingest data into it? Answer: 1, Create a feature group in SageMaker Feature Store, specifying the name, description, and the feature definitions (name, data type, and description for each feature). 2, Use the SageMaker Feature Store API or the SageMaker Python SDK to ingest data into the feature group. This can be done by calling the put_record API for individual records (the SageMaker Python SDK's FeatureGroup.ingest method handles batches). Each record should include the feature values and a unique identifier (e.g., record_id). 3, Once the data is ingested, you can use the feature group to retrieve features for training your machine learning model or for making predictions in real time.
Mapping the feature engineering techniques to the appropriate use cases, we have:
- Feature splitting
- Logarithmic transformation
- One-hot encoding
- Standardized distribution
Field Name → Feature Engineering Technique:
- City Name: One-hot encoding
- Type_Year (type of home and year the home was built): Feature splitting
- Size of the building (square meters): Logarithmic transformation
Logarithmic transformation is used to handle skewed data, while one-hot encoding is used to convert categorical variables into a format that can be provided to machine learning algorithms. Feature splitting is used to separate combined features into individual features for better analysis and modeling.
The size of the building (square meters) is likely to have a skewed distribution, so applying a logarithmic transformation can help to normalize the data and make it more compressed and balanced.
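A quick sketch of feature splitting and the logarithmic transformation on these fields (the value "Condo_1998" is an invented example):

```python
import math

def split_type_year(value):
    """Split a combined 'Type_Year' field into two separate features."""
    home_type, year = value.split("_")
    return home_type, int(year)

def log_transform(size_sq_m):
    """Compress a right-skewed size feature; log1p(x) = log(1 + x)
    handles a value of zero safely."""
    return math.log1p(size_sq_m)

print(split_type_year("Condo_1998"))       # ('Condo', 1998)
print(round(log_transform(120.0), 3))      # 4.796
```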
SageMaker
Training
The company is experimenting with consecutive training jobs. How can the company minimize infrastructure startup times for these jobs? Answer: Use SageMaker’s “Warm Pool” feature, which allows us to keep a pool of pre-initialized instances ready to handle training jobs. This can significantly reduce the startup time for consecutive training jobs, as the instances are already initialized and ready to use when a new job is submitted.
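A sketch of how a warm pool is configured, assuming the low-level CreateTrainingJob API; the instance type, count, and keep-alive period are placeholder choices.

```python
# Sketch: the ResourceConfig portion of a CreateTrainingJob request with a
# warm pool enabled. Instance type/count/volume size are placeholders.
resource_config = {
    "InstanceType": "ml.m5.xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 50,
    # Keep the provisioned instances alive for 10 minutes after the job
    # ends so the next training job can reuse them without a cold start.
    "KeepAlivePeriodInSeconds": 600,
}
print(resource_config["KeepAlivePeriodInSeconds"])
```

With the SageMaker Python SDK, the equivalent is the keep_alive_period_in_seconds argument on the Estimator.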
Versioning
The company needs to use a central model registry to manage different versions of models in the application. Which action will meet this requirement with the LEAST operational overhead? Answer: Use SageMaker Model Registry, which provides a central repository to manage different versions of models and their associated metadata. We can use model groups to organize models, for example:
Model Group: churn-prediction
- Version 1: Accuracy 0.82, Status Rejected
- Version 2: Accuracy 0.88, Status Approved
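Promoting a version in the registry can be sketched as the following request; the model package ARN is a placeholder and the boto3 call itself is shown commented out.

```python
# Sketch: request parameters to approve a registered model version in
# SageMaker Model Registry. The ARN below is a placeholder.
approve_request = {
    "ModelPackageArn": (
        "arn:aws:sagemaker:us-east-1:123456789012:"
        "model-package/churn-prediction/2"
    ),
    "ModelApprovalStatus": "Approved",
}
# boto3.client("sagemaker").update_model_package(**approve_request)
print(approve_request["ModelApprovalStatus"])
```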
Monitoring
We need to run an on-demand workflow to monitor bias drift for models that are deployed to real-time endpoints from the application.
Answer: Configure the application to invoke an AWS Lambda function that runs a SageMaker Clarify job on demand. The SageMaker Clarify job analyzes the model’s predictions and input data to detect any bias drift over time. The results can be stored in Amazon S3 or sent to Amazon CloudWatch for further analysis and alerting.
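A minimal Lambda handler sketch for this pattern. Here run_clarify_job is a hypothetical helper (shown commented out) that would submit the Clarify job, and the output S3 path is a placeholder; the handler only assembles and returns the job inputs.

```python
import json

def lambda_handler(event, context):
    """Sketch: on-demand bias-drift check invoked by the application."""
    endpoint_name = event["endpoint_name"]
    job_config = {
        "endpoint": endpoint_name,
        "analysis": "bias_drift",
        # Placeholder output location for the Clarify results.
        "output_s3_uri": f"s3://example-bucket/clarify/{endpoint_name}/",
    }
    # run_clarify_job(job_config)  # hypothetical helper: starts the job
    return {"statusCode": 200, "body": json.dumps(job_config)}

print(lambda_handler({"endpoint_name": "churn-endpoint"}, None))
```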
Deployment
The company must implement a manual approval-based workflow to ensure that only approved models can be deployed to production endpoints.
Answer: Use SageMaker Pipelines. When a model version is registered, we can use the AWS SDK to change the approval status to “Approved”. We can then configure the pipeline to deploy only models with an “Approved” status to production endpoints. This way, we can ensure that only approved models are deployed, and we can maintain control over the deployment process.
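The gate itself reduces to a status check on the registered model package. A sketch (PendingManualApproval is the initial status a new version gets in the registry):

```python
def should_deploy(model_package):
    """Deploy only model versions whose registry status is Approved."""
    return model_package.get("ModelApprovalStatus") == "Approved"

print(should_deploy({"ModelApprovalStatus": "Approved"}))               # True
print(should_deploy({"ModelApprovalStatus": "PendingManualApproval"}))  # False
```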
An ML engineer is configuring a CI/CD pipeline in AWS CodePipeline to deploy the model. The pipeline must run automatically when new training data for the model is uploaded to an Amazon S3 bucket.
Answer: 1, Invoke the pipeline when new data is uploaded to the specified S3 bucket. S3 event notifications cannot target CodePipeline directly, so enable Amazon EventBridge notifications on the bucket and create an EventBridge rule that matches “Object Created” events and starts the pipeline execution. 2, SageMaker retrains the model by using the data in the S3 bucket. 3, The pipeline deploys the model to a SageMaker endpoint.
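The EventBridge rule's event pattern for step 1 can be sketched as the following dictionary; the bucket name is a placeholder, and the rule's target would be the pipeline's StartPipelineExecution action.

```python
# Sketch: EventBridge event pattern matching S3 "Object Created" events
# for a specific (placeholder) training-data bucket.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["example-training-data-bucket"]}},
}
print(event_pattern["detail-type"])
```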
Inference
During a baseline analysis of model quality, the company recorded a threshold for the F1 score. After several months of no change, the model’s F1 score decreases significantly. What could be the reason for the reduced F1 score?
Answer: Concept drift occurred in the underlying customer data that was used for predictions. Concept drift refers to a change over time in the relationship between the input data and the target variable, which degrades model performance if the model is not retrained to reflect the new distribution.
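To make the metric concrete, the F1 score is the harmonic mean of precision and recall; the confusion-matrix counts below are invented to illustrate how drift can depress it.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Before drift: the model catches most positives with few false alarms.
print(round(f1_score(tp=80, fp=10, fn=20), 3))  # 0.842
# After concept drift: the old decision boundary misses far more positives.
print(round(f1_score(tp=50, fp=30, fn=50), 3))  # 0.556
```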
Bedrock
- AI term and description mapping
AI Term → Description:
- Token: Text representation of basic units of data processed by LLMs.
- Embedding: High-dimensional vectors that contain the semantic meaning of text.
- Retrieval Augmented Generation (RAG): Enrichment of information from additional data sources to improve a generated response.
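Embeddings are compared by vector similarity; RAG retrieval typically ranks documents this way. A toy sketch with invented 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors; values near 1.0
    mean the texts they represent are semantically close."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cat = [0.9, 0.1, 0.0]      # toy embedding for "cat"
kitten = [0.85, 0.2, 0.05]  # toy embedding for "kitten"
car = [0.0, 0.2, 0.9]      # toy embedding for "car"
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```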
Security and Identity
A company has a team of data scientists who use Amazon SageMaker notebook instances to test ML models. When the data scientists need new permissions, the company attaches the permissions to the individual role that was created with each SageMaker notebook instance. The company needs to centralize management of the team’s permissions; propose a solution to meet this requirement.
Answer: Create a single IAM role that has the necessary permissions. Attach the role to each notebook instance that the team uses.
One thing worth noting for exam context: This is different from using IAM Groups, which apply to IAM users, not to AWS services like SageMaker. SageMaker notebook instances authenticate via IAM roles (service roles), so the correct centralization mechanism here is a shared role — not a group.
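The shared service role's trust policy would allow SageMaker to assume it; a sketch of that policy document (permission policies would then be attached to this single role):

```python
# Sketch: trust policy for the one shared role that every notebook
# instance on the team would reference.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
print(trust_policy["Statement"][0]["Principal"]["Service"])
```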