Deployment Safeguards

Deployment guardrails apply to synchronous (real-time) inference endpoints. They control how traffic shifts to new models and support automatic rollbacks.

  • Canary deployment: It is a deployment strategy that gradually rolls out new changes to a small subset of users before making it available to the entire user base. This allows us to monitor the performance and behavior of the new changes in a controlled environment and quickly roll back if any issues arise.
  • Blue-green deployment: It is a deployment strategy that involves maintaining two separate environments (blue and green) for the application. The blue environment is the current production environment, while the green environment is the new version of the application. When the new version is ready, we can switch traffic from the blue environment to the green environment, allowing for a seamless transition with minimal downtime.
  • A/B testing: It is a method of comparing two versions of a web page or application against each other to determine which one performs better. It involves randomly assigning users to either the control group (A) or the treatment group (B) and measuring the performance of each version based on predefined metrics.

Shadow tests compare the performance of a shadow variant against the production variant; we can monitor the results in the SageMaker console and decide when to promote the shadow variant.
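As a hedged sketch of how a guardrail-controlled rollout might be configured, the dict below mirrors the `DeploymentConfig` structure that boto3's `update_endpoint` accepts: a blue/green update that first shifts a 10% canary, waits, then shifts the rest, rolling back automatically if a CloudWatch alarm fires. The endpoint, config, and alarm names are made up for illustration.

```python
# Sketch of a blue/green canary rollout with auto-rollback; the alarm
# name "HighModelLatency" is an assumption for this example.
deployment_config = {
    "BlueGreenUpdatePolicy": {
        "TrafficRoutingConfiguration": {
            "Type": "CANARY",
            # Shift 10% of capacity to the new fleet first...
            "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
            # ...then wait 10 minutes before shifting the remainder.
            "WaitIntervalInSeconds": 600,
        },
        "TerminationWaitInSeconds": 300,
    },
    # Roll back automatically if this alarm fires during the deployment.
    "AutoRollbackConfiguration": {
        "Alarms": [{"AlarmName": "HighModelLatency"}]
    },
}

# With boto3 this would be passed as (not executed here):
# boto3.client("sagemaker").update_endpoint(
#     EndpointName="my-endpoint",
#     EndpointConfigName="my-new-config",
#     DeploymentConfig=deployment_config)
```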

SageMaker in Production

All models in SageMaker are hosted in Docker containers:

  • Pre-built deep learning images
  • Pre-built scikit-learn and Spark ML images
  • Pre-built TensorFlow, MXNet, Chainer, and PyTorch images, with distributed training via Horovod or parameter servers
  • Our own training and inference code, or a pre-built image extended for a specific purpose. This way we can use any script, library, or framework we want, and the Docker image carries all the dependencies we need.
graph TD
    subgraph AWS_Cloud [ML Environment]
        Training_Jobs[Training jobs]
        
        Model_Training[Model Training <br/>'Docker container']
        
        Model_Deployment[Model Deployment <br/>'Docker container']
        
        Models[Models]
        Endpoints[Endpoints]
    end

    S3_Training[(S3 Training data)] --> Model_Training
    
    ECR[Amazon ECR <br/>'Docker images'] <--> Model_Training
    ECR --> Model_Deployment

    Model_Training --> Training_Jobs
    Training_Jobs --> S3_Artifacts[(S3 Model artifacts)]
    
    S3_Artifacts --> Model_Deployment
    Model_Deployment --> Models
    Models --> Endpoints
    
    %% External Traffic
    Endpoints <--> External_User(( ))

Docker containers are created from images, images are built from a Dockerfile , and images are stored in Amazon ECR (Elastic Container Registry). When we create a training job, we specify the Docker image to use for the training job. The training job will pull the specified image from Amazon ECR and run it on the training data stored in S3. After the training job is completed, the model artifacts are stored in S3. We can then create a model deployment using the same Docker image and the model artifacts from S3. The model deployment will create an endpoint that can be accessed by external users for inference.

The actual structure of a SageMaker training container looks like this:

/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files> (the trained model is saved here)
├── code
│   └── <script files> (training code)
└── output
    └── failure
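To make the layout above concrete, here is an illustrative helper in the style of a training entrypoint: it reads `hyperparameters.json` and lists the files of one data channel. The base path is parameterized only so the sketch can be exercised outside a real container; inside SageMaker it is always `/opt/ml`.

```python
import json
from pathlib import Path

def load_training_inputs(base="/opt/ml", channel="train"):
    """Read hyperparameters and locate input data the way a SageMaker
    training container sees them under /opt/ml."""
    base = Path(base)
    # SageMaker writes hyperparameters as a JSON dict of strings.
    with open(base / "input" / "config" / "hyperparameters.json") as f:
        hyperparameters = json.load(f)
    # Each data channel is a directory of input files.
    data_dir = base / "input" / "data" / channel
    data_files = sorted(p.name for p in data_dir.iterdir())
    return hyperparameters, data_files

# After training, the model would be saved under /opt/ml/model, e.g.:
# joblib.dump(model, "/opt/ml/model/model.joblib")
```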

Structure of a Docker image

  • Workdir
    • nginx.conf: The nginx.conf file is a configuration file for the Nginx web server. It defines how Nginx should handle incoming requests, route them to the appropriate application, and manage various aspects of the server’s behavior, such as load balancing, caching, and security settings.
    • predictor.py : The predictor.py file is responsible for handling incoming inference requests and generating predictions using the trained model. It typically contains code to load the model, preprocess input data, and return predictions in the desired format.
    • serve/: This directory contains the necessary files and configurations to serve the model for inference. It may include a WSGI (Web Server Gateway Interface) application, such as a Flask or FastAPI app, that listens for incoming requests and routes them to the predictor.py for processing.
    • train/: This directory contains the training script and any necessary files for training the model. It may include a training script (e.g., train.py) that defines the training logic, data loading, and model architecture. Additionally, it may contain configuration files or scripts for setting up the training environment, such as installing dependencies or configuring hyperparameters.
    • wsgi.py: Invoked by the WSGI server to start the application. It typically imports the predictor.py and initializes the necessary components to handle inference requests.

Production Variants

We can test multiple models on live traffic using production variants. Variant weights tell SageMaker how to distribute traffic among them, so we could roll out a new iteration of a model at, say, a 10% variant weight, and once we are satisfied with its performance, increase the weight to 100% and make it the default model.

This lets us run A/B tests and validate performance in real-world settings.
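The weight-to-traffic relationship can be sketched in a few lines: each variant receives its weight divided by the sum of all weights. The variant names below are made up, and the same list doubles as the payload shape boto3's `update_endpoint_weights_and_capacities` expects.

```python
def traffic_share(variants):
    """Each production variant receives traffic proportional to its
    weight divided by the sum of all variant weights."""
    total = sum(v["DesiredWeight"] for v in variants)
    return {v["VariantName"]: v["DesiredWeight"] / total for v in variants}

# Roll out a new model at ~10% of traffic alongside the current one.
variants = [
    {"VariantName": "model-v1", "DesiredWeight": 9.0},
    {"VariantName": "model-v2", "DesiredWeight": 1.0},
]
shares = traffic_share(variants)  # model-v1 -> 0.9, model-v2 -> 0.1

# The same list is the boto3 payload (not executed here):
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(
#     EndpointName="my-endpoint", DesiredWeightsAndCapacities=variants)
```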

Managing Sagemaker Resources

Training and Inference Instance Types

Instance types control the compute resources available to a job. For training we might use GPU instances such as ml.p3 or ml.g4dn; for inference, a less compute-intensive CPU instance such as ml.c5 is often sufficient, since GPU instances can be very expensive.

EC2 spot instances can save up to 90% of the cost of on-demand instances. However, they can be interrupted by AWS with a two-minute warning when AWS needs the capacity back, so we should checkpoint the training job to S3 so it can resume after an interruption. Spot instances can also increase total training time, since we may have to wait for spot capacity to become available.
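The dict below sketches the managed-spot fields of a boto3 `create_training_job` request (bucket name is a placeholder): spot training is enabled, the maximum wait time (waiting for capacity plus training) must be at least the maximum runtime, and the checkpoint config is what lets an interrupted job resume.

```python
# Sketch of the managed-spot portion of a CreateTrainingJob request.
spot_training_fields = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,
        # Must be >= MaxRuntimeInSeconds; the difference is how long
        # we are willing to wait for spot capacity.
        "MaxWaitTimeInSeconds": 7200,
    },
    # Checkpoints written here let an interrupted job resume later.
    "CheckpointConfig": {
        "S3Uri": "s3://my-bucket/checkpoints/",
        "LocalPath": "/opt/ml/checkpoints",
    },
}
```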

Automatic Scaling

AWS supports automatic scaling for SageMaker endpoints. We can set up auto-scaling policies based on metrics such as CPU utilization, memory utilization, or custom metrics, and use CloudWatch to monitor these metrics and trigger scaling actions. This lets the endpoint automatically scale up or down with incoming traffic and resource utilization, handling varying workloads efficiently while optimizing costs. We should always load-test the auto-scaling configuration to make sure it works as expected.
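As a hedged sketch, these are the two payloads such a setup would pass to the Application Auto Scaling API via boto3: one registers the variant's instance count as a scalable target, the other attaches a target-tracking policy on invocations per instance. Endpoint/variant names and the target value are assumptions.

```python
# Target-tracking auto-scaling for an endpoint variant (sketch).
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,
}

scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Aim for ~1000 invocations per instance per minute (assumed).
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        # Scale out quickly, scale in conservatively.
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
}

# With boto3 (not executed here):
# client = boto3.client("application-autoscaling")
# client.register_scalable_target(**scalable_target)
# client.put_scaling_policy(**scaling_policy)
```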

SageMaker automatically attempts to distribute instances across availability zones, but this only works with more than one instance. It is therefore recommended to run multiple instances for each production endpoint, and if we use a VPC, to configure at least two subnets in different availability zones for high availability and fault tolerance.

Model Deployment

Deploying Models for Inference

There are three ways to deploy models for inference in SageMaker:

  • SageMaker JumpStart: Deploying pre-trained models to pre-configured endpoints with just a few clicks. It provides a library of pre-trained models and example notebooks to help us get started quickly.
  • ModelBuilder: Part of the SageMaker Python SDK, it provides a high-level interface for building and deploying machine learning models. It allows us to define our model architecture, training configuration, and deployment settings in a simple and intuitive way.
  • AWS CloudFormation: A service that lets us define and provision AWS infrastructure as code. We can use CloudFormation templates to automate the deployment of SageMaker models and endpoints. It is for advanced users who want more control over the deployment process and integration with other AWS services, and it lets us track changes in Git and redeploy the entire stack instantly.

Different inference options

  • Real-time inference: It is for applications that require low latency and immediate responses. It allows us to deploy our models as RESTful APIs, which can be accessed by external applications for real-time predictions.

  • Amazon SageMaker Serverless Inference: It is a fully managed service. It is ideal if workload has idle periods and uneven traffic over time, and can tolerate cold start latency.

  • Asynchronous Inference: It queues requests and processes them asynchronously. We use it for large payload sizes (up to 1 GB), long processing times, and near-real-time latency requirements.

  • Autoscaling: Dynamically adjusts compute resources for endpoints based on traffic.

  • SageMaker Neo: Compiles and optimizes models for specific target hardware, such as edge devices and AWS Inferentia chips, which can provide significant performance improvements for inference workloads.

Sagemaker Serverless Inference

We need to specify the container, memory requirements, and concurrency requirements. The underlying infrastructure is automatically provisioned and scaled based on incoming traffic. It is charged per invocation and scales down to zero when there are no requests. It is monitored via CloudWatch metrics such as ModelSetupTime, Invocations, and MemoryUtilization. In short, it provides fully managed serverless endpoints for machine learning inference with pay-per-use pricing.
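A minimal sketch of those settings, shaped like the `ProductionVariants` entry of a boto3 `create_endpoint_config` request (model and config names are made up): instead of an instance type, a serverless variant carries a `ServerlessConfig` with memory and concurrency.

```python
# Serverless endpoint config sketch: memory (1024-6144 MB, in 1 GB
# steps) and max concurrent invocations replace the instance type.
serverless_endpoint_config = {
    "EndpointConfigName": "my-serverless-config",
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                # Requests beyond this concurrency are throttled.
                "MaxConcurrency": 10,
            },
        }
    ],
}
```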

SageMaker Inference Recommender

SageMaker Inference Recommender is a tool that helps us optimize the performance of our machine learning models for inference. It provides recommendations on how to configure our SageMaker endpoints to achieve the best performance based on our specific use case and requirements.

For instance recommendations, we register the model in the model registry, and then the Inference Recommender runs a series of tests on different instance types and configurations to determine the optimal setup for our model. The metrics collected during the tests include latency, throughput, and cost. Running load tests on the recommended instance types takes about 45 minutes to complete. There are also endpoint recommendations.

For endpoint recommendations, we can run custom load tests. We specify the number of instances, traffic patterns, latency requirements, throughput requirements, and cost constraints. The Inference Recommender then analyzes the endpoint's performance under different configurations and recommends how to optimize it for our specific use case, e.g., the number of instances, auto-scaling policies, and initial variant weights. The process may take about 2 hours.

Inference Pipelines

Inference pipelines allow us to chain a linear sequence of 2-15 containers together to perform inference. For example:

  • Container 1 (Pre-processing): Takes raw JSON, fills in missing values, and scales numbers (e.g., using Scikit-learn).
  • Container 2 (Prediction): Takes the cleaned data and runs the actual ML model (e.g., XGBoost or PyTorch).
  • Container 3 (Post-processing): Takes the raw probability (0.87) and converts it into a human-readable string (“High Risk”).

SageMaker supports Spark ML (via Glue or EMR) and scikit-learn containers in pipelines. It uses the MLeap serialization format for Spark ML models to enable high-performance deployment of those models directly within SageMaker.

Inference pipelines can handle both real-time inference and batch inference.
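The three-container example above can be sketched as a boto3 `create_model` request with a `Containers` list; the containers execute in order, each one's output becoming the next one's input. Image URIs, S3 paths, and the role ARN below are placeholders.

```python
# Sketch of a CreateModel request for a 3-container inference pipeline.
pipeline_model = {
    "ModelName": "risk-pipeline",
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    # Containers run in sequence: preprocess -> predict -> postprocess.
    "Containers": [
        {   # 1. Scikit-learn pre-processing
            "Image": "<sklearn-image-uri>",
            "ModelDataUrl": "s3://my-bucket/preprocessor.tar.gz",
        },
        {   # 2. XGBoost prediction
            "Image": "<xgboost-image-uri>",
            "ModelDataUrl": "s3://my-bucket/model.tar.gz",
        },
        {   # 3. Post-processing to a human-readable label
            "Image": "<postprocess-image-uri>",
            "ModelDataUrl": "s3://my-bucket/postprocessor.tar.gz",
        },
    ],
}
```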

SageMaker Model Monitor

The idea is to get alerts on quality deviations on the deployed models (via CloudWatch).

It can visualize the data distribution and detect data drift, model performance drift, and feature importance drift. E.g., if the distribution of input data changes significantly from the training data, the model may no longer perform well and may need to be retrained or updated. For instance, if salaries have risen significantly over the last 5 years due to inflation, a model trained on data from 5 years ago may not perform well on current data.

It can also detect anomalies, outliers, and new features.

The monitoring data is stored in S3 and monitoring jobs are scheduled via a monitoring schedule.

Metrics are emitted to CloudWatch, where we can set up alarms to send notifications and then take corrective actions, such as retraining the model or auditing the data.

Model Monitor also integrates with TensorBoard, QuickSight, and Tableau, and we can visualize the monitoring data in SageMaker Studio.

SageMaker Clarify

SageMaker Clarify is a tool that helps us detect bias in our machine learning models. It provides insights into the fairness of our models by analyzing the data and the model’s predictions.

It can identify potential bias in the training data, such as imbalanced classes or underrepresented groups, and it can also analyze the model’s predictions to identify any disparities in performance across different demographic groups. This allows us to take corrective actions to mitigate bias and ensure that our models are fair and equitable.

It also helps us understand the importance of different features in our model and how they contribute to the predictions, which can help us identify any potential sources of bias and take steps to address them.

Monitoring Types

  • Drift in data quality: It can detect changes in the distribution of input data compared to the baseline we created; here "quality" refers to the statistical properties of the features.
  • Drift in model performance: It can detect changes in the performance of the model over time, such as changes in accuracy, precision, recall, or other relevant metrics. This can help us identify when the model’s performance is degrading and take corrective actions, such as retraining the model or updating it with new data.
  • Bias drift: It can detect changes in the fairness of the model’s predictions across different demographic groups. This can help us identify when the model is becoming less fair and take corrective actions to mitigate bias and ensure that our models are equitable.
  • Feature importance drift: It can detect changes in the importance of different features in the model’s predictions. This can help us identify when certain features are becoming more or less influential in the model’s predictions and take corrective actions to address any potential issues. It is based on Normalized Discounted Cumulative Gain (NDCG) score and this compares feature ranking of training vs. live data.
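To make the NDCG-based comparison concrete, here is a toy sketch of the idea (Model Monitor's exact computation may differ): each feature's relevance comes from its position in the training-time ranking, and we score the live ranking against the ideal (training) ordering, so swaps near the top of the list lower the score most.

```python
import math

def ndcg(training_ranking, live_ranking):
    """NDCG score comparing a live feature-importance ranking to the
    training baseline (1.0 means the order is unchanged)."""
    n = len(training_ranking)
    # The feature ranked first at training time gets highest relevance.
    relevance = {f: n - i for i, f in enumerate(training_ranking)}
    # Discounted cumulative gain of the live ordering...
    dcg = sum(relevance[f] / math.log2(i + 2)
              for i, f in enumerate(live_ranking))
    # ...normalized by the ideal (training) ordering.
    ideal = sum(relevance[f] / math.log2(i + 2)
                for i, f in enumerate(training_ranking))
    return dcg / ideal

# Identical rankings score 1.0; reordering lowers the score.
baseline = ["income", "age", "tenure"]
same = ndcg(baseline, ["income", "age", "tenure"])     # 1.0
drifted = ndcg(baseline, ["tenure", "age", "income"])  # < 1.0
```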

Model Monitor Data Capture

Data Capture logs inputs to the endpoint and inference outputs to S3 as JSON files. These data can be used for further training, debugging, and monitoring, and captured metrics can be automatically compared to the baseline. It is supported for both real-time and batch monitoring modes, and for both Python (boto3) and the SageMaker Python SDK. The captured inference data may be encrypted.
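The capture settings above map onto the `DataCaptureConfig` block of a boto3 `create_endpoint_config` request, sketched below with a placeholder bucket: sampling a fraction of requests keeps storage costs bounded while still capturing both request and response payloads.

```python
# Sketch of the DataCaptureConfig block of an endpoint config.
data_capture_config = {
    "EnableCapture": True,
    # Sample 20% of requests to limit storage costs (assumed value).
    "InitialSamplingPercentage": 20,
    "DestinationS3Uri": "s3://my-bucket/data-capture/",
    # Capture both the request payload and the model's response.
    "CaptureOptions": [
        {"CaptureMode": "Input"},
        {"CaptureMode": "Output"},
    ],
}
```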

MLOps with SageMaker and Kubernetes

SageMaker natively supports whole-lifecycle model management: data preprocessing, model training, evaluation, registration, deployment, and monitoring. However, some organizations may prefer Kubernetes for their MLOps workflows because of its flexibility and scalability, or because part of the workflow runs on on-premises infrastructure. In that case, SageMaker can be integrated with Kubernetes-based ML infrastructure. There are two main approaches:

  1. SageMaker Operators for Kubernetes: AWS provides SageMaker Operators for Kubernetes, which allows us to manage SageMaker resources directly from Kubernetes.
  2. Components for Kubeflow Pipelines: We can use Kubeflow Pipelines to orchestrate our MLOps workflows and integrate SageMaker as a component within those pipelines. This allows us to leverage the capabilities of both platforms and create a seamless workflow for our machine learning projects.

These methods enable hybrid ML workflows (on-premises and cloud) and allow us to leverage the strengths of both platforms for our MLOps needs. So we can use Kubernetes for orchestration and SageMaker for model training, deployment, and monitoring, creating a powerful and flexible MLOps workflow that can scale with our needs.

flowchart LR

    subgraph Kubernetes Cluster
        A[EKS Control Plane]

        subgraph Worker Nodes
            B[EC2 Node<br>Kubelet]
            C[Kubernetes Apps]
        end

        D[SageMaker Operator]
    end

    subgraph SageMaker Platform
        E[Training Jobs]
        F[Batch Transform]
        G[Inference Endpoints]
    end

    A --> B
    A --> D

    B --> C

    D --> E
    D --> F
    D --> G

There are also SageMaker components for Kubeflow Pipelines, which let us use SageMaker for specific steps in our Kubeflow pipelines, such as processing, hyperparameter tuning, training, and inference.

SageMaker Projects

SageMaker Projects is SageMaker Studio's native MLOps solution with CI/CD:

  1. Build images
  2. Prepare data, feature engineering
  3. Train models
  4. Evaluate models
  5. Deploy models
  6. Monitor and update models

It uses code repositories for building and deploying ML solutions, and SageMaker Pipelines to define the workflow steps.

flowchart TB

%% =======================
%% MODEL BUILD PIPELINE
%% =======================

DS[Data Scientist commits code] --> Repo1[Repository #1<br/>Model building code]

Repo1 --> EB1[Amazon EventBridge]
EB1 --> model_build
subgraph model_build [CodePipeline Model Build]
CB1[AWS CodeBuild<br/>Run SageMaker Pipeline] --> SM_PIPE
subgraph SM_PIPE [SageMaker Pipelines]
    direction TB
    P1[Processing Job<br/>Data Preprocessing]
    T1[Training Job]
    P2[Processing Job<br/>Model Evaluation]
    R1[Register Model]

    P1 --> T1 --> P2 --> R1
end

P2 --> S3[(Amazon S3<br/>Model Artifacts)]
R1 --> REG[Model Registry<br/>SageMaker Model Registry]
end
DS -- Data Scientist approves model --> REG

%% =======================
%% MODEL DEPLOY PIPELINE
%% =======================

REG --> EB2[Amazon EventBridge]

MLOPS[MLOps Engineer updates deployment] --> Repo2[Repository #2<br/>Model deployment code]
Repo2 --> model_deploy
EB2 --> model_deploy
subgraph model_deploy [CodePipeline Model Deploy]

CB2[AWS CodeBuild<br/>Build CloudFormation templates for deployment]
CB2 --> CF1[AWS CloudFormation<br/>Deploy Staging Endpoint]

CF1 --> STAGING[SageMaker Hosting<br/>Staging Endpoint]
STAGING --> TEST[AWS CodeBuild<br/>Test Staging Endpoint]

TEST --> APPROVAL{Manual Approval}

APPROVAL -->|Approved| CF2[AWS CloudFormation<br/>Deploy Production Endpoint]
CF2 --> PROD[SageMaker Hosting<br/>Production Endpoint]
end

ECS

EC2 Launch Type

ECS (Elastic Container Service) allows us to run and manage Docker containers on a cluster of EC2 instances. When we launch Docker containers on AWS, we launch ECS tasks on ECS clusters.

When we use the EC2 launch type, we manage the underlying EC2 instances ourselves. Each EC2 instance must run the ECS agent to register with the ECS cluster; AWS then takes care of starting and stopping containers.

graph TD
    subgraph ECS_Cluster ["Amazon ECS / ECS Cluster"]
        direction TB
        
        NewContainer["New Docker Container"]
        
        subgraph EC2_1 ["EC2 Instance"]
            direction TB
            C1_1["Docker Container"]
            C1_2["Docker Container"]
            Agent1["ECS Agent (Docker)"]
        end
        
        subgraph EC2_2 ["EC2 Instance"]
            direction TB
            C2_1["Docker Container"]
            C2_2["Docker Container"]
            C2_3["Docker Container"]
            Agent2["ECS Agent (Docker)"]
        end
        
        subgraph EC2_3 ["EC2 Instance"]
            direction TB
            C3_1["Docker Container"]
            C3_2["Docker Container"]
            Agent3["ECS Agent (Docker)"]
        end
        
        %% Connections
        NewContainer --> EC2_1
        NewContainer --> EC2_2
        NewContainer --> EC2_3
    end

Fargate Launch Type

We just launch Docker containers on AWS, and AWS runs the ECS tasks for us based on the CPU/RAM we request. To scale up, we simply launch more tasks and AWS takes care of the rest. Fargate is a serverless compute engine for containers.
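As a hedged sketch, the dict below has the shape of a boto3 `register_task_definition` request for the Fargate launch type (family name and image are placeholders): Fargate requires `awsvpc` networking and task-level CPU/memory expressed as strings, instead of per-instance capacity planning.

```python
# Sketch of an ECS task definition for the Fargate launch type.
fargate_task_definition = {
    "family": "inference-service",
    "requiresCompatibilities": ["FARGATE"],
    # Fargate mandates awsvpc networking and task-level CPU/memory
    # (512 CPU units = 0.5 vCPU; memory in MiB, both as strings).
    "networkMode": "awsvpc",
    "cpu": "512",
    "memory": "1024",
    "containerDefinitions": [
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest",
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "essential": True,
        }
    ],
}
```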

IAM Roles for ECS