Deployment Safeguards

Deployment Guardrails apply to asynchronous and real-time inference endpoints. They control how traffic is shifted to new models and can trigger automatic rollbacks when problems are detected.

  • Canary deployment: It is a deployment strategy that gradually rolls out new changes to a small subset of users before making it available to the entire user base. This allows us to monitor the performance and behavior of the new changes in a controlled environment and quickly roll back if any issues arise.
  • Blue-green deployment: It is a deployment strategy that involves maintaining two separate environments (blue and green) for the application. The blue environment is the current production environment, while the green environment is the new version of the application. When the new version is ready, we can switch traffic from the blue environment to the green environment, allowing for a seamless transition with minimal downtime.
  • A/B testing: It is a method of comparing two versions of a web page or application against each other to determine which one performs better. It involves randomly assigning users to either the control group (A) or the treatment group (B) and measuring the performance of each version based on predefined metrics.

Shadow tests compare the performance of a shadow variant against the production variant. We can monitor the results in the SageMaker console and decide when to promote the shadow variant.

SageMaker in Production

All models in SageMaker are hosted in Docker containers:

  • Pre-built deep learning images
  • Pre-built scikit-learn and Spark ML images
  • Pre-built TensorFlow, MXNet, Chainer, and PyTorch images, with distributed training available via Horovod or parameter servers.
  • Our own training and inference code, or an extension of a pre-built image for a specific purpose. This way we can use any script, library, or framework we want, and the Docker image bundles all the dependencies we need.
graph TD
    subgraph AWS_Cloud [ML Environment]
        Training_Jobs[Training jobs]
        
        Model_Training[Model Training <br/>'Docker container']
        
        Model_Deployment[Model Deployment <br/>'Docker container']
        
        Models[Models]
        Endpoints[Endpoints]
    end

    S3_Training[(S3 Training data)] --> Model_Training
    
    ECR[Amazon ECR <br/>'Docker images'] <--> Model_Training
    ECR --> Model_Deployment

    Model_Training --> Training_Jobs
    Training_Jobs --> S3_Artifacts[(S3 Model artifacts)]
    
    S3_Artifacts --> Model_Deployment
    Model_Deployment --> Models
    Models --> Endpoints
    
    %% External Traffic
    Endpoints <--> External_User(( ))

Docker containers are created from images, images are built from a Dockerfile, and images are stored in Amazon ECR (Elastic Container Registry). When we create a training job, we specify the Docker image to use. The training job pulls the specified image from Amazon ECR and runs it on the training data stored in S3. After the training job completes, the model artifacts are stored in S3. We can then create a model deployment using the same Docker image and the model artifacts from S3. The model deployment creates an endpoint that external users can call for inference.
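
The flow above can be sketched as the parameters for the SageMaker `CreateTrainingJob` API: ECR image in, S3 data in, S3 artifacts out. This is a minimal sketch; the image URI, bucket names, and role ARN are placeholders, and a real request may need more fields:

```python
def build_training_job_request(job_name: str) -> dict:
    """Assemble CreateTrainingJob parameters: Docker image from ECR,
    training data from S3, model artifacts written back to S3.
    All ARNs and URIs below are placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # Docker image pulled from Amazon ECR
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
            "TrainingInputMode": "File",
        },
        "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
        "InputDataConfig": [{
            "ChannelName": "train",  # appears under /opt/ml/input/data/train
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/training-data/",
            }},
        }],
        # Trained model artifacts land here after the job completes
        "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/artifacts/"},
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# With boto3 (not executed here):
# boto3.client("sagemaker").create_training_job(**build_training_job_request("demo-job"))
```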

The actual structure of a SageMaker training container looks like this:

/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files> (for inference)
├── code
│   └── <script files> (for training)
└── output
    └── failure

Structure of the Docker image

  • Workdir
    • nginx.conf: Configuration file for the Nginx web server. It defines how Nginx handles incoming requests, routes them to the application, and manages aspects of the server’s behavior such as load balancing, caching, and security settings.
    • predictor.py: Handles incoming inference requests and generates predictions using the trained model. It typically contains code to load the model, preprocess input data, and return predictions in the desired format.
    • serve: The program started when the container is launched for hosting. It typically starts a WSGI (Web Server Gateway Interface) server, such as a Flask app behind Gunicorn, that listens for incoming requests and routes them to predictor.py for processing.
    • train: The program invoked when the container is run for training. It defines the training logic, data loading, and model setup, and may also install dependencies or configure hyperparameters.
    • wsgi.py: Invoked by the WSGI server to start the application. It typically imports predictor.py and initializes the components needed to handle inference requests.

Production Variants

We can test multiple models on live traffic using Production Variants. Variant weights tell SageMaker how to distribute traffic among them, so we could roll out a new iteration of the model at, say, a 10% variant weight, and once we are satisfied with its performance, increase the weight to 100% and make it the default model.

This lets us do A/B tests and validate performance in real-world settings.
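
A hedged sketch of what the weight shift might look like via the `UpdateEndpointWeightsAndCapacities` API; the endpoint and variant names are made up:

```python
def build_weight_update(endpoint_name: str, canary_weight: float) -> dict:
    """Shift `canary_weight` of traffic to the new variant and the
    remainder to the existing one. Variant names are placeholders."""
    assert 0.0 <= canary_weight <= 1.0
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": "model-v1", "DesiredWeight": 1.0 - canary_weight},
            {"VariantName": "model-v2", "DesiredWeight": canary_weight},
        ],
    }

# With boto3 (not executed here):
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(
#     **build_weight_update("my-endpoint", 0.10))
```

Once the new variant looks healthy, the same call with a weight of 1.0 completes the rollout.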

Managing SageMaker Resources

Training and Inference Instance Types

We can choose instance types to control compute resources. For training we might use GPU instances such as ml.p3 or ml.g4dn, while for inference we can often use less computationally intensive instances such as ml.c5, since GPU instances can be very expensive.

EC2 Spot Instances can save up to 90% of the cost of On-Demand Instances. However, they can be interrupted by AWS with a two-minute warning when AWS needs the capacity back, so we need to checkpoint the training job to S3 so that training can resume after an interruption. Spot Instances can also increase total training time, since we may need to wait for spot capacity to become available.
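
In `CreateTrainingJob` terms, managed spot training is a few extra fields: enable spot, point checkpoints at S3, and allow extra wait time for capacity. A minimal sketch (the bucket name is a placeholder):

```python
def build_spot_training_config(max_runtime_s: int = 3600,
                               max_wait_s: int = 7200) -> dict:
    """Managed spot training fields for a CreateTrainingJob request.
    MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds; the difference
    is time we are willing to wait for spot capacity."""
    assert max_wait_s >= max_runtime_s
    return {
        "EnableManagedSpotTraining": True,
        # Checkpoints let the job resume after a spot interruption
        "CheckpointConfig": {
            "S3Uri": "s3://my-bucket/checkpoints/",
            "LocalPath": "/opt/ml/checkpoints",
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
    }
```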

Automatic Scaling

AWS supports automatic scaling for SageMaker endpoints. We can set up auto-scaling policies based on metrics such as CPU utilization, memory utilization, or custom metrics, and use CloudWatch to monitor these metrics and trigger scaling actions. This allows the endpoint to automatically scale up or down based on incoming traffic and resource utilization, handling varying workloads efficiently while optimizing costs. We should always load-test an auto-scaling configuration to make sure it works as expected.

SageMaker automatically attempts to distribute instances across Availability Zones, but we need more than one instance for this to happen. It is therefore recommended to run multiple instances for each production endpoint, and if we use a VPC, to configure at least two subnets in different Availability Zones for high availability and fault tolerance.
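
Endpoint auto-scaling goes through the Application Auto Scaling API: register the variant as a scalable target, then attach a target-tracking policy. A sketch with made-up names and a 70-invocations-per-instance target:

```python
def build_autoscaling_requests(endpoint: str, variant: str,
                               min_cap: int = 1, max_cap: int = 4):
    """Parameter dicts for Application Auto Scaling: a scalable target
    for the variant's instance count, and a target-tracking policy on
    the predefined invocations-per-instance metric."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_cap,
        "MaxCapacity": max_cap,
    }
    policy = {
        "PolicyName": f"{endpoint}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Aim for ~70 invocations per instance per minute
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy

# With boto3 (not executed here):
# aas = boto3.client("application-autoscaling")
# target, policy = build_autoscaling_requests("my-endpoint", "AllTraffic")
# aas.register_scalable_target(**target)
# aas.put_scaling_policy(**policy)
```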

Model Deployment

Deploying Models for Inference

There are three ways to deploy models for inference in SageMaker:

  • SageMaker JumpStart: Deploys pre-trained models to pre-configured endpoints with just a few clicks. It provides a library of pre-trained models and example notebooks to help us get started quickly.
  • ModelBuilder: Part of the SageMaker Python SDK, it provides a high-level interface for building and deploying machine learning models, letting us define the model, training configuration, and deployment settings in a simple and intuitive way.
  • AWS CloudFormation: A service for defining and provisioning AWS infrastructure as code. We can use CloudFormation templates to automate the deployment of SageMaker models and endpoints. It suits advanced users who want more control over the deployment process and want to integrate it with other AWS services, and it lets us track changes in Git and redeploy the entire stack instantly.

Different inference options

  • Real-time inference: For applications that require low latency and immediate responses. It deploys our models behind endpoints that external applications can call for real-time predictions.

  • Amazon SageMaker Serverless Inference: A fully managed option, ideal when the workload has idle periods and uneven traffic over time and can tolerate cold-start latency.

  • Asynchronous Inference: Queues requests and processes them asynchronously. We use it for large payloads (up to 1 GB) with long processing times but near-real-time latency requirements.

  • Autoscaling: Dynamically adjusts compute resources for endpoints based on traffic.

  • SageMaker Neo: Compiles and optimizes models for specific hardware targets, including AWS Inferentia chips, which can provide significant performance improvements for inference workloads.

SageMaker Serverless Inference

We need to specify the container, memory requirement, and concurrency requirements. The underlying infrastructure is automatically provisioned and scaled based on incoming traffic. It is charged based on the number of invocations and scales down to zero when there are no requests. It is monitored via CloudWatch metrics such as ModelSetupTime, Invocations, and MemoryUtilization. In short, it provides fully managed serverless endpoints for machine learning inference with pay-per-use pricing.
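
The configuration boils down to a `ServerlessConfig` block in `CreateEndpointConfig` instead of an instance type. A sketch with assumed memory and concurrency values:

```python
def build_serverless_endpoint_config(config_name: str, model_name: str) -> dict:
    """CreateEndpointConfig parameters for a serverless endpoint:
    only memory size and max concurrency are specified, no instance type."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,  # chosen in 1 GB increments
                "MaxConcurrency": 20,    # concurrent invocations allowed
            },
        }],
    }
```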

SageMaker Inference Recommender

SageMaker Inference Recommender is a tool that helps us optimize the performance of our machine learning models for inference. It provides recommendations on how to configure our SageMaker endpoints to achieve the best performance based on our specific use case and requirements.

For instance recommendations, we register the model in the model registry, and then use the Inference Recommender to run a series of tests on different instance types and configurations to determine the optimal setup for our model. The metrics collected during the tests include latency, throughput, and cost. Running load tests on the recommended instance types takes about 45 minutes to complete.

For endpoint recommendations, we can run custom load tests. We specify the instance types, traffic patterns, latency requirements, throughput requirements, and cost constraints. The Inference Recommender then analyzes the performance of the endpoint under different configurations and recommends how to optimize it for our use case, such as the number of instances, auto-scaling policies, and initial variant weights. The process takes about 2 hours.

Inference Pipelines

Inference pipelines allow us to chain a linear sequence of 2-15 containers together to perform inference. For example:

  • Container 1 (Pre-processing): Takes raw JSON, fills in missing values, and scales numbers (e.g., using Scikit-learn).
  • Container 2 (Prediction): Takes the cleaned data and runs the actual ML model (e.g., XGBoost or PyTorch).
  • Container 3 (Post-processing): Takes the raw probability (0.87) and converts it into a human-readable string (“High Risk”).

SageMaker supports Spark ML (via Glue or EMR) and scikit-learn containers in pipelines. It uses the MLeap serialization format for Spark ML models to enable high-performance deployment of these models directly within SageMaker.

Inference pipelines can handle both real-time inference and batch inference.
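
In API terms, a pipeline is a single `CreateModel` call whose `Containers` list is executed in order for every request. A sketch mirroring the three-container example above; image URIs and the artifact location are placeholders:

```python
def build_pipeline_model_request(model_name: str, role_arn: str) -> dict:
    """CreateModel parameters for an inference pipeline: each request
    flows through the Containers list in order."""
    ecr = "123456789012.dkr.ecr.us-east-1.amazonaws.com"
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "Containers": [
            {"Image": f"{ecr}/sklearn-preprocess:latest"},    # 1: pre-processing
            {"Image": f"{ecr}/xgboost:latest",
             "ModelDataUrl": "s3://my-bucket/model.tar.gz"},   # 2: prediction
            {"Image": f"{ecr}/postprocess:latest"},           # 3: post-processing
        ],
    }
```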

SageMaker Model Monitor

The idea is to get alerts on quality deviations on the deployed models (via CloudWatch).

It can visualize the data distribution and detect data drift, model performance drift, and feature importance drift. For example, if the distribution of input data changes significantly from the training data, the model may no longer perform well and may need to be retrained or updated: if salaries have risen significantly over the last 5 years due to inflation, a model trained on data from 5 years ago may not perform well on current data.

It can also detect anomalies, outliers, and new features.

The monitoring data is stored in S3 and monitoring jobs are scheduled via a monitoring schedule.

Metrics are emitted to CloudWatch, where we can set up alarms to send notifications so we can take corrective actions, such as retraining the model or auditing the data.

Model Monitor also integrates with TensorBoard, QuickSight, and Tableau, and we can visualize the monitoring data in SageMaker Studio.
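
A monitoring schedule is created with `CreateMonitoringSchedule`. The sketch below is heavily trimmed: a real request also needs the baseline, container image, and role, and the URIs are placeholders.

```python
def build_monitoring_schedule(schedule_name: str) -> dict:
    """A trimmed-down CreateMonitoringSchedule request: run a monitoring
    job hourly and write reports to S3."""
    return {
        "MonitoringScheduleName": schedule_name,
        "MonitoringScheduleConfig": {
            "ScheduleConfig": {
                # cron expression: run at the top of every hour
                "ScheduleExpression": "cron(0 * ? * * *)",
            },
            "MonitoringJobDefinition": {
                "MonitoringOutputConfig": {"MonitoringOutputs": [{
                    "S3Output": {
                        "S3Uri": "s3://my-bucket/monitoring-reports/",
                        "LocalPath": "/opt/ml/processing/output",
                    },
                }]},
            },
        },
    }
```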

SageMaker Clarify

SageMaker Clarify is a tool that helps us detect bias in our machine learning models. It provides insights into the fairness of our models by analyzing the data and the model’s predictions.

It can identify potential bias in the training data, such as imbalanced classes or underrepresented groups, and it can also analyze the model’s predictions to identify any disparities in performance across different demographic groups. This allows us to take corrective actions to mitigate bias and ensure that our models are fair and equitable.

It also helps us understand the importance of different features in our model and how they contribute to the predictions, which can help us identify any potential sources of bias and take steps to address them.

It can run before training to detect dataset bias, after training to explain model predictions using SHAP values, and in production to monitor bias drift. It is commonly used to build responsible and transparent AI systems.

It is explicitly triggered as a job in specific stages of a SageMaker ML workflow like:

SageMaker Processing Job → Clarify configured container runs analysis
DataPrep → Train → Evaluate → Clarify → Register Model → Deploy

We can use a Lambda function to automate running SageMaker Clarify after model deployment to continuously monitor for bias drift in production. For example, we can set up a CloudWatch Events (EventBridge) rule to trigger a Lambda function on a schedule (e.g., daily or weekly) that initiates a SageMaker Processing Job with the Clarify container to analyze the inference data and compare it to the training data for bias drift detection.

Event (S3 upload / API / schedule)
        ↓
AWS Lambda
        ↓
SageMaker CreateProcessingJob API
        ↓
Clarify container runs bias / SHAP analysis
        ↓
S3 outputs (reports)

Monitoring Types

  • Drift in data quality: Detects changes in the distribution of input data compared to the baseline you created; here “quality” means the statistical properties of the features.
  • Drift in model performance: It can detect changes in the performance of the model over time, such as changes in accuracy, precision, recall, or other relevant metrics. This can help us identify when the model’s performance is degrading and take corrective actions, such as retraining the model or updating it with new data.
  • Bias drift: It can detect changes in the fairness of the model’s predictions across different demographic groups. This can help us identify when the model is becoming less fair and take corrective actions to mitigate bias and ensure that our models are equitable.
  • Feature importance drift: It can detect changes in the importance of different features in the model’s predictions. This can help us identify when certain features are becoming more or less influential in the model’s predictions and take corrective actions to address any potential issues. It is based on Normalized Discounted Cumulative Gain (NDCG) score and this compares feature ranking of training vs. live data.

Model Monitor Data Capture

Data Capture logs the inputs to the endpoint and the inference outputs to S3 as JSON files. This data can be used for further training, debugging, and monitoring, and can be automatically compared against the baseline. It is supported for both real-time and batch Model Monitor modes, and for both Boto3 and the SageMaker Python SDK. The captured inference data can optionally be encrypted.
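
Data capture is turned on in the endpoint configuration via a `DataCaptureConfig` block. A sketch that samples 100% of requests and responses (bucket name is a placeholder):

```python
def build_data_capture_config() -> dict:
    """DataCaptureConfig block for CreateEndpointConfig: log both
    request payloads and model responses to S3 as JSON."""
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,  # capture every invocation
        "DestinationS3Uri": "s3://my-bucket/data-capture/",
        "CaptureOptions": [
            {"CaptureMode": "Input"},   # request payloads
            {"CaptureMode": "Output"},  # model responses
        ],
    }
```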

MLOps with SageMaker and Kubernetes

SageMaker natively supports the whole model lifecycle: data preprocessing, model training, model evaluation, model registration, model deployment, and model monitoring. However, some organizations prefer Kubernetes for their MLOps workflows due to its flexibility and scalability, or because part of the workflow runs on on-premises infrastructure. In that case we need to integrate SageMaker with the Kubernetes-based ML infrastructure. There are two main approaches:

  1. SageMaker Operators for Kubernetes: AWS provides SageMaker Operators for Kubernetes, which allows us to manage SageMaker resources directly from Kubernetes.
  2. Components for Kubeflow Pipelines: We can use Kubeflow Pipelines to orchestrate our MLOps workflows and integrate SageMaker as a component within those pipelines. This allows us to leverage the capabilities of both platforms and create a seamless workflow for our machine learning projects.

These methods enable hybrid ML workflows (on-premises and cloud) and allow us to leverage the strengths of both platforms for our MLOps needs. So we can use Kubernetes for orchestration and SageMaker for model training, deployment, and monitoring, creating a powerful and flexible MLOps workflow that can scale with our needs.

flowchart LR

    subgraph Kubernetes Cluster
        A[EKS Control Plane]

        subgraph Worker Nodes
            B[EC2 Node<br>Kubelet]
            C[Kubernetes Apps]
        end

        D[SageMaker Operator]
    end

    subgraph SageMaker Platform
        E[Training Jobs]
        F[Batch Transform]
        G[Inference Endpoints]
    end

    A --> B
    A --> D

    B --> C

    D --> E
    D --> F
    D --> G

There are also SageMaker Components for Kubeflow Pipelines, which let us use SageMaker for specific steps in our Kubeflow Pipelines, such as processing, hyperparameter tuning, training, and inference.

SageMaker Projects

SageMaker Projects is SageMaker Studio’s native MLOps solution with CI/CD. A project typically covers:

  1. Build images
  2. Prepare data, feature engineering
  3. Train models
  4. Evaluate models
  5. Deploy models
  6. Monitor and update models

It uses code repositories for building and deploying ML solutions, and SageMaker Pipelines to define the steps.

flowchart TB

%% =======================
%% MODEL BUILD PIPELINE
%% =======================

DS[Data Scientist commits code] --> Repo1[Repository #1<br/>Model building code]

Repo1 --> EB1[Amazon EventBridge]
EB1 --> model_build
subgraph model_build [CodePipeline Model Build]
CB1[AWS CodeBuild<br/>Run SageMaker Pipeline] --> SM_PIPE
subgraph SM_PIPE [SageMaker Pipelines]
    direction TB
    P1[Processing Job<br/>Data Preprocessing]
    T1[Training Job]
    P2[Processing Job<br/>Model Evaluation]
    R1[Register Model]

    P1 --> T1 --> P2 --> R1
end

P2 --> S3[(Amazon S3<br/>Model Artifacts)]
R1 --> REG[Model Registry<br/>SageMaker Model Registry]
end
DS -- Data Scientist approves model --> REG

%% =======================
%% MODEL DEPLOY PIPELINE
%% =======================

REG --> EB2[Amazon EventBridge]

MLOPS[MLOps Engineer updates deployment] --> Repo2[Repository #2<br/>Model deployment code]
Repo2 --> model_deploy
EB2 --> model_deploy
subgraph model_deploy [CodePipeline Model Deploy]

CB2[AWS CodeBuild<br/>Build CloudFormation templates for deployment]
CB2 --> CF1[AWS CloudFormation<br/>Deploy Staging Endpoint]

CF1 --> STAGING[SageMaker Hosting<br/>Staging Endpoint]
STAGING --> TEST[AWS CodeBuild<br/>Test Staging Endpoint]

TEST --> APPROVAL{Manual Approval}

APPROVAL -->|Approved| CF2[AWS CloudFormation<br/>Deploy Production Endpoint]
CF2 --> PROD[SageMaker Hosting<br/>Production Endpoint]
end

SageMaker Model Registry

SageMaker Model Registry is a fully managed AWS service purpose-built for cataloging ML models, managing model versions, tracking lineage and metadata, and implementing deployment approval workflows, with native integration across the entire SageMaker ML lifecycle. It provides a centralized repository for storing and managing machine learning models, making it easier for data scientists and ML engineers to collaborate and manage their models effectively.

The SageMaker Model Registry includes built-in approval status fields for all model versions: PendingManualApproval, Approved, and Rejected. When a model is registered as part of a pipeline run, teams can configure the pipeline to pause for manual review, then use the AWS SDK to update the model version’s status to Approved once review is complete. Deployment steps in the pipeline can be restricted to run only for approved model versions, directly meeting the manual approval requirement.
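
The status flip itself is a single `UpdateModelPackage` call. A sketch (the ARN is a placeholder):

```python
def build_approval_update(model_package_arn: str, approved: bool) -> dict:
    """UpdateModelPackage parameters to record the outcome of a
    manual review on a registered model version."""
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": "Approved" if approved else "Rejected",
    }

# With boto3 (not executed here):
# boto3.client("sagemaker").update_model_package(
#     **build_approval_update("arn:aws:sagemaker:us-east-1:123456789012:model-package/my-group/1", True))
```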

SageMaker Model Groups

A core construct within the SageMaker Model Registry that groups all versions of a model for a single use case, automates version numbering, and simplifies organization of related model artifacts, metadata, and deployment history without custom configuration.

ECS

EC2 Launch Type

ECS (Elastic Container Service) allows us to run and manage Docker containers on a cluster of EC2 instances. When we launch Docker containers on AWS, we launch ECS Tasks on ECS Clusters.

When we use the EC2 launch type, we manage the underlying EC2 instances ourselves. Each EC2 instance must run the ECS agent to register with the ECS cluster, and AWS takes care of starting and stopping containers.

graph TD
    subgraph ECS_Cluster ["Amazon ECS / ECS Cluster"]
        direction TB
        
        NewContainer["New Docker Container"]
        
        subgraph EC2_1 ["EC2 Instance"]
            direction TB
            C1_1["Docker Container"]
            C1_2["Docker Container"]
            Agent1["ECS Agent (Docker)"]
        end
        
        subgraph EC2_2 ["EC2 Instance"]
            direction TB
            C2_1["Docker Container"]
            C2_2["Docker Container"]
            C2_3["Docker Container"]
            Agent2["ECS Agent (Docker)"]
        end
        
        subgraph EC2_3 ["EC2 Instance"]
            direction TB
            C3_1["Docker Container"]
            C3_2["Docker Container"]
            Agent3["ECS Agent (Docker)"]
        end
        
        %% Connections
        NewContainer --> EC2_1
        NewContainer --> EC2_2
        NewContainer --> EC2_3
    end

Fargate Launch Type

We can just launch Docker containers on AWS and AWS just runs ECS task for us based on the CPU/RAM we need. To scale up, we just need to launch more tasks and AWS will take care of the rest. It is a serverless compute engine for containers.

IAM Roles for ECS

  • EC2 Instance Profile: Used only with the EC2 launch type; this role is used by the ECS agent to:

    • Make API calls to the ECS service.
    • Send container logs to CloudWatch Logs.
    • Pull Docker images from ECR.
    • Reference sensitive data in Secrets Manager or SSM Parameter Store.
  • ECS Task Role: This role is used by the containers running in ECS tasks:

    • Each task can have its own specific role.
    • Different roles can be used for different ECS services.
    • The task role is defined in the task definition.
  • Load Balancer Integration

    We can use an Application Load Balancer (ALB) or Network Load Balancer (NLB) to distribute traffic to the containers running in ECS tasks. ALB works for most use cases; NLB is recommended only for high-throughput/high-performance use cases, or to pair with AWS PrivateLink for secure access to services running in ECS.

Data Volumes EFS

We can mount EFS (Elastic File System) onto ECS tasks, and it works for both EC2 and Fargate launch types. Tasks running in any AZ can access the same EFS file system, which allows us to share data between tasks and persist data beyond the lifecycle of a single task. Combining the Fargate launch type with EFS gives a totally serverless architecture. Note that S3 cannot be mounted as a file system.

ECR

Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images.

It supports image vulnerability scanning, versioning, image tags, and image lifecycle policies, and it is integrated with AWS Identity and Access Management (IAM) for access control. We can use ECR to store and manage our Docker images, and then use those images to deploy our applications on ECS or Lambda.

EKS

Amazon Elastic Kubernetes Service is a way to launch a managed Kubernetes cluster on AWS. Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications.

EKS is an alternative to ECS with a similar goal but a different API. EKS supports both EC2 and Fargate launch types. EKS is recommended if the company already uses Kubernetes on premises or in another cloud and wants to migrate to AWS while keeping Kubernetes; Kubernetes is cloud-agnostic and can run on any cloud or on-premises infrastructure.

AWS Batch

AWS Batch is fully serverless: it runs batch jobs as Docker images and dynamically provisions the optimal quantity and type of compute resources (EC2 and Spot Instances) based on the volume and specific resource requirements of the submitted batch jobs. It is ideal for running large-scale parallel and high-performance computing (HPC) workloads in the cloud. It is not for real-time inference, but it suits batch inference well. We can schedule batch jobs using EventBridge (CloudWatch Events) and orchestrate them using Step Functions.

The difference between AWS Batch and Glue is that AWS Batch is for running batch jobs as Docker images and it is ideal for running large-scale parallel and high-performance computing (HPC) workloads in the cloud, while AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. Glue is more for data processing and transformation, while Batch is more for running batch jobs that may not necessarily be related to data processing.

Ingestion design

We can design PDF chunking and embedding as a two-tier pipeline. The default path is SQS + Lambda because most documents are small enough to process cheaply and with low operational overhead. When a document fails due to Lambda constraints, such as timeout, memory pressure, or OCR-heavy parsing, we escalate it to AWS Batch rather than retrying blindly in Lambda.

The key is failure classification. Transient failures like throttling or temporary network issues stay in the SQS/Lambda retry path. Invalid files go to a DLQ. Only resource-bound or long-running jobs are routed to Batch. We can also make the pipeline idempotent using a document ID and chunk IDs so retries do not create duplicate embeddings. This pattern is good because it optimizes cost for the common case while still handling large or complex PDFs reliably.
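
The failure-classification logic above can be sketched as a small router. The error categories and retry threshold here are illustrative, not an AWS API:

```python
# Hypothetical error taxonomy for a chunking pipeline; names are made up.
TRANSIENT = {"Throttling", "NetworkTimeout", "ServiceUnavailable"}
RESOURCE_BOUND = {"LambdaTimeout", "OutOfMemory", "OcrTooSlow"}

def route_failure(error_code: str, attempts: int, max_retries: int = 3) -> str:
    """Decide where a failed document goes next:
    - resource-bound errors escalate straight to AWS Batch,
    - transient errors stay in the SQS/Lambda retry path until retries run out,
    - everything else (invalid files) lands in the DLQ."""
    if error_code in RESOURCE_BOUND:
        return "batch"
    if error_code in TRANSIENT and attempts < max_retries:
        return "sqs-retry"
    return "dlq"
```

Pairing this with a deterministic document/chunk ID keeps retries idempotent, since reprocessing overwrites rather than duplicates embeddings.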

Deploying and Managing at Scale

CloudFormation

CloudFormation is an infrastructure as code (IaC) service that allows us to define and provision AWS infrastructure using code. It is a declarative way of outlining the AWS infrastructure for most resources.

Each resource within the stack is tagged with an identifier, so we can easily see the costs associated with each resource, and we can estimate costs from the CloudFormation template. We can also apply saving strategies: for example, automatically delete stacks at 5 PM and recreate them at 9 AM the next day, so we only pay for the resources during business hours.

It is very productive for managing AWS resources at scale, and we can get an automatically generated diagram of the templates. Because it is declarative, we can also leverage templates and documentation from the internet.

AWS CDK

AWS Cloud Development Kit (CDK) is an open-source software development framework that allows us to define cloud infrastructure using familiar programming languages.

We can write infrastructure code in languages like Python, TypeScript, Java, or C#, and then use the CDK to synthesize that code into CloudFormation templates. This allows us to leverage the power of programming languages, such as loops, conditionals, and functions, to create reusable and modular infrastructure code.

CodeDeploy

AWS CodeDeploy is a fully managed deployment service that automates software deployments to a variety of compute services, including Amazon EC2 instances and on-premises servers. Servers or instances must be provisioned and configured ahead of time with the CodeDeploy agent.

AWS CodeBuild

It allows us to compile source code, run tests, and produce software packages that are ready to deploy. It is fully managed and serverless. It is continuously scalable and highly available and we only pay for the build time.

AWS CodePipeline

It orchestrates the different steps needed to push code automatically to production. As an orchestration layer, it can get code from CodeCommit, build it with CodeBuild, and then deploy it with CodeDeploy or Elastic Beanstalk. It is fully managed and compatible with third-party tools like GitHub.

EventBridge

EventBridge is formerly known as CloudWatch Events.

  1. Scheduling: We can use EventBridge to schedule tasks, such as running a Lambda function every hour or triggering a batch job at a specific time.
  2. Event Pattern: Event Rules to react to a service doing something. For example, we can trigger SNS topic with an email notification when IAM Root User sign in Event is detected.

EventBridge is the default event bus for AWS services; there are also partner event buses for SaaS applications and custom event buses for our own applications.

Event buses can be accessed by other AWS accounts using resource-based policies. We can archive events (all or filtered) sent to an event bus, indefinitely or for a set period, and replay archived events for testing and debugging purposes.

EventBridge can analyze the events on our bus and infer their schema; the Schema Registry then lets us generate code for our application that knows in advance how the data in the event bus is structured. Schemas can be versioned. We can also manage permissions for a specific event bus, allowing or denying events from another AWS account or region, so we can aggregate all events from our AWS Organization in a single account or region.

Step Functions

Step Functions is used to design workflows and makes them easy to visualize. It has advanced error handling and retry mechanisms outside the code.

We can audit the history of workflows. It can keep stateful workflows running for up to 1 year, which is useful for long-running processes.

It can be used for orchestrating SageMaker training jobs, batch transform jobs, and model deployment. It can also be used to orchestrate AWS Batch jobs and Lambda functions. It is a powerful tool for building complex workflows that involve multiple AWS services and it can help us automate our MLOps processes.

A workflow is called a state machine and each step in a workflow is called a state. There are different types of states, such as Task state, Choice state, Parallel state, Map state, and Pass state. Each state can have its own error handling and retry policies, which allows us to build robust and resilient workflows.

  1. Task state: performs work using Lambda, another AWS service, or a third-party API.
  2. Choice state: branches the workflow based on conditions, like an if-else statement.
  3. Wait state: delays the workflow for a set duration or until a specific time.
  4. Parallel state: adds separate branches of execution.
  5. Map state: runs a set of steps for each item in a dataset, in parallel. This one is most relevant to data engineering and works with JSON arrays, S3 objects, and CSV files.
  6. There are also terminal and utility states: Pass, Fail, and Succeed.
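The state types above can be sketched as an Amazon States Language (ASL) definition. Here it is built as a Python dict with a Task, a Choice, and two terminal states; the Lambda ARN, state names, and the accuracy threshold are hypothetical.

```python
import json

# Minimal ASL sketch: run a task, branch on its result, then succeed or fail.
definition = {
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {
            "Type": "Task",  # hypothetical Lambda doing the work
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:train",
            "Next": "CheckAccuracy",
        },
        "CheckAccuracy": {
            "Type": "Choice",  # if-else style branching
            "Choices": [
                {"Variable": "$.accuracy",
                 "NumericGreaterThan": 0.9,
                 "Next": "Done"}
            ],
            "Default": "Failed",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}

# With credentials configured, the state machine could be created with:
#   boto3.client("stepfunctions").create_state_machine(
#       name="ml-pipeline", definition=json.dumps(definition), roleArn=...)
print(definition["StartAt"])
```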

Apache Airflow

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It is a powerful tool for orchestrating complex data pipelines and MLOps workflows. Airflow lets us define workflows as code using Python, which makes them easy to create, manage, and maintain: each workflow is Python code that creates a Directed Acyclic Graph (DAG).
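A minimal sketch of such a DAG file, assuming Airflow 2.x; the DAG id, task names, and schedule are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task bodies for an ML data-prep pipeline.
def extract():
    return "raw data"

def transform():
    return "clean data"

with DAG(
    dag_id="example_ml_prep",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # ">>" declares the DAG edge: extract runs before transform.
    extract_task >> transform_task
```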

Amazon MWAA provides a managed service for Apache Airflow, so we do not need to maintain it ourselves. It can be used for complex workflows, ETL coordination, and preparing ML data.

The DAGs (Python code) are uploaded to S3 (optionally zipped together with required plugins and a requirements file), and MWAA picks them up, then orchestrates and schedules the pipelines defined by each DAG.

Amazon MWAA (Airflow) runs within a VPC and is deployed across at least two Availability Zones for high availability. The Airflow web server can have a private or public endpoint, and we can use IAM to control access to it. To reach the web server from outside the VPC, we need to set up a public endpoint.

Airflow Workers can autoscale up to the limit we set, and scale back down to the minimum worker count when there are no tasks to run.

We can use it to:

  1. Orchestrate Complex workflows.
  2. ETL coordination.
  3. Preparing ML data.

Amazon MWAA leverages open-source integrations with AWS services, such as S3, Redshift, SageMaker, and Lambda, to enable seamless integration with our existing AWS infrastructure. The schedulers and workers themselves are AWS Fargate containers.

flowchart LR

%% =========================
%% Customer VPC
%% =========================
subgraph Customer_VPC["Customer VPC"]

    subgraph Schedulers["Airflow Schedulers"]
        S1[Scheduler 1]
        S2[Scheduler 2]
    end

    subgraph Workers["Airflow Workers"]
        BW[Base Worker]
        AW1[Additional Worker 1]
        AW2[Additional Worker 2]
    end

end


%% =========================
%% Service VPC
%% =========================
subgraph Service_VPC["Service VPC"]

    subgraph Metadata_DB["Metadata Database"]
        DBProxy[DB Proxy]
        MetaDB[(Meta Database)]
    end

    subgraph Web_Server["Airflow Web Server"]
        Web[Airflow Web Server]
    end

end


%% =========================
%% VPC Endpoints
%% =========================
DB_VPCE[Database VPCE]
WEB_VPCE[Web Server VPCE]


%% =========================
%% AWS Services
%% =========================
subgraph AWS_Services["Supporting AWS Services"]

    CW[CloudWatch]
    S3[S3]
    SQS[SQS]
    ECR[ECR]
    KMS[KMS]

end


%% =========================
%% User
%% =========================
User[User]


%% =========================
%% Internal Airflow Communication
%% =========================
S1 --- S2
BW --- AW1
BW --- AW2


%% Database connection
S1 --> DB_VPCE
S2 --> DB_VPCE
BW --> DB_VPCE
AW1 --> DB_VPCE
AW2 --> DB_VPCE

DB_VPCE --> DBProxy
DBProxy --> MetaDB


%% Web server access
S1 --> WEB_VPCE
BW --> WEB_VPCE
WEB_VPCE --> Web


%% User access
User -->|Public Network| Web


%% Workers using AWS services
BW --> CW
AW1 --> CW
AW2 --> CW

BW --> S3
AW1 --> S3
AW2 --> S3

BW --> SQS
AW1 --> SQS
AW2 --> SQS

BW --> ECR
AW1 --> ECR
AW2 --> ECR

BW --> KMS
AW1 --> KMS
AW2 --> KMS

VPC

A VPC is like a large, secure office building for your network, and a subnet is like an individual room or floor within that building. While a VPC spans an entire AWS Region (like all of N. Virginia), a subnet can only exist in one Availability Zone. When you create a VPC, you give it a large pool of private IP addresses (for example, about 65,000 addresses). You then carve that massive pool into smaller, manageable chunks—these are your subnets.

Subnets are the primary way you control what is allowed to talk to the internet.

Public Subnets: These are configured with a direct route to the outside internet. You put things here that the public needs to reach, like a public-facing web server or an Application Load Balancer.

Private Subnets: These have no direct route to the internet. Things inside a private subnet are hidden from the outside world. You put your sensitive backend systems here, like databases, internal APIs, and your MWAA Airflow workers.
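The address carving described above can be illustrated with Python's standard `ipaddress` module. The `10.0.0.0/16` CIDR is a common VPC example; which subnet ends up public or private is decided by its route table, not by the addressing, so the labels below are just for illustration.

```python
import ipaddress

# A /16 VPC block holds 65,536 addresses; carve it into /24 subnets
# of 256 addresses each.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc_cidr.subnets(new_prefix=24))

public_subnet = subnets[0]   # would get a route to an internet gateway
private_subnet = subnets[1]  # no internet route; databases, MWAA workers

print(len(subnets), public_subnet, private_subnet)
# → 256 10.0.0.0/24 10.0.1.0/24
```

(In practice AWS also reserves five addresses per subnet, so a /24 subnet yields 251 usable hosts.)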

AWS Lake Formation

AWS Lake Formation is a fully managed service that makes it easy to set up a secure data lake in the cloud. It is built on top of Glue. It can handle:

  1. Loading data and monitoring data flows from source to data lake.
  2. Setting up partitions.
  3. Managing encryption and keys.
  4. Defining transformation jobs and monitoring them.
  5. Access control and auditing.

A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You don’t need to structure the data first; you can store it “as-is.”

AWS Lake Formation is a governance and orchestration layer on top of S3 that automates data lake creation, manages metadata using Glue Data Catalog, and controls secure access for analytics services like Athena, Redshift, and EMR. If S3 is a vast library full of unorganized books, Lake Formation is the diligent librarian who categorizes, secures, and guides you to the right information efficiently.

flowchart LR
    subgraph Sources
        S3[S3]
        RDBMS[RDBMS]
        NoSQL[NoSQL]
        OnPrem[On-premises]
    end

    LakeFormation["Lake Formation: crawlers, ETL, data catalog, security, ACL, cleaning, transformations to Parquet or ORC; basically anything Glue can do"]

    subgraph Destinations
        Athena[Athena]
        Redshift[Redshift]
        EMR[EMR]
    end

    Sources --> LakeFormation


    LakeFormation --> Athena
    LakeFormation --> Redshift
    LakeFormation --> EMR

    S3DataLake[S3 Data Lake]
    LakeFormation <---> S3DataLake

Lake Formation itself is free, but the underlying services incur charges, such as S3, Glue, EMR, Athena, and Redshift.

graph LR
    Step1[Create an IAM user for Data Analyst] --> Step2[Create AWS Glue connection to your data sources]
    Step2 --> Step3[Create S3 bucket for the lake]
    Step3 --> Step4[Register the S3 path in Lake Formation, grant permissions]
    Step4 --> Step5[Create database in Lake Formation for data catalog, grant permissions]
    Step5 --> Step6[Use a blueprint for a workflow ie, Database snapshot]
    Step6 --> Step7[Run the workflow]
    Step7 --> Step8[Grant SELECT permissions to whoever needs to read it Athena, Redshift Spectrum, etc]
  1. Cross-account Lake Formation permissions: the recipient must be set up as a data lake administrator. We can use AWS RAM (Resource Access Manager) for accounts external to our organization. IAM permissions for cross-account access are relevant here too.
  2. Lake Formation does not support manifests in Athena or Redshift queries.
  3. IAM permissions on the KMS encryption key are needed for encrypted data catalogs in Lake Formation.
  4. IAM permissions are needed to create blueprints and workflows in Lake Formation.
  5. AWS Lake Formation supports "Governed Tables" that provide ACID transactions across multiple tables. It is a new type of S3 table, and you cannot change the governed setting afterwards. Governed tables work with streaming data such as Kinesis, and we can query them with Athena.
  6. It has automatic storage optimization with automatic compaction.
  7. It has granular access control with row- and cell-level security, for both governed and regular S3 tables.

We can tie permissions to IAM users/roles, SAML identities, or external AWS accounts. Policies can be applied to databases, tables, or columns, and we can grant specific permissions on tables or columns.
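A column-level grant like this can be sketched as the arguments for Lake Formation's `GrantPermissions` API. The account ID, role, database, table, and column names below are hypothetical; the parameter structure follows the API.

```python
# Grant SELECT on two specific columns of one table to an analyst role.
grant = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    "Resource": {
        "TableWithColumns": {          # column-level scope
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    "Permissions": ["SELECT"],
}

# With credentials configured, this would be sent with:
#   boto3.client("lakeformation").grant_permissions(**grant)
print(grant["Permissions"])
```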

Data filters in Lake Formation

With data filters we can:

  1. Ensure Column, row, or cell-level security.
  2. Apply when granting SELECT permission on tables.
  3. “All columns” + row filter = row-level security.
  4. “All rows” + column filter = column-level security.
  5. Specific columns + specific rows = cell-level security.
  6. Create filters via the console or via CreateDataCellsFilter API.
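As a sketch of the API route, the snippet below builds the `TableData` argument for `CreateDataCellsFilter`: specific columns plus a row filter, which gives cell-level security per the rules above. Database, table, filter name, and the filter expression are hypothetical.

```python
# Cell-level security = specific columns + a row filter expression.
table_data = {
    "TableCatalogId": "123456789012",   # hypothetical account ID
    "DatabaseName": "sales_db",
    "TableName": "orders",
    "Name": "eu_orders_no_pii",         # hypothetical filter name
    "RowFilter": {"FilterExpression": "region = 'EU'"},
    "ColumnNames": ["order_id", "order_total", "region"],
}

# With credentials configured, the filter would be created with:
#   boto3.client("lakeformation").create_data_cells_filter(TableData=table_data)
print(table_data["Name"])
```

Using `ColumnWildcard` instead of `ColumnNames` (all columns) with the same row filter would give row-level security only.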