SageMaker is built to handle the entire machine learning workflow.

graph TD
    A[Deploy model, evaluate results in production] --> B[Train and evaluate a model]
    C[Fetch, clean, prepare data] --> B
    B --> A
    B --> C

For training and deployment, the overall flow looks like this:

flowchart TB
    subgraph Client["SageMaker Training & Deployment Client app"]
        A[S3 Training Data]
    end
    
    subgraph SageMaker["SageMaker"]
        B[Model Training]
    end
    
    subgraph Deployment["SageMaker Endpoint"]
        C[Model Deployment/Hosting]
        D[SageMaker Endpoint]
    end
    
    subgraph ECR["ECR"]
        E[Training Code Image]
        F[Inference Code Image]
    end
    
    subgraph S3["S3"]
        G[S3 Model Artifacts]
    end
    
    A -->|input| B
    E -->|docker image| B
    B -->|output| G
    G -->|model data| C
    F -->|docker image| C
    C --> D
    Client -.->|invoke| D

All this can be done through a notebook instance in SageMaker Studio, which is an integrated development environment (IDE) for machine learning.

Data preparation on SageMaker

Data usually comes from S3, and SageMaker can also ingest data from Athena, EMR, Redshift, and Amazon Keyspaces. Spark can also be used for data preparation on SageMaker. Common packages like scikit-learn, XGBoost, etc. are available in SageMaker, and we can also use custom Docker images for data preparation.
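
A rough sketch of what this looks like in a notebook, assuming a hypothetical bucket and key and using boto3 plus pandas (both preinstalled in SageMaker environments):

    import boto3
    import pandas as pd

    # Placeholder bucket/key; point these at your own raw data in S3.
    bucket = "my-ml-bucket"
    key = "raw/churn.csv"

    # Copy the object from S3 to local disk, then load and clean it with pandas.
    boto3.client("s3").download_file(bucket, key, "churn.csv")
    df = pd.read_csv("churn.csv")
    df = df.dropna()          # example of a simple cleaning step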

SageMaker Processing

  1. Copy data from S3 to the processing container.
  2. Run the processing script (data cleaning, feature engineering, etc.) inside the container.
  3. Copy the processed data back to S3 for use in training or other steps in the ML workflow.
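
A minimal sketch of these three steps with the SageMaker Python SDK's SKLearnProcessor (the script name, role ARN, and S3 paths are placeholders):

    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    processor = SKLearnProcessor(
        framework_version="1.2-1",
        role="arn:aws:iam::123456789012:role/SageMakerRole",   # your execution role
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )

    processor.run(
        code="preprocess.py",                       # script run inside the container (step 2)
        inputs=[ProcessingInput(                    # step 1: copy data from S3 into the container
            source="s3://my-ml-bucket/raw/",
            destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(                  # step 3: copy results back to S3
            source="/opt/ml/processing/output",
            destination="s3://my-ml-bucket/processed/")],
    )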

Training on SageMaker

Create a training job that specifies the URL of the S3 bucket containing the training data, the ML compute resources to use, the URL of the S3 bucket for the output, and the ECR path to the training code.

Training options:

  • Built-in training algorithms
  • Spark MLlib
  • Custom TensorFlow / MXNet code
  • PyTorch, Scikit-Learn, RLEstimator
  • XGBoost, Hugging Face, Chainer
  • Your own Docker image
  • Algorithm purchased from the AWS Marketplace
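
A minimal sketch of such a training job with the SDK's generic Estimator (the image URI, role ARN, and S3 paths are placeholders):

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<ECR path to the training image>",            # built-in algorithm or your own image
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",                            # ML compute resources
        output_path="s3://my-ml-bucket/output/",                 # S3 location for model artifacts
    )

    # Launches the training job; "train" is the input channel the algorithm reads from.
    estimator.fit({"train": "s3://my-ml-bucket/processed/train/"})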

Deploying Trained Models

Save model to S3 then:

  • Persistent endpoint for making individual predictions on demand.
  • SageMaker Batch Transform to get predictions for an entire dataset.
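
Continuing the earlier estimator sketch, both options look roughly like this (instance types, S3 paths, and the payload are placeholders):

    # Persistent real-time endpoint for individual predictions on demand.
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",
    )
    result = predictor.predict(payload)     # payload format depends on the algorithm

    # Batch Transform to score an entire dataset offline.
    transformer = estimator.transformer(
        instance_count=1,
        instance_type="ml.m5.large",
        output_path="s3://my-ml-bucket/batch-output/",
    )
    transformer.transform("s3://my-ml-bucket/batch-input/", content_type="text/csv")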

SageMaker Modes

Input modes

S3 File Mode: Copies training data from S3 to a local directory in the Docker container.

S3 Fast File Mode: Akin to "pipe mode": training can begin without waiting to download the data. Supports random access, but works best with sequential access.

Pipe Mode: Streams data directly from S3; largely superseded by Fast File Mode.

Amazon S3 Express One Zone: High-performance storage class in a single AZ; works with File, Fast File, and Pipe modes.

Amazon FSx for Lustre: High-performance file system that can be used as a data source for SageMaker training jobs. Scales to 100 GB/s of throughput and millions of IOPS with low latency. Single AZ; requires a VPC configuration (no public internet access).

Amazon EFS: Requires the data to already be in EFS (Elastic File System); also requires a VPC.
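
The input mode is just a setting on the training input; a sketch using TrainingInput (the S3 path is a placeholder):

    from sagemaker.inputs import TrainingInput

    # "File" and "Pipe" are the other accepted values for input_mode.
    train_input = TrainingInput(
        s3_data="s3://my-ml-bucket/processed/train/",
        content_type="text/csv",
        input_mode="FastFile",
    )

    estimator.fit({"train": train_input})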

SageMaker’s Built-in Algorithms

Linear Learner

Linear regression and logistic regression for regression and classification tasks. Basically, we fit a line to the training data and make predictions based on that line.

  • RecordIO-wrapped protobuf: Float32 data only.
  • CSV: First column assumed to be the label. File and pipe modes supported.

Preprocessing

Training data must be normalized and input data should be shuffled.

Training: Uses stochastic gradient descent (Adam, AdaGrad, SGD, etc.). Multiple models are optimized in parallel. Tune L1 and L2 regularization.

Validation: The optimal model is selected.

Hyperparameters

  • balance_multiclass_weights: Whether to balance the weights for multiclass classification.
  • learning_rate, mini_batch_size, epochs, etc.
  • L1 and L2 regularization parameters.
  • Weight decay: A regularization technique that adds a penalty to the loss function based on the magnitude of the model’s weights. This helps prevent overfitting by discouraging the model from assigning too much importance to any single feature.
  • target_precision: Used with binary_classifier_model_selection_criteria = recall_at_target_precision; the algorithm holds precision at this value while maximizing recall.
  • target_recall: Used with binary_classifier_model_selection_criteria = precision_at_target_recall; the algorithm holds recall at this value while maximizing precision. This "tuning during training" avoids separate hyperparameter tuning jobs and takes advantage of Linear Learner's efficient SGD-based optimization.
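
A sketch of training Linear Learner as a built-in algorithm with a few of the hyperparameters above (the role ARN, S3 paths, and hyperparameter values are illustrative, not recommendations):

    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    # Resolve the built-in Linear Learner container image for the region.
    image = image_uris.retrieve("linear-learner", region="us-east-1")

    ll = Estimator(
        image_uri=image,
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-ml-bucket/ll-output/",
    )

    ll.set_hyperparameters(
        predictor_type="binary_classifier",
        learning_rate=0.01,
        mini_batch_size=200,
        epochs=15,
        l1=0.0,            # L1 regularization
        wd=0.01,           # weight decay (L2 regularization)
    )

    ll.fit({"train": TrainingInput("s3://my-ml-bucket/ll/train/", content_type="text/csv")})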

XGBoost

eXtreme Gradient Boosting (XGBoost) is an optimized distributed gradient boosting library designed to be highly efficient.

It boosts a group of decision trees; new trees are made to correct the errors of previous trees. It uses gradient descent to minimize loss as new trees are added.

It can be used both for classification and regression tasks.

Models are serialized and deserialized with pickle.
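
For the built-in container, the model artifact in S3 is a model.tar.gz whose contents can be unpickled locally; a sketch assuming the archive has already been downloaded and uses the container's default file name:

    import pickle as pkl
    import tarfile

    # Unpack the artifact produced by the training job.
    with tarfile.open("model.tar.gz") as tar:
        tar.extractall()

    # "xgboost-model" is the default file name used by the built-in XGBoost container.
    booster = pkl.load(open("xgboost-model", "rb"))   # an xgboost.Booster object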

Hyperparameters

  • Subsample: Fraction of the training data to be used for growing each tree. It helps prevent overfitting by introducing randomness into the training process.
  • Eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • Gamma: Minimum loss reduction required to create a partition; a larger value makes the algorithm more conservative.
  • Alpha: L1 regularization term on weights. Increasing this value will make the model more conservative.
  • Lambda: L2 regularization term on weights. Increasing this value will make the model more conservative.
  • eval_metric: Optimize on AUC, error, RMSE, etc. If you care about false positives more than overall accuracy, you might use AUC.
  • scale_pos_weight: Adjust balance of positive and negative weights, useful for unbalanced classes. A value greater than 1 will give more weight to the positive class, while a value less than 1 will give more weight to the negative class.
  • max_depth: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. Decreasing this value will make the model simpler and less likely to overfit.
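
A sketch of training the built-in XGBoost algorithm with some of these hyperparameters (the container version, role ARN, S3 paths, and values are illustrative):

    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    # Built-in XGBoost container; "1.7-1" is one of the published versions.
    image = image_uris.retrieve("xgboost", region="us-east-1", version="1.7-1")

    xgb = Estimator(
        image_uri=image,
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-ml-bucket/xgb-output/",
    )

    xgb.set_hyperparameters(
        objective="binary:logistic",
        num_round=100,
        max_depth=5,           # deeper trees = more complex, more prone to overfit
        eta=0.2,               # step size shrinkage
        gamma=4,               # minimum loss reduction to split
        subsample=0.8,         # fraction of training data per tree
        eval_metric="auc",
        scale_pos_weight=2,    # up-weight the positive class for imbalanced data
    )

    xgb.fit({
        "train": TrainingInput("s3://my-ml-bucket/xgb/train/", content_type="text/csv"),
        "validation": TrainingInput("s3://my-ml-bucket/xgb/validation/", content_type="text/csv"),
    })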