SageMaker is built to handle the entire machine learning workflow.
```mermaid
graph TD
    A[Deploy model, evaluate results in production] --> B[Train and evaluate a model]
    C[Fetch, clean, prepare data] --> B
    B --> A
    B --> C
```
The training and deployment workflow looks like this:
```mermaid
flowchart TB
    subgraph Client["SageMaker Training & Deployment Client app"]
        A[S3 Training Data]
    end
    subgraph SageMaker["SageMaker"]
        B[Model Training]
    end
    subgraph Deployment["SageMaker Endpoint"]
        C[Model Deployment/Hosting]
        D[SageMaker Endpoint]
    end
    subgraph ECR["ECR"]
        E[Training Code Image]
        F[Inference Code Image]
    end
    subgraph S3["S3"]
        G[S3 Model Artifacts]
    end
    A -->|input| B
    E -->|docker image| B
    B -->|output| G
    G -->|model data| C
    F -->|docker image| C
    C --> D
    Client -.->|invoke| D
```
All this can be done through a notebook instance in SageMaker Studio, which is an integrated development environment (IDE) for machine learning.
Data preparation on SageMaker
Data usually comes from S3, and SageMaker can also ingest from Athena, EMR, Redshift, and Amazon Keyspaces. Spark can also be used for data preparation on SageMaker. The usual packages like scikit-learn, XGBoost, etc. are available, and we can also use custom Docker images for data preparation.
SageMaker Processing
A processing job will:
- Copy data from S3 to the processing container.
- Run the processing script (data cleaning, feature engineering, etc.) inside the container.
- Copy the processed data back to S3 for use in training or other steps in the ML workflow.
Training on SageMaker
Create a training job that specifies:
- The S3 URL of the training data
- The ML compute resources to use
- The S3 URL of the output bucket
- The ECR path to the training code
Training options:
- Built-in training algorithms
- Spark MLlib
- TensorFlow / PyTorch / MXNet code
- Scikit-Learn, RLEstimator, XGBoost, Hugging Face, Chainer
- Your own Docker image
- An algorithm purchased from the AWS Marketplace
Deploying Trained Models
Save model to S3 then:
- Persistent endpoint for making individual predictions on demand
- SageMaker Batch Transform to get predictions for an entire dataset.
SageMaker Modes
Input modes
- S3 File Mode: Copies the training data from S3 to a local directory in the Docker container.
- S3 Fast File Mode: Akin to pipe mode; training can begin without waiting for the data to download. Supports random access, but works best with sequential access.
- Pipe mode: Streams data directly from S3; largely superseded by Fast File mode.
- Amazon S3 Express One Zone: High-performance storage class in a single AZ; works with file, fast file, and pipe modes.
- Amazon FSx for Lustre: High-performance file system that can be used as a data source for SageMaker training jobs. Scales to 100 GB/s of throughput and millions of IOPS with low latency. Single-AZ; requires a VPC.
- Amazon EFS: Requires the data to already be in EFS (Elastic File System); requires a VPC.
SageMaker’s Built-in Algorithms
Linear Learner
Linear regression and logistic regression for regression and classification tasks. Essentially, we fit a line to the training data and make predictions based on that line.
- RecordIO-wrapped protobuf: Float32 data only.
- CSV: First column assumed to be the label. File and pipe modes supported.
Preprocessing
Training data must be normalized and input data should be shuffled.
Training: Uses stochastic gradient descent (Adam, AdaGrad, SGD, etc.). Multiple models are optimized in parallel. Tune L1 and L2 regularization.
Validation: Most optimal model is selected.
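As a conceptual illustration of this training loop (not SageMaker's actual implementation), here is a minimal numpy sketch of linear regression trained with per-sample SGD, shuffling each epoch and applying an L2 penalty:

```python
import numpy as np

# Toy linear regression with per-sample SGD and an L2 penalty,
# sketching what Linear Learner does conceptually.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5        # true weights and bias

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(50):
    for i in rng.permutation(len(X)):      # shuffle each epoch
        err = X[i] @ w + b - y[i]
        w -= lr * (err * X[i] + 0.01 * w)  # gradient step with L2 penalty
        b -= lr * err

print(np.round(w, 1), round(b, 1))         # close to [2, -1] and 0.5
```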
Hyperparameters
- balance_multiclass_weights: Whether to balance the weights for multiclass classification.
- learning_rate, mini_batch_size, num_epochs, etc.
- L1 and L2 regularization parameters.
- Weight decay: A regularization technique that adds a penalty to the loss function based on the magnitude of the model’s weights. This helps prevent overfitting by discouraging the model from assigning too much importance to any single feature.
- target_precision: Used with recall_at_target_precision; the algorithm holds precision at this value while maximizing recall.
- target_recall: Used with precision_at_target_recall; it holds recall at this value while maximizing precision. The algorithm then selects the model that meets the chosen criterion. This "tuning-in-training" avoids separate hyperparameter tuning jobs and uses efficient SGD optimizations.
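The selection criterion can be sketched as a simple threshold scan; the function name and data below are illustrative, not the Linear Learner API:

```python
import numpy as np

# Sketch of precision_at_target_recall: scan score thresholds and keep the
# one with the best precision while recall stays >= the target.
def precision_at_target_recall(scores, labels, target_recall):
    best = None
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        if tp == 0:
            continue
        precision = tp / pred.sum()
        recall = tp / (labels == 1).sum()
        if recall >= target_recall and (best is None or precision > best[0]):
            best = (precision, recall, t)
    return best

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2])
labels = np.array([0,   0,   1,    1,   1,    0])
print(precision_at_target_recall(scores, labels, 0.66))
```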
Ensemble methods
Many ensemble methods are based on decision trees, which are a type of model that makes predictions by learning simple decision rules inferred from the data features.
Decision Tree
Decision trees use a greedy, recursive partitioning strategy to find the optimal split at each node. At each step:
- Evaluate all features: The algorithm considers every input variable as a potential splitter
- Find the best split point: For each feature, it tests all possible values to find the threshold that produces the most homogeneous child nodes
- Choose the optimal feature: The feature and split point that maximize purity (or minimize impurity) are selected
Splitting Criteria:
- Classification: Gini impurity, entropy (information gain), misclassification error.
- Regression: Mean squared error (MSE), Sum of squared errors (SSE), Mean absolute error (MAE).
How split points are chosen
For continuous features: the algorithm sorts all unique values of the feature and tests every threshold between consecutive values as a potential split point. A feature with n unique values yields n−1 candidate split points. The algorithm calculates the impurity reduction for each candidate and selects the threshold that yields the highest purity gain.
For categorical features: the algorithm evaluates splits based on the categorical values themselves. For a feature with k categories, it considers various ways to partition these categories into two groups (for binary trees).
Gini impurity is calculated as: $$Gini = 1 - \sum_{i=1}^{C} p_i^2$$ where $p_i$ is the proportion of samples belonging to class $i$ in the node, and $C$ is the total number of classes. A lower Gini impurity indicates a more homogeneous node.
Entropy is calculated as: $$Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)$$ where $p_i$ is the proportion of samples belonging to class $i$ in the node. A lower entropy indicates a more homogeneous node.
Information Gain is calculated as: $$Information\ Gain = Entropy(parent) - \sum_{k=1}^{n} \frac{N_k}{N} Entropy(child_k)$$ where $N$ is the total number of samples in the parent node, $N_k$ is the number of samples in child node $k$, and $n$ is the number of child nodes. The algorithm selects the split that maximizes information gain.
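The three formulas above can be computed directly from class counts; a small sketch:

```python
import numpy as np

# Impurity measures from class counts, matching the formulas above.
def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                     # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    n = np.sum(parent)
    weighted = sum(np.sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# A balanced parent split perfectly into two pure children:
print(gini([5, 5]))                                 # → 0.5
print(information_gain([5, 5], [[5, 0], [0, 5]]))   # → 1.0
```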
Stop Criteria:
- The recursive splitting continues until a stopping criterion is met.
- The node becomes pure (contains only one class)
- A maximum depth is reached
- The number of samples in a node falls below a minimum threshold
- Further splitting no longer improves predictions significantly
Ensemble learning combines several learners (models) to improve overall performance. The variance of the general model decreases thanks to the bagging technique and the bias of the general model decreases thanks to the boosting technique. Random Forest is a bagging method, while XGBoost and LightGBM are boosting methods.
- Bagging: Bootstrap Aggregating, it trains multiple models on different subsets of the training data and then combines their predictions. It can help reduce variance and prevent overfitting. The selection of subset is done by random sampling with replacement, which means that some data points may be included multiple times in the same subset, while others may not be included at all.
- Boosting: In boosting, models are trained sequentially, with each model trying to correct the errors of the previous one. It can help reduce bias and improve the performance of weak learners. bagging typically involves simple averaging of models, while boosting assigns weights based on accuracy. AdaBoost, Gradient Boosting, XGBoost, and LightGBM are popular boosting algorithms.
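A quick numpy check of the sampling-with-replacement behavior bagging relies on: in a bootstrap sample of size n, roughly 1 − 1/e ≈ 63% of the original points appear at least once, and the rest are left out.

```python
import numpy as np

# Bootstrap sampling as used by bagging: draw n indices with replacement,
# so some points repeat and roughly 63% of points appear at all.
rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)        # one bootstrap subset
frac_included = len(np.unique(sample)) / n
print(frac_included)                       # close to 1 - 1/e ≈ 0.632
```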
Homogeneous vs Heterogeneous Ensembles
- Homogeneous Ensembles: All the models in the ensemble are of the same type, for example an ensemble of decision trees (Random Forest). Each model is trained on a different sample or weighting of the data, so that each model focuses on different aspects of it.
- Heterogeneous Ensembles: The models in the ensemble are of different types. For example, an ensemble that combines decision trees, linear models, and neural networks.
XGBoost
eXtreme Gradient Boosting (XGBoost) is an optimized distributed gradient boosting library designed to be highly efficient.
It boosts a group of decision trees; each new tree is made to correct the errors of previous trees. It uses gradient descent to minimize the loss as new trees are added.
It can be used both for classification and regression tasks.
Models are serialized and deserialized with pickle.
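A toy sketch of the boosting idea (not XGBoost itself): each new decision stump is fit to the residuals of the current ensemble, then added with a learning rate (eta).

```python
import numpy as np

# Toy gradient boosting for squared loss: each stump fits the residuals
# (negative gradient) of the current ensemble.
def fit_stump(x, r):
    best = (np.inf, None)
    for t in np.unique(x)[:-1]:            # candidate thresholds
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (t, left.mean(), right.mean()))
    return best[1]

def stump_predict(stump, x):
    t, lo, hi = stump
    return np.where(x <= t, lo, hi)

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
pred, eta = np.zeros_like(y), 0.3
for _ in range(100):
    stump = fit_stump(x, y - pred)         # fit the current residuals
    pred += eta * stump_predict(stump, x)  # shrink by the learning rate

print(np.mean((y - pred) ** 2))            # training MSE shrinks toward 0
```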
Hyperparameters
- Subsample: Fraction of the training data to be used for growing each tree. It helps prevent overfitting by introducing randomness into the training process.
- Eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, eta shrinks the new feature weights to make the boosting process more conservative.
- Gamma: minimum loss reduction to create a partition; larger means more conservative algorithm.
- Alpha: L1 regularization term on weights. Increasing this value will make the model more conservative.
- Lambda: L2 regularization term on weights. Increasing this value will make the model more conservative.
- eval_metric: Metric to optimize on: AUC, error, rmse, etc. If you care about false positives more than accuracy, you might use AUC.
- scale_pos_weight: Adjust balance of positive and negative weights, useful for unbalanced classes. A value greater than 1 will give more weight to the positive class, while a value less than 1 will give more weight to the negative class.
- max_depth: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. Decreasing this value will make the model simpler and less likely to overfit.
LightGBM
LightGBM is also a gradient-boosted decision tree algorithm, but it uses a different approach to build trees. It uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to speed up training and reduce memory usage. GOSS focuses on the instances with larger gradients, while EFB bundles mutually exclusive features together to reduce the number of features. It is better suited to large datasets and high-dimensional data, and it can be faster than XGBoost in some cases.
It is a single- or multi-instance CPU algorithm, and it can be used for classification and regression tasks. It is a memory-bound algorithm; M5 EC2 instances are recommended for training.
Hyperparameters
- learning_rate: Similar to eta in XGBoost; controls the step size shrinkage used in updates to prevent overfitting.
- num_leaves: Maximum number of leaves in one tree.
- feature_fraction: Fraction of features randomly selected for each tree.
- bagging_fraction: Fraction of the training data randomly sampled for each tree.
- bagging_freq: how often bagging is done.
- max_depth: Maximum depth of a tree.
- min_data_in_leaf: Minimum number of data points in a leaf.
Comparison between XGBoost and LightGBM
Both build boosted decision trees; LightGBM's GOSS sampling and EFB feature bundling generally make it faster and lighter on memory for large, high-dimensional datasets, while XGBoost offers extensive regularization controls (alpha, lambda, gamma).
Seq2Seq
Input sequence is transformed into a fixed-length vector by an encoder, and then the decoder generates the output sequence from the vector. It is widely used in tasks like machine translation, text summarization, etc. The encoder and decoder can be implemented using RNNs, LSTMs, GRUs, or Transformers.
We need to provide training data, validation data, and a vocabulary file for seq2seq training. The model can only be trained on a single machine, but that machine can have multiple GPUs; P3 instances are well suited for this kind of training.
Hyperparameters
- batch_size: Number of training examples used in one iteration.
- optimizer_type: adam, sgd, rmsprop, etc.
- num_layers_encoder: Number of layers in the encoder.
- num_layers_decoder: Number of layers in the decoder.
- Can optimize on:
Accuracy (compare to validation dataset): $$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
BLEU score (Bilingual Evaluation Understudy): A metric for evaluating the quality of generated text by comparing it to one or more human reference translations. It focuses on precision: how many of the generated n-grams appear in the reference. $$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$ Where $BP$ is the brevity penalty, $w_n$ are the weights for n-grams, and $p_n$ is the precision for n-grams of size $n$. The brevity penalty is calculated as: $$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}$$ where $c$ is the length of the candidate translation and $r$ is the length of the reference translation.
ROUGE score (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics for evaluating automatic summarization and machine translation. It focuses on recall: how many of the reference n-grams appear in the generated text. $$\text{ROUGE-N} = \frac{\sum_{S \in \text{references}} \sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{S \in \text{references}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})}$$
Perplexity (Cross-entropy). $$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1 \ldots w_{i-1})\right) = \exp(H)$$
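Perplexity is easy to compute directly from per-token probabilities; a minimal sketch:

```python
import numpy as np

# Perplexity from per-token probabilities: exp of the average negative
# log-likelihood. A uniform model over V words has perplexity exactly V.
def perplexity(token_probs):
    return float(np.exp(-np.mean(np.log(token_probs))))

print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # → 4.0
```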
DeepAR
DeepAR is a forecasting algorithm that uses recurrent neural networks (RNNs) to model time series data. It allows us to train the same model over several related time series.
BlazingText
BlazingText is for text classification and word embeddings. It is based on the Word2Vec algorithm, which learns word embeddings by predicting a word from its surrounding context (CBOW), by predicting the context given a word (Skip-gram), or with Batch Skip-gram (distributed computation over many CPU nodes). BlazingText can be used for tasks like sentiment analysis, topic classification, etc. It works on individual words, not whole sentences or documents.
Object2Vec
Like word2vec, but for objects. It learns vector representations of objects based on their co-occurrence in the data. It can be used for tasks like recommendation systems, anomaly detection, etc. It works on objects, which can be users, products, etc.
We can process data into JSON lines and shuffle it.
{"label": "cluster_1", "text": "I love this product!"}
{"label": "cluster_2", "text": "This is the worst experience I've ever had."}
Train with two input channels, two encoders, and a comparator. The encoders can be average-pooled embeddings, CNNs, or bidirectional LSTMs. The comparator is followed by a feed-forward neural network.
Computer vision
SageMaker support frameworks like TensorFlow, PyTorch, MXNet, etc. for computer vision tasks. It also has built-in algorithms like Image Classification and Object Detection. We can also use custom algorithms for computer vision tasks.
Random Cut Forest
Random Cut Forest (RCF) is an algorithm for anomaly detection: breaks in periodicity, unclassified data points, etc. It also assigns an anomaly score to each data point.
It creates a forest of trees where each tree is a partition of the training data. It then looks at the expected change in the complexity of a tree as a result of adding a new data point. The data is sampled randomly and the trees are trained on the samples. RCF can work with Kinesis Analytics for real-time anomaly detection on streaming data.
Hyperparameters
- num_trees: Number of trees in the forest; more trees can reduce noise.
- num_samples_per_tree: Number of samples used to build each tree. Smaller samples make the model more sensitive to anomalies but also noisier. 1/num_samples_per_tree should approximate the ratio of anomalous points to normal data.
Neural Topic Model
Neural Topic Model (NTM) is a topic modeling algorithm that uses neural networks to learn the underlying topics in a collection of documents. It is based on an unsupervised learning algorithm, Neural Variational Inference. It organizes documents into topics and can classify or summarize documents based on those topics. We need to define how many topics we want; these topics are a latent representation based on top-ranking words.
LDA
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for topic modeling. It assumes that documents are a mixture of topics and that each topic is a distribution over words. LDA uses a Dirichlet distribution to model the topic distribution for each document and the word distribution for each topic. It is also unsupervised, and it can be used for tasks like document classification, information retrieval, etc.
Hyperparameters
- num_topics: Number of topics to be extracted from the documents.
- alpha: Hyperparameter for the Dirichlet distribution on the per-document topic distributions. A higher value of alpha leads to more topics being assigned to each document, while a lower value leads to fewer topics being assigned to each document.
KNN
Supervised learning algorithm for classification and regression tasks. It works by finding the k nearest neighbors of a data point and predicting the majority class (for classification) or the average value (for regression) of those neighbors. It is non-parametric: it makes no assumptions about the underlying data distribution. It can be used for tasks like image classification, text classification, etc.
Hyperparameters
- n_neighbors: Number of neighbors to use for making predictions. A smaller value of n_neighbors can lead to a more flexible model that captures local patterns in the data, while a larger value can lead to a smoother model that captures global patterns in the data.
- sample_size: Number of samples to be used for training the model.
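A minimal sketch of the k-NN prediction rule described above (illustrative, not SageMaker's implementation):

```python
import numpy as np

# Minimal k-NN classification: majority vote among the k nearest
# training points by Euclidean distance.
def knn_predict(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))  # → 0
print(knn_predict(X, y, np.array([5.5, 5.5]), k=3))  # → 1
```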
K-Means
Unsupervised learning algorithm for clustering tasks. It divides data into K groups, where members of a group are as similar as possible to each other. We define what "similar" means by choosing a distance metric, like Euclidean distance, Manhattan distance, etc. It is an iterative algorithm that starts with K random centroids, assigns each data point to the nearest centroid, then updates the centroids based on the mean of the assigned data points. It can be used for tasks like customer segmentation, image segmentation, etc.
Given a dataset:
$$ X = \{x_1, x_2, \dots, x_n\}, \quad x_i \in \mathbb{R}^d $$
We want to partition the data into $K$ clusters by minimizing the within-cluster sum of squares:
$$ J(\mu, C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 $$
Where:
- $C_k$ is the set of points assigned to cluster $k$
- $\mu_k \in \mathbb{R}^d$ is the centroid of cluster $k$
Step 1 — Initialize Centroids:
$$ \mu_1^{(0)}, \mu_2^{(0)}, \dots, \mu_K^{(0)} $$
Step 2 — Repeat Until Convergence
For iteration $t = 1,2,\dots$: $$ C_i^{(t)} = \arg\min_{k \in \{1,\dots,K\}} \|x_i - \mu_k^{(t-1)}\|^2 $$
This defines clusters:
$$ C_k^{(t)} = \{\, x_i \mid C_i^{(t)} = k \,\} $$
Recompute centroids as the mean of assigned points:
$$ \mu_k^{(t)} = \frac{1}{|C_k^{(t)}|} \sum_{i : C_i^{(t)} = k} x_i $$
Step 3 — Convergence Criterion Stop if:
$$ \max_k \|\mu_k^{(t)} - \mu_k^{(t-1)}\| < \varepsilon $$
or if assignments no longer change.
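The alternating steps above, as a plain numpy sketch of Lloyd's algorithm:

```python
import numpy as np

# k-means (Lloyd's algorithm): assign points to nearest centroid,
# recompute centroids as cluster means, stop when centroids settle.
def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # Step 1: init
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Step 2a: assignment step
        labels = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 2b: update step
        new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.max(np.abs(new_mu - mu)) < 1e-9:          # Step 3: convergence
            break
        mu = new_mu
    return mu, labels

# Two well-separated blobs around (0, 0) and (5, 5)
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (50, 2)),
               np.random.default_rng(2).normal(5, 0.1, (50, 2))])
mu, labels = kmeans(X, K=2)
print(np.round(np.sort(mu[:, 0]), 1))  # centroids near 0 and 5
```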
Computational Complexity
Per iteration:
$$ O(n K d) $$
Total complexity:
$$ O(T n K d) $$
Where:
- $n$ = number of samples
- $K$ = number of clusters
- $d$ = dimension
- $T$ = number of iterations
Derivation of the Update Rule
We minimize:
$$ \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 $$
Take derivative:
$$ \frac{\partial}{\partial \mu_k} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 = -2 \sum_{x_i \in C_k} (x_i - \mu_k) $$
Set derivative to zero:
$$ \sum_{x_i \in C_k} (x_i - \mu_k) = 0 $$
Therefore:
$$ \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i $$
PCA
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a high-dimensional dataset into a lower-dimensional space while preserving as much variance as possible. It does this by finding the directions (principal components) along which the variance of the data is maximized. PCA can be used for tasks like data visualization, noise reduction, and feature extraction.
The reduced dimensions are called components; they are the eigenvectors of the covariance matrix of the data. The amount of variance preserved in the reduced space is determined by the eigenvalues corresponding to those eigenvectors.
From the data we first build the covariance matrix, then perform a singular value decomposition (SVD) of the covariance matrix to obtain the eigenvalues and eigenvectors. We then select the top k eigenvectors corresponding to the largest eigenvalues to form the projection matrix.
There are two modes for PCA in SageMaker:
- Regular: For sparse data and a moderate number of observations and features.
- Randomized: For a large number of observations and features; uses an approximation algorithm.
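The covariance-plus-decomposition procedure described above, as a short numpy sketch (using an eigendecomposition of the covariance matrix, which is equivalent here to the SVD route):

```python
import numpy as np

# PCA: covariance matrix -> eigendecomposition -> project onto top-k
# eigenvectors. The data has most of its variance in the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)                     # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]                   # projection matrix (top-k)
Z = Xc @ W                                  # reduced data, shape (500, 2)

explained = eigvals[order[:k]].sum() / eigvals.sum()
print(Z.shape, round(explained, 3))         # most variance is preserved
```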
Factorization Machines
Factorization Machines (FM) is a supervised learning algorithm that can model interactions between features in high-dimensional sparse datasets. It is particularly effective for tasks like recommendation systems, click prediction, and regression problems.
The Challenge of Feature Interactions
In many real-world prediction tasks (recommender systems, click-through rate prediction), the input features are:
- High-dimensional: Millions of features after one-hot encoding
- Sparse: Each example has only a few non-zero values
- Rich in interactions: The prediction depends on combinations of features (e.g., user ID × item ID, ad category × page context)
The Problem with Linear Models
A standard linear regression model:
$$ \hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i $$
can only capture the independent effect of each feature. It cannot model that “user A liking item B” is a specific interaction effect.
The Problem with Explicit Interaction Terms
Adding all pairwise interactions naively:
$$ \hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=i+1}^{d} w_{ij} x_i x_j $$
requires $O(d^2)$ parameters. For $d = 1,000,000$ features, this is $10^{12}$ parameters—impossible to estimate, especially since most feature pairs never co-occur in sparse data.
2. The FM Solution: Factorized Interactions
Factorization Machines address this by factorizing the interaction weight $w_{ij}$ into the dot product of two latent vectors $\mathbf{v}_i$ and $\mathbf{v}_j$:
$$ \hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j $$
Components
| Symbol | Description | Dimension |
|---|---|---|
| $w_0$ | Global bias | $\mathbb{R}$ |
| $\mathbf{w}$ | Linear weights | $\mathbb{R}^d$ |
| $\mathbf{V}$ | Embedding matrix | $\mathbb{R}^{d \times k}$ |
| $\mathbf{v}_i$ | Latent vector for feature $i$ | $\mathbb{R}^k$ |
| $k$ | Embedding dimension | $k \ll d$ (typically 10-100) |
Inner Product Definition
$$ \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f} $$
Key Insight
Instead of learning a unique parameter $w_{ij}$ for each pair (which requires seeing that pair in training), FM learns embeddings $\mathbf{v}_i$ for each feature. The interaction between features $i$ and $j$ is computed as the dot product of their embeddings.
Benefits:
- Parameter reduction: From $O(d^2)$ to $O(dk)$ parameters
- Generalization: Even if features $i$ and $j$ never co-occurred in training, their embeddings can still produce meaningful interactions
3. The Computational Trick: Linear Complexity
Computing all pairwise interactions naively is $O(d^2 k)$—too expensive for high-dimensional sparse data.
The Solution
FM uses an algebraic reformulation to reduce complexity to $O(dk)$:
$$ \sum_{i=1}^{d}\sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left(\sum_{i=1}^{d} v_{i,f} x_i\right)^2 - \sum_{i=1}^{d} v_{i,f}^2 x_i^2 \right] $$
Why This Works
- Single pass per dimension: The term $\sum_{i=1}^{d} v_{i,f} x_i$ can be computed once per dimension $f$
- Leverages sparsity: For sparse inputs, only non-zero $x_i$ need to be processed
- Linear complexity: The complexity becomes linear in the number of non-zero features
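A quick numerical check that the reformulation matches the naive double loop on random data:

```python
import numpy as np

# Verify the O(dk) reformulation against the naive O(d^2 k) pairwise sum.
rng = np.random.default_rng(0)
d, k = 20, 4
x = rng.normal(size=d)
V = rng.normal(size=(d, k))

# Naive double loop over feature pairs
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(d) for j in range(i + 1, d))

# Reformulated version: one pass per latent dimension
s = V.T @ x                                  # sum_i v_{i,f} x_i, per f
fast = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))

print(np.isclose(naive, fast))  # → True
```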
Algorithm Steps
```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction in O(n*k), where n is the number of non-zero features."""
    nz = np.nonzero(x)[0]                      # indices of non-zero features
    result = w0 + w[nz] @ x[nz]                # linear part
    # Interaction part: one pass per latent dimension f
    for f in range(V.shape[1]):
        sum_vf_x = np.dot(V[nz, f], x[nz])
        sum_vf2_x2 = np.dot(V[nz, f] ** 2, x[nz] ** 2)
        result += 0.5 * (sum_vf_x ** 2 - sum_vf2_x2)
    return result
```