There are many challenges in large-scale machine learning; here are some of the main ones:

  1. Data management: Handling and processing large volumes of data efficiently is a significant challenge. This covers data storage, retrieval, and preprocessing: extracting features from the raw data and feeding them to the model.
  2. Model complexity: As the size of the data increases, the complexity of the model may also need to increase to capture the underlying patterns. Nowadays, models can reach hundreds of billions of parameters, which leads to longer training times and increased computational resources.
  3. Measurement and evaluation: Evaluating the performance of large-scale machine learning models can be challenging, especially when dealing with imbalanced datasets or when the model is deployed in a real-world setting. It is important to have robust evaluation metrics and techniques to ensure that the model performs well and contributes to business value.
  4. Scalability: Ensuring that the machine learning system can scale effectively during the inference phase is crucial. This involves optimizing the model for efficient inference, managing resources effectively, and ensuring that the system can handle a large number of requests without significant latency.

Data Ingestion

Streaming Ingestion: This involves continuously collecting and processing data in real time. It is essential for applications that require up-to-date information, such as recommendation systems or fraud detection. Three common types of streaming ingestion are:

Clickstream Ingestion

These are the high-frequency events generated by users interacting with a website, application, or voice assistant. This data can be used for various purposes, such as analyzing user behavior, personalizing content, and improving user experience. The event rate can potentially reach millions of events per second. We often capture information such as the following (a sample event record is sketched after the list):

  • Timestamp: The time at which the event occurred.
  • User agent: Information about the user’s device and browser.
  • IP address: The user’s IP address, which can provide insights into their location and network.
  • User ID: A unique identifier if the user is logged in.
  • Request details: which endpoints were called and with what parameters.
  • Session ID: usually stored in a cookie; helps tie clickstream entries together into a single session.
  • Referrer: how the user arrived at the website (e.g., search engine, social media, direct link).
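
To make the shape of such an event concrete, here is a minimal sketch of a single clickstream record in Python. The field names and values are illustrative, not a standard schema.

```python
# A minimal sketch of one clickstream event; fields are illustrative.
import json
from datetime import datetime, timezone

event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "ip_address": "203.0.113.42",                    # documentation-range IP
    "user_id": "u-12345",                            # present only if logged in
    "request": {"method": "GET", "path": "/products/987"},
    "session_id": "sess-abc123",                     # usually read from a cookie
    "referrer": "https://www.google.com/",
}

print(json.dumps(event))
```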

Tools for clickstream ingestion include Apache Kafka, Amazon Kinesis, and Google Cloud Pub/Sub. These tools can handle high volumes of data and provide real-time processing capabilities.
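
As a rough sketch of what publishing looks like with one of these tools, the snippet below uses the kafka-python client. The broker address and the topic name "clickstream" are assumptions for illustration, not part of any standard setup.

```python
# A minimal sketch of publishing clickstream events to Kafka with the
# kafka-python client; broker address and topic name are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each event dict to JSON bytes before sending.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_click(event: dict) -> None:
    # send() is asynchronous; the client batches records for throughput.
    producer.send("clickstream", value=event)

publish_click({"user_id": "u-12345", "path": "/products/987"})
producer.flush()  # block until queued records are delivered
```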

```mermaid
flowchart LR
    subgraph Producers
        P1[Clickstream Logs]
        P2[Clickstream Logs]
        P3[Clickstream Logs]
    end

    subgraph Kafka or Kinesis
        B[Broker Cluster]
    end

    subgraph Consumers
        C1[Storage]
        C2[Storage]
        C3[Storage]
    end

    P1 --> B
    P2 --> B
    P3 --> B

    B --> C1
    B --> C2
    B --> C3
```
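
The consumer side of the diagram might look like the following sketch, which reads events from the broker and appends them to storage. The topic, broker, group id, and output path are all illustrative.

```python
# A minimal sketch of a consumer that drains clickstream events from
# Kafka into storage; all names here are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="storage-writer",   # consumers in one group split the partitions
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Append each event to a local file standing in for durable storage.
with open("clickstream.jsonl", "a") as sink:
    for record in consumer:
        sink.write(json.dumps(record.value) + "\n")
```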

In the broker cluster, we often use partitioning to distribute the data across multiple brokers. The cluster also relies on ZooKeeper, a coordination service that manages cluster metadata and partition leadership, tracking the state of the brokers and ensuring high availability. In modern Kafka deployments, ZooKeeper is replaced by KRaft mode, where Kafka manages its own metadata using Raft consensus.
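
Partitioning is usually driven by a record key: records with the same key hash to the same partition, which preserves per-key ordering. Below is a sketch of this with the kafka-python client; the topic name, broker address, and user id are assumptions for illustration.

```python
# A sketch of keyed publishing: records sharing a key land on the same
# partition, so per-user ordering is preserved. Names are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# All events for user u-12345 hash to one partition, so a downstream
# consumer sees this user's clicks in order.
for path in ["/home", "/products/987", "/cart"]:
    producer.send("clickstream", key="u-12345", value={"path": path})

producer.flush()
```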

Change Data Capture (CDC)

Live Video Ingestion