The cost of failure is education - Devin Carraway.

Here is a good SRE postmortem template: https://sre.google/sre-book/postmortem-culture/

These are my notes on postmortem cases that I have run into during my career. I will keep updating this post as I encounter more.

Data Guardrail Overheat

On June 12, 2026, it was a normal Friday afternoon and my real-time data quality guardrail system was running well. The guardrail is a real-time, rule-based distributed system that monitors data collected by humans or LLMs; if the data does not pass the rules, it cannot be stored in the database. It is a critical component of my data pipeline, designed to prevent bad data from entering the system.

Suddenly some users reported that they could not store data via our platform and that they saw the error message “Data Validation Failed”. I thought it was a normal case, since usually the user has entered some bad data without realizing it. I started to investigate so that I could give the user an explanation, and then more messages came in from other users, all reporting the same issue. I had around 50 people working on the data collection task, and 20% of them reported the same issue. I realized that this was not a normal case and that I needed to investigate further.

Here is the architecture of my guardrail system:

graph TB
    subgraph CerberusLayer["Guardrail Wrapper"]
        CW["User triggers<br/>the run of rules"]
    end
    
    subgraph Config["External Config"]
        ConfigDB["Configuration and Rules<br/>(Database)"]
    end
    
    subgraph RuleInfra["Rule Infra - AWS"]
        Orch["orchestrators"]
        S3["Intermediate data<br/>storage<br/>S3"]
        Lambda["rules<br/>lambda aws"]
        StepFunc["Parallelism<br/>stepfunction"]
        DDB["Execution result<br/>DynamoDB"]
        
        Orch -->|trigger| StepFunc
        Orch -->|input write| S3
        Orch -->|output read| S3
        Orch --> DDB
        
        Lambda -->|output write| S3
        Lambda -->|input read| S3
        Lambda --> DDB
        
        StepFunc -->|coordinate| Lambda
    end
    
    CW -->|read config| ConfigDB
    CW -->|read rules| ConfigDB
    CW -->|Request to run the rule| Orch

The parallelism Step Function I use here is an Express Step Function with a Parallel state. I designed it that way so that the Step Function handles the parallelism and the orchestration of the rules, while the Lambda function handles the execution of the rules, which gives better retry handling and observability.

Express Workflows are ideal for high-volume, event-processing workloads such as IoT data ingestion, streaming data processing and transformation, and mobile application backends. Express Workflows can support an execution start rate of over 100K executions per second. They can run for up to five minutes.

If we suppose each user sends 1 request every 2 seconds and we have 50 users, that means we have 25 requests per second. If each request triggers 50 rules, that means we have 1,250 rules to run per second. The bottleneck should not be the Step Function, but the Lambda function. The Lambda function has a default concurrency limit of 1,000, which means that if more than 1,000 rule executions are in flight at the same time, some of them will be throttled and will not be able to run.

But when I checked the CloudWatch metrics, I found that the Lambda function was not throttled and that only some of the runs had failed. I checked the CloudWatch logs and found that the Lambda function was running well and that the error message was “The request was rejected because the rule execution took too long to complete.” So throttling was not the explanation, and I needed to investigate further.
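The estimate above can be turned into a small calculation. This is only a sketch: the function names are made up for illustration, and the average rule durations are assumed values, not measurements from the incident. It uses Little's law (in-flight executions = arrival rate × average duration) to show that whether the 1,000 concurrency limit is reached depends on how long each rule runs, not only on the rate.

```python
def requests_per_second(users: int, seconds_between_requests: float) -> float:
    """Total request rate when every user sends one request per interval."""
    return users / seconds_between_requests


def rule_invocations_per_second(request_rate: float, rules_per_request: int) -> float:
    """Each request fans out into a fixed number of rule executions."""
    return request_rate * rules_per_request


def required_concurrency(invocation_rate: float, avg_duration_seconds: float) -> float:
    """Little's law: average in-flight executions = rate * average duration."""
    return invocation_rate * avg_duration_seconds


request_rate = requests_per_second(users=50, seconds_between_requests=2.0)
rule_rate = rule_invocations_per_second(request_rate, rules_per_request=50)
print(f"{request_rate} requests/s, {rule_rate} rule executions/s")

# Assumed average rule durations, for illustration only.
for duration in (0.5, 1.0, 2.0):
    in_flight = required_concurrency(rule_rate, duration)
    status = "over" if in_flight > 1000 else "within"
    print(f"avg {duration}s per rule -> {in_flight} in flight ({status} a 1000 limit)")
```

With these assumed durations, the same 1,250 executions per second stay within the limit at 0.5 s per rule and exceed it at 1 s or more, which is why slow rules matter as much as request volume.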

Bad Observability

On the backend, we use Lightstep to trace requests in real time, and I noticed that around 17:30 UTC the latency of the validation rule endpoint started to climb, from around 1,000 ms to several minutes, which made the save process wait too long and eventually time out.

I logged everything in Lightstep except the unique call ID of each call to my guardrail system, so even though I could see the latency increase, I could not see which rules and which payloads were causing it. All I could do was note down the timestamps of some high-latency data points, go to the Kibana dashboard where we keep the guardrail system logs, and try to filter by time range. The nightmare was that, because the guardrail system is distributed, the rule logs from different runs end up in the same log stream, so even with a precise timestamp I could not see the full logs of a single run. I failed to find the failing rules and the payloads that caused the high latency.

So at the beginning, we did not think that the latency came from the rules. We thought it came from the Step Function or the Lambda function itself, and we tried to increase the Lambda function capacity to handle more requests.
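To make the missing piece concrete, here is a minimal sketch of attaching one call ID to every log line of a guardrail run, using only the Python standard library. The names `run_guardrail`, `build_logger`, and the rule names are hypothetical, and this is not how my system or Lightstep is actually wired; it only shows the idea of one ID per run that can be filtered on later.

```python
import contextvars
import io
import logging
import uuid

# Holds the call ID of the guardrail run currently being processed.
call_id_var = contextvars.ContextVar("call_id", default="-")


class CallIdFilter(logging.Filter):
    """Copies the current call ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.call_id = call_id_var.get()
        return True


def build_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s call_id=%(call_id)s %(message)s")
    )
    handler.addFilter(CallIdFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def run_guardrail(rule_names, logger: logging.Logger) -> str:
    """Hypothetical entry point: one call ID per run, set before any rule logs."""
    call_id = str(uuid.uuid4())
    token = call_id_var.set(call_id)
    try:
        logger.info("guardrail run started")
        for rule_name in rule_names:
            logger.info("rule=%s evaluated", rule_name)
        return call_id
    finally:
        call_id_var.reset(token)


# Demo: capture the log lines in memory to show they all carry the same ID.
buffer = io.StringIO()
demo_logger = build_logger("guardrail.demo", buffer)
demo_call_id = run_guardrail(["no_duplicates", "schema_check"], demo_logger)
print(buffer.getvalue())
```

In a distributed setup the context variable does not cross process boundaries, so the same ID would also have to travel in the payload passed to each rule execution and be added as an attribute on the trace span.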

Root Cause Analysis

When we checked the CloudWatch metrics, we found that the Step Function and the Lambda function were both running well, but the latency was still high. We then manually checked the CloudWatch logs of the rules with complicated logic and found that some of them were taking too long to run, and that this was causing the high latency. Looking at those rules, we found that they were using data from our database to check for duplicates or to validate against tables.

It was at that moment that we realized the root cause might be the database itself. I stopped the collection task and deactivated the rules that use the database. Because I had designed a dedicated function for all rules that need data from the database, I could easily filter these rules out and deactivate them. After that, the latency and the save process went back to normal.

We found that when this internal database is called from a Lambda function, the database service uses a big queue to handle the requests, and when it is called from an internal service, it uses a small queue. So when many requests came from the Lambda function, they went through the big queue, and that caused the high latency. We rerouted the database service to use the small queue for these requests and allocated more resources to the database, and that solved the problem.

So I failed to find the root cause of the high latency at the beginning because I did not have enough observability on the request ID of the guardrail system, and I did not have enough knowledge of our internal database service and its limits.
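The reason the deactivation was quick can be sketched as follows. This is an illustration under assumptions, not my actual code: the registry, the `rule` decorator, and both example rules are hypothetical. The point is that every rule declares whether it depends on the database, so all such rules can be switched off in one step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]
    uses_database: bool = False
    active: bool = True


REGISTRY: Dict[str, Rule] = {}


def rule(name: str, uses_database: bool = False):
    """Registers a rule and records whether it needs the database."""
    def decorator(func: Callable[[dict], bool]) -> Callable[[dict], bool]:
        REGISTRY[name] = Rule(name=name, check=func, uses_database=uses_database)
        return func
    return decorator


@rule("non_empty_title")
def non_empty_title(record: dict) -> bool:
    return bool(record.get("title"))


@rule("no_duplicate_id", uses_database=True)
def no_duplicate_id(record: dict) -> bool:
    # Stand-in for a real database lookup (hypothetical data).
    known_ids = {"existing-1"}
    return record.get("id") not in known_ids


def deactivate_database_rules() -> List[str]:
    """Kill switch: turn off every active rule that depends on the database."""
    disabled = []
    for registered in REGISTRY.values():
        if registered.uses_database and registered.active:
            registered.active = False
            disabled.append(registered.name)
    return disabled


def run_active_rules(record: dict) -> Dict[str, bool]:
    return {r.name: r.check(record) for r in REGISTRY.values() if r.active}


record = {"id": "new-1", "title": "sample"}
before = run_active_rules(record)
disabled = deactivate_database_rules()
after = run_active_rules(record)
print(before, disabled, after)
```

The trade-off is that while the switch is off, duplicate and table-based checks are skipped entirely, so it only makes sense as a temporary mitigation.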

To improve the system:

  1. Add a unique call ID to all external calls and log it in Lightstep for better observability.
  2. Add alerting for high latency in the guardrail system and the database service.
  3. Add a cache to the data service to reduce the number of requests to the database service.
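For the cache in the last item, a minimal in-memory sketch with a time-to-live could look like this. The class `TTLCache`, the loader `load_known_ids`, and the 30-second TTL are all assumptions for illustration; a real deployment would more likely need a shared cache, since separate Lambda execution environments do not share memory.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Caches loader results per key and reloads them after ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: go to the backing service once and remember it.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Demo with a fake clock and a hypothetical loader standing in for the database.
database_calls = []


def load_known_ids(table: str) -> set:
    database_calls.append(table)
    return {"existing-1"}


fake_now = [0.0]
cache = TTLCache(load_known_ids, ttl_seconds=30.0, clock=lambda: fake_now[0])
cache.get("records")   # miss: loads from the "database"
cache.get("records")   # hit: served from memory
fake_now[0] = 31.0
cache.get("records")   # miss: the entry has expired
print(f"database calls={len(database_calls)} hits={cache.hits} misses={cache.misses}")
```

A cache like this trades freshness for fewer database calls: a duplicate check served from the cache can miss a record written within the TTL window, so the TTL has to be chosen with that in mind.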