CloudWatch

CloudWatch Metrics

CloudWatch provides metrics for every service in AWS. A metric is a variable to monitor (CPUUtilization, NetworkIn, NetworkOut, etc.) and each metric belongs to a namespace. A dimension is an attribute of a metric (InstanceId, AutoScalingGroupName, etc.), and we can have up to 30 dimensions per metric. Metrics have timestamps, and we can create CloudWatch dashboards of metrics. It is also possible to create CloudWatch custom metrics (for RAM monitoring of an EC2 instance, for example).
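As a sketch of publishing a custom metric, the PutMetricData payload could look like the dict below (namespace, metric name, and instance ID are hypothetical); with boto3 it would be passed to `cloudwatch.put_metric_data(**params)`:

```python
# Sketch of a PutMetricData request for a hypothetical custom RAM metric.
# With boto3 (not imported here) this dict would be passed as:
#   boto3.client("cloudwatch").put_metric_data(**params)
import datetime

params = {
    "Namespace": "MyApp",                   # custom namespace (assumed name)
    "MetricData": [{
        "MetricName": "MemoryUsedPercent",  # the variable to monitor
        "Dimensions": [                     # attributes of the metric (max 30)
            {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
        ],
        "Timestamp": datetime.datetime.now(datetime.timezone.utc),
        "Value": 62.5,
        "Unit": "Percent",
    }],
}

assert len(params["MetricData"][0]["Dimensions"]) <= 30  # dimension limit
```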

We can continually stream CloudWatch metrics to a destination of our choice, with near-real-time delivery and low latency.

The data streaming can be done with Kinesis Data Firehose. It is also possible to filter metrics so that only a subset of them is streamed to Firehose.

graph TD
    %% Define Nodes
    CW[CloudWatch Metrics]
    KDF[Kinesis Data Firehose]
    S3[Amazon S3]
    Redshift[Amazon Redshift]
    OS[Amazon OpenSearch]
    Athena[Athena]

    %% Define Connections
    CW -- "Stream near-real-time" --> KDF
    KDF --> S3
    KDF --> Redshift
    KDF --> OS
    S3 --> Athena
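The streaming setup above can be sketched as a PutMetricStream request that sends only selected namespaces to Firehose (the stream name, ARNs, and account ID below are placeholders):

```python
# Sketch of a PutMetricStream request that streams only a subset of
# namespaces to Kinesis Data Firehose. With boto3:
#   boto3.client("cloudwatch").put_metric_stream(**stream_params)
stream_params = {
    "Name": "demo-metric-stream",
    "FirehoseArn": "arn:aws:firehose:us-east-1:123456789012:deliverystream/demo",
    "RoleArn": "arn:aws:iam::123456789012:role/metric-stream-role",
    "OutputFormat": "json",
    # IncludeFilters restricts the stream to these namespaces only
    "IncludeFilters": [{"Namespace": "AWS/EC2"}, {"Namespace": "AWS/ELB"}],
}
```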

CloudWatch Logs

CloudWatch Logs is the best place to store application logs on AWS.

We first need to define a log group, which usually corresponds to an application or a service and has an arbitrary name. A log stream is an instance within the application: a log file, a container, etc. Then we can define a log expiration policy: never expire, or expire after any duration between 1 day and 10 years.
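The steps above can be sketched as two request payloads (the log group name is assumed); with boto3 they would go to `logs.create_log_group(...)` and `logs.put_retention_policy(...)`:

```python
# Sketch: create a log group and set an expiration (retention) policy.
log_group = {"logGroupName": "/my-app/prod"}            # assumed name
retention = {"logGroupName": "/my-app/prod", "retentionInDays": 30}

# CloudWatch Logs only accepts specific retention values (from 1 day up to
# 3653 days, i.e. 10 years); a few of the allowed values, for illustration:
some_allowed = {1, 3, 5, 7, 14, 30, 60, 90, 180, 365, 3653}
assert retention["retentionInDays"] in some_allowed
```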

CloudWatch Logs can send logs to:

  • Amazon S3 for long-term storage and analysis.
  • Kinesis Data Streams
  • Kinesis Data Firehose
  • AWS Lambda
  • OpenSearch

Logs are encrypted by default, and we can set up KMS-based encryption with our own keys.

CloudWatch Logs - Source

We have several options to send logs to CloudWatch Logs:

  • CloudWatch Logs Agent: software that can be installed on EC2 instances.
  • CloudWatch Unified Agent: software that can be installed on EC2 instances, on-premises servers, and virtual machines. It can collect both logs and metrics.
  • AWS SDKs and APIs: we can use AWS SDKs and APIs to send logs directly to CloudWatch Logs from our applications.
  • Elastic Beanstalk: collects logs from the application.
  • ECS: collects container logs.
  • Lambda: collects Lambda function logs.
  • VPC Flow Logs: VPC-specific logs.
  • API Gateway.
  • CloudTrail, based on filters.

CloudWatch Logs Insights

CloudWatch Logs Insights is a fully managed service that allows us to interactively search and analyze logs stored in CloudWatch Logs. We can apply filters and queries to extract insights from our log data.

It provides a purpose-built query language and automatically discovers fields from AWS services and JSON log events. We can fetch desired event fields, filter based on conditions, calculate aggregate statistics, sort events, limit the number of events, etc.

We can save queries for later use and add them to CloudWatch dashboards. We can query multiple log groups in different AWS accounts. Note that it is a query engine, not a real-time engine.
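A StartQuery request can be sketched as below; the query string uses the purpose-built query language (the log group name is a placeholder), and with boto3 it would be passed to `logs.start_query(**query_params)`:

```python
# Sketch of a Logs Insights StartQuery request: fetch fields, filter on a
# condition, sort, and limit the number of events.
import time

query_params = {
    "logGroupNames": ["/my-app/prod"],       # assumed log group
    "startTime": int(time.time()) - 3600,    # last hour (epoch seconds)
    "endTime": int(time.time()),
    "queryString": (
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| sort @timestamp desc "
        "| limit 20"
    ),
}
```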

S3 Export

Log data can take up to 12 hours to become available for export and the API call is CreateExportTask.
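A CreateExportTask request can be sketched as follows (bucket name, log group, and timestamps are placeholders); with boto3 it would be passed to `logs.create_export_task(**export_params)`:

```python
# Sketch of a CreateExportTask request exporting a log group to S3.
export_params = {
    "taskName": "export-to-s3",
    "logGroupName": "/my-app/prod",        # assumed log group
    "fromTime": 1700000000000,             # epoch milliseconds
    "to": 1700086400000,                   # one day later
    "destination": "my-log-archive-bucket",  # assumed S3 bucket
    "destinationPrefix": "exported-logs",
}
```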

CloudWatch Logs Subscriptions

This allows us to get real-time log events from CloudWatch Logs for processing and analysis. We can send logs to Kinesis Data Streams, Kinesis Data Firehose, AWS Lambda, or OpenSearch Service. We can also specify a filter to control which log events are delivered to the destination.

We can also do the Cross-Account Subscription to send log events to resources in a different AWS account.
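A subscription filter can be sketched as a PutSubscriptionFilter payload (log group, filter pattern, and ARNs are placeholders); with boto3 it would go to `logs.put_subscription_filter(**subscription)`:

```python
# Sketch of a PutSubscriptionFilter request delivering filtered log events
# to a Kinesis Data Stream in real time.
subscription = {
    "logGroupName": "/my-app/prod",    # assumed log group
    "filterName": "errors-only",
    "filterPattern": "ERROR",          # only matching events are delivered
    "destinationArn": "arn:aws:kinesis:us-east-1:123456789012:stream/log-stream",
    "roleArn": "arn:aws:iam::123456789012:role/cwl-to-kinesis",
}
```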

CloudWatch Logs for EC2

By default, no logs from the EC2 machine will go to CloudWatch and we need to run a CloudWatch agent on EC2 to push the log files.

We need to make sure the IAM permissions are correct. The CloudWatch agent can be installed on EC2 instances, on-premises servers, and virtual machines, and it can collect both logs and metrics.

The CloudWatch unified agent collects additional system-level metrics such as RAM, processes, etc. It collects logs to send to CloudWatch Logs, and we can use centralized configuration via the SSM Parameter Store.

The collected server metrics have very granular level of detail.

  • CPU: % user, % system, % idle, % steal, etc.
  • Memory: used, available, cached, etc.
  • Disk: free, used, total
  • Disk IO: read bytes, write bytes, read ops, write ops, etc.
  • Netstat: number of tcp and udp connections, net packets, bytes in and out, etc.
  • Processes: total, dead, blocked, idle, running, sleep, etc.
  • Swap Space: free, used, used percent, etc.
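A unified agent configuration covering some of the metrics above can be sketched as the JSON below (file paths and log group names are assumed); this is the document that could be stored centrally in the SSM Parameter Store:

```python
# Sketch of a CloudWatch unified agent configuration file, following the
# agent's "metrics_collected" / "logs_collected" sections.
import json

agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem":  {"measurement": ["mem_used_percent"]},
            "swap": {"measurement": ["swap_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["*"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {"file_path": "/var/log/app.log",   # assumed path
                     "log_group_name": "/my-app/prod"}  # assumed log group
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config)  # what would be stored in SSM
```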

CloudWatch Alarms

Alarms are used to trigger notifications for any metric. They have various options such as sampling, percentage, min, max, etc.

CloudWatch Alarms are on a single metric, while composite alarms monitor the states of multiple other alarms. We can use AND and OR conditions. This is helpful to reduce “alarm noise” by creating complex composite alarms.

Alarm states:

  • OK: The metric is within the defined threshold.
  • ALARM: The metric is outside the defined threshold.
  • INSUFFICIENT_DATA: There is not enough data to determine the state of the alarm.

Period:

  • The length of time to evaluate the metric against the threshold. It can be as short as 10 seconds, 30 seconds, or multiples of 60 seconds.

Alarms can also be created based on CloudWatch Logs metric filters. To test alarms and notifications, we can set the alarm state to ALARM using the CLI:

aws cloudwatch set-alarm-state --alarm-name "MyAlarm" --state-value ALARM --reason "Testing alarm state change"
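The alarm itself can be sketched as a PutMetricAlarm payload (names, threshold, and SNS ARN are illustrative); with boto3 it would be passed to `cloudwatch.put_metric_alarm(**alarm)`:

```python
# Sketch of a PutMetricAlarm request: alarm when average CPU stays above
# the threshold for 3 consecutive evaluation periods.
alarm = {
    "AlarmName": "MyAlarm",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 60,                  # seconds; 10, 30, or multiples of 60
    "EvaluationPeriods": 3,
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
}
assert alarm["Period"] in (10, 30) or alarm["Period"] % 60 == 0
```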

EC2 Instance Recovery

EC2 instance recovery is a feature that allows us to automatically recover an instance if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. We can create a CloudWatch alarm to monitor the instance’s status and trigger the recovery action when needed. After recovery, we keep the same private, public, and Elastic IP addresses, metadata, and placement group, and a message is sent to an SNS topic when the recovery action is triggered.

AWS X-Ray

Debugging in production used to be difficult: we needed to test everything locally, add log statements everywhere, and re-deploy to production. Log formats differ across applications using CloudWatch, and analytics is hard.

Debugging a monolithic system is easy, but a distributed system is much harder to debug and understand, since there is no common view of the entire architecture.

AWS X-Ray gives a visual analysis of our applications. X-Ray can trace requests as they travel through our application and provide a visual representation of the application’s architecture, including the interactions between different services and components. It can also help us:

  • Identify performance bottlenecks, errors, and other issues in our application.
  • Understand dependencies in a microservice architecture.
  • Pinpoint service issues.
  • Review request behavior.
  • Check time SLAs, where we are throttled, etc.

It is compatible with AWS Lambda, Elastic Beanstalk, ECS, ELB, API Gateway, EC2 instances, or any application server (even on-premises) using the AWS X-Ray SDKs.

AWS X-Ray Leverages Tracing

  • Tracing is an end-to-end way of following a “request”.
  • Each component dealing with the request adds its own “trace”.
  • Tracing is made of segments, and a segment is made of sub-segments.
  • Annotations can be added to traces to provide extra information. With tracing, we can trace every request or sample requests (a percentage of requests or a rate per minute).
  • X-Ray Security: We can use IAM for authorization and KMS for encryption at rest.
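The segment/sub-segment/annotation structure can be sketched as an X-Ray segment document, the JSON that the daemon forwards to the X-Ray service (service name, IDs, and timestamps below are placeholders):

```python
# Sketch of an X-Ray segment document with one subsegment and one annotation.
segment = {
    "name": "my-service",                               # assumed service name
    "trace_id": "1-6543210f-123456789012345678901234",  # 1-{epoch hex}-{24 hex}
    "id": "70de5b6f19ff9a0a",
    "start_time": 1_700_000_000.0,                      # epoch seconds
    "end_time": 1_700_000_000.5,
    "annotations": {"customer_tier": "premium"},        # extra info for filtering
    "subsegments": [{
        "name": "DynamoDB",                             # a downstream call
        "id": "1234567890abcdef",
        "start_time": 1_700_000_000.1,
        "end_time": 1_700_000_000.3,
    }],
}
```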

How to enable X-Ray

  1. Add the AWS X-Ray SDK to the code. The application SDK will then capture calls to AWS services, HTTP/HTTPS requests, database calls, and queue calls.
  2. Install the X-Ray daemon or enable the X-Ray AWS integration. The X-Ray daemon works as a low-level UDP packet interceptor. AWS Lambda and some other AWS services already run the X-Ray daemon. Each application must have the IAM rights to write data to X-Ray. The X-Ray daemon sends batches of trace data to the X-Ray service every second.

EC2 does not have X-Ray integration and we need to install the X-Ray daemon on EC2 instances and ensure the EC2 IAM Role has the proper permissions.

To enable X-Ray on AWS Lambda, we need to ensure it has an IAM execution role with the proper policy (AWSXrayWriteOnlyAccess).

Amazon QuickSight

This is the tool for business analytics and visualizations in the cloud. It allows all employees in an organization to build visualizations, perform ad-hoc analysis, and quickly get insights from data. We can access it anytime on any device (browser, mobile, etc.), and it is of course serverless.

We can connect it to:

  • Redshift
  • Aurora/RDS
  • Athena
  • EC2-hosted databases
  • Files (S3 or on-premises): CSV, Excel, TSV, Common or Extended Log Format.
  • AWS IoT Analytics
  • Data preparation allows limited ETL.

SPICE

SPICE is the abbreviation for Super-fast, Parallel, In-memory Calculation Engine. It uses columnar storage, in-memory processing, and machine code generation. It accelerates interactive queries on large datasets.

Each user gets 10 GB of SPICE. It is highly available and durable and scales to hundreds of thousands of users.

Use cases

  • It gives interactive ad-hoc exploration and visualization of data.
  • We can create dashboards and KPIs.
  • Analyze/visualize data from:
    • Logs in S3
    • On-premise databases
    • AWS (RDS, Redshift, Athena, S3)
    • SaaS applications, such as Salesforce.
    • Any JDBC/ODBC data source.

Machine Learning Insights

We can use it to:

  • Anomaly detection.
  • Auto-narratives.

QuickSight Dashboards

The Dashboards are interactive and we can share them with other users. We can also embed them in applications, portals, etc. It is possible to set up email reports and alerts based on thresholds.

  • AutoGraphs: It automatically selects the best visual for our data.
  • Bar Charts: For comparison and distribution (histograms).
  • Line graphs: For changes over time.
  • Scatter plots, heat maps: For correlation.
  • Pie graphs, tree maps: For aggregation and part-to-whole relationships.
  • Pivot tables: For tabular data and multi-dimensional analysis.
  • KPIs, Geospatial Charts, Donut Charts, Gauge Charts, Word Clouds, etc.

AWS CloudTrail

AWS CloudTrail provides governance, compliance, and audit for AWS accounts. CloudTrail is enabled by default. We can get a history of events/API calls made within the AWS account from the:

  • Console
  • SDK
  • CLI
  • AWS services

We can put logs from CloudTrail into CloudWatch Logs or S3. A trail can be applied to all Regions (default) or a single Region.

If a resource is deleted in AWS, we should investigate first the CloudTrail logs to understand who deleted the resource and why. We can also use CloudTrail to monitor API calls for security analysis, resource change tracking, and compliance auditing.

There are three types of events in CloudTrail:

  • Management events:
    • These are operations performed on resources in our AWS account, for example:
      • Configuring security (IAM AttachRolePolicy)
      • Configuring rules for routing data (Amazon EC2 CreateSubnet)
      • Setting up logging (AWS CloudTrail CreateTrail)
    • By default, trails are configured to log management events.
    • We can separate Read Events from Write Events (which modify resources).
  • Data events:
    • By default, data events are not logged (because they are high-volume operations).
    • Amazon S3 object-level activity (ex: GetObject, DeleteObject, PutObject): we can separate Read and Write Events.
    • AWS Lambda function execution activity (when someone uses the Invoke API).
  • CloudTrail Insights events:
    • Enable CloudTrail Insights to detect unusual activity in your account:
      • Inaccurate resource provisioning.
      • Hitting service limits.
      • Bursts of AWS IAM actions.
      • Gaps in periodic maintenance activity.
    • CloudTrail Insights analyzes normal management events to create a baseline and then continuously analyzes write events to detect unusual patterns.
      • Anomalies appear in the CloudTrail console.
      • An event is sent to Amazon S3.
      • An EventBridge event is generated (for automation needs).

Events are stored for 90 days in CloudTrail. We need to log them to S3 and use Athena to analyze them for longer retention.

AWS Config

AWS Config helps with auditing and recording compliance of AWS resources. It helps record configurations and changes over time. Questions like:

  • Is there unrestricted SSH access to my security groups?
  • Do my buckets have any public access?
  • How has my ALB configuration changed over time?

We can receive alerts (SNS notifications) for any changes. AWS Config is a per-region service and can be aggregated across regions and accounts. It is possible to store the configuration data into S3 and analyze it with Athena.

Config Rules

We can use AWS managed Config rules (over 75) and create custom Config rules (which must be defined in AWS Lambda). For example, we can evaluate:

  • if each EBS disk is of type gp2.
  • if each EC2 instance is t2.micro.

Rules can be evaluated and triggered for each config change or at regular time intervals.
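The core of a custom Config rule's Lambda can be sketched as the evaluation logic below (simplified; a real handler also reports results back via `config.put_evaluations`, and the field names are taken from the Config configuration item format):

```python
# Sketch of the evaluation logic inside a custom Config rule Lambda:
# check whether an EC2 instance is of the expected type (e.g. t2.micro).
def evaluate(configuration_item, expected_type="t2.micro"):
    """Return a Config compliance verdict for one configuration item."""
    if configuration_item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"          # rule only targets EC2 instances
    actual = configuration_item["configuration"]["instanceType"]
    return "COMPLIANT" if actual == expected_type else "NON_COMPLIANT"

# Illustrative configuration item, as delivered on each config change:
item = {"resourceType": "AWS::EC2::Instance",
        "configuration": {"instanceType": "t2.micro"}}
```

For example, `evaluate(item)` returns `"COMPLIANT"` here, while an `m5.large` instance would be `"NON_COMPLIANT"`.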

AWS Config Rules does not prevent actions from happening, but it can trigger alerts and notifications when a rule is violated. We can also use AWS Systems Manager Automation to automatically remediate non-compliant resources based on AWS Config Rules.

Config Resource

With Config Resource,

  • We can view compliance of a resource over time
  • View configuration of a resource over time.
  • We can also view the CloudTrail API calls of a resource over time.

Config Rules - Remediations

It can automate remediation of non-compliant resources using SSM Automation Documents and it can trigger Auto-Remediation action, for example, SSM Document: AWSConfigRemediation-RevokeUnusedIAMUserCredentials. This action will then revoke any unused credentials for an IAM user that is found to be non-compliant with the rule.

We can set Remediation Retries if the resource is still non-compliant after auto-remediation.

Config Rules - Notifications

  • Use EventBridge to trigger notifications when AWS resources are non-compliant.
  • Ability to send configuration changes and compliance state notifications to SNS (all events - use SNS Filtering or filter at client-side).

CloudWatch vs CloudTrail vs Config

  • CloudWatch:

    • Performance monitoring and dashboards.
    • Events and Alerting.
    • Log Aggregation and analysis.
  • CloudTrail:

    • Record API calls made with your Account by everyone.
    • Can define trails for specific resources.
    • Global Service.
  • Config:

    • Record configuration changes.
    • Evaluate resource against compliance rules.
    • Get timeline of changes and compliance.

An example of using these for an Elastic Load Balancer (ELB):

  • CloudWatch:
    • Monitoring Incoming connections metric.
    • Visualize error codes as a percentage over time.
    • Make a dashboard to get an idea of your load balancer performance.
  • Config:
    • Track security group rules for the Load Balancer.
    • Track configuration changes for the Load Balancer.
    • Ensure an SSL certificate is always assigned to the Load Balancer (compliance).
  • CloudTrail:
    • Track who made any changes to the Load Balancer with API calls.

AWS Budgets

We can create budgets and send alarms when costs exceed the budget. There are 4 types of budgets: Usage, Cost, Reservation, and Savings Plans.

  • For Reserved Instances (RI):
    • Track utilization.
    • Supports EC2, ElastiCache, RDS, Redshift.
  • We can have up to 5 SNS notifications per budget.
  • Can filter by: service, linked account, tag, purchase option, instance type, Region, Availability Zone, API operation, etc.
  • 2 budgets are free, then we pay $0.02 per additional budget per day.

AWS Cost Explorer

AWS Cost Explorer is a tool that allows us to visualize, understand, and manage AWS costs and usage over time. It creates custom reports that analyze cost and usage data.

We can analyze the data at a high level: total costs and usage across all accounts, with monthly, hourly, or resource-level granularity. It allows us to choose an optimal Savings Plan to lower the prices on our bill, and it can propose Savings Plans as alternatives to Reserved Instances. We can also forecast usage and costs up to 12 months ahead based on historical data.

AWS Trusted Advisor

We do not need to install anything; it gives us a high-level AWS account assessment. It checks for best practices, like: do we have Amazon EBS public snapshots, do we have Amazon RDS public snapshots, IAM use, etc.

The checks are grouped into 6 categories:

  • Cost Optimization
  • Performance
  • Security
  • Fault Tolerance
  • Service Limits
  • Operational Excellence

For the Business and Enterprise support plans, we also get:

  • Full Set of Checks.
  • Programmatic Access using AWS Support API.