REL 2: How do you proactively detect and maintain tenant health?

Managing the reliability of a SaaS environment requires operational tools that can detect issues that might impact the availability or experience of individual tenants. A resilient SaaS environment supports tenant-aware operations that enable proactive detection and resolution of tenant and system health issues.

Resources

Amazon CloudWatch: Observability of your AWS resources and applications on AWS and on-premises
AWS re:Invent 2019: Intuit: Moving from monitoring to observability using Amazon OpenSearch (ANT330)
Analyzing Log Data with CloudWatch Logs Insights
How to better monitor your custom application metrics using Amazon CloudWatch Agent
AWS X-Ray: Analyze and debug production, distributed applications

Best Practices

Add tenant context to application logs to reactively manage tenant health: Log files are enriched with tenant context and analyzed by operations teams to reactively identify and troubleshoot reliability issues. Tenant context is used to identify specific tenant activity that might be contributing to system or tenant stability or availability issues.
Proactively identify tenant issues with policies and alarms: Combine rich tenant insights with policies to proactively surface tenant issues before they impact the stability or availability of the environment. These policies might invoke self-healing strategies for individual tenants and surface alerts and alarms.
Introduce detailed tenant insights to enhance health forensics: Publish detailed tenant activity, consumption, performance, and error data to a centralized repository that can be used to analyze health issues impacting reliability. Use this data to identify challenging multi-tenant reliability events.

Improvement Plan

Add tenant context to application logs to reactively manage tenant health

Inject tenant context into application log files

Introduce a wrapper around your logging framework that can acquire the tenant context and inject this context into each log message. Include any tenant attributes that can assist with analyzing the tenant activity.
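
The wrapper described above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a prescribed implementation: the names `set_tenant_context`, `TenantContextFilter`, and `get_tenant_logger`, and the choice of `tenant_id` and `tier` as the injected attributes, are assumptions made for the example.

```python
import contextvars
import json
import logging

# Hypothetical per-request storage for the tenant context.
_tenant_ctx = contextvars.ContextVar(
    "tenant_ctx", default={"tenant_id": "unknown", "tier": "unknown"}
)


def set_tenant_context(tenant_id, tier):
    """Record the tenant for the current request, e.g. after validating a token."""
    return _tenant_ctx.set({"tenant_id": tenant_id, "tier": tier})


class TenantContextFilter(logging.Filter):
    """Copies the current tenant context onto every log record."""

    def filter(self, record):
        ctx = _tenant_ctx.get()
        record.tenant_id = ctx["tenant_id"]
        record.tier = ctx["tier"]
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emits one JSON object per line so log tools can filter on fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "tenant_id": record.tenant_id,
            "tier": record.tier,
        })


def get_tenant_logger(name, stream=None):
    """Return a logger whose output always carries tenant context."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(stream)  # defaults to stderr
        handler.addFilter(TenantContextFilter())
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

Application code would call `set_tenant_context(...)` once per request and then log through `get_tenant_logger(__name__)` as usual, so individual log statements do not need to pass tenant details explicitly.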

Use log analytics tools and the injected tenant context to analyze tenant activity and consumption trends. Use these insights to troubleshoot stability and reliability issues that might be impacting tenants or tiers.

Use CloudWatch Logs or Amazon OpenSearch Service to investigate and analyze tenant issues, creating views of activity that are constrained to a specific tenant or tier.
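
As one possible illustration of a tenant-constrained view, the sketch below builds a CloudWatch Logs Insights query string that returns recent errors for a single tenant. It assumes logs are written as JSON with `tenant_id`, `level`, and `message` fields; those field names and the helper `build_tenant_error_query` are hypothetical.

```python
import re

# Restrict tenant identifiers to a safe character set before embedding them
# in a query string. The allowed pattern is an assumption for this sketch.
_TENANT_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def build_tenant_error_query(tenant_id, limit=50):
    """Build a Logs Insights query for one tenant's most recent errors."""
    if not _TENANT_ID_PATTERN.match(tenant_id):
        raise ValueError(f"unexpected tenant id: {tenant_id!r}")
    return (
        "fields @timestamp, level, message\n"
        f'| filter tenant_id = "{tenant_id}" and level = "ERROR"\n'
        "| sort @timestamp desc\n"
        f"| limit {int(limit)}"
    )
```

The resulting string could be pasted into the Logs Insights console or passed as the `queryString` argument of the CloudWatch Logs `start_query` API. A cross-tenant view is also possible, for example by replacing the filter with `stats count(*) as errors by tenant_id`.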

Proactively identify tenant issues with policies and alarms

Enable operations to configure tenant alerts and alarms

Identify specific patterns of tenant consumption, activity, and SLA metrics that can be combined and used to proactively identify tenant health issues.
Configure alerts and alarms that are triggered when tenants reach specific health states or performance thresholds that might be a precursor to a reliability issue.
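
A tenant-level alarm might be sketched as follows. The function builds the arguments for the CloudWatch `put_metric_alarm` API without calling it. The namespace `SaaS/TenantHealth`, the metric name `RequestLatency`, the dimension names, and the per-tier thresholds are all assumptions chosen for illustration.

```python
# Hypothetical latency targets per tier, in milliseconds.
TIER_LATENCY_THRESHOLD_MS = {"basic": 2000, "premium": 500}


def build_tenant_latency_alarm(tenant_id, tier, sns_topic_arn):
    """Return keyword arguments for a per-tenant latency alarm."""
    return {
        "AlarmName": f"tenant-{tenant_id}-high-latency",
        "Namespace": "SaaS/TenantHealth",  # assumed custom namespace
        "MetricName": "RequestLatency",    # assumed custom metric
        # Dimensions must match those used when the metric is published.
        "Dimensions": [
            {"Name": "TenantId", "Value": tenant_id},
            {"Name": "Tier", "Value": tier},
        ],
        "Statistic": "Average",
        "Period": 60,             # evaluate one-minute datapoints
        "EvaluationPeriods": 5,   # alarm after five consecutive breaches
        "Threshold": float(TIER_LATENCY_THRESHOLD_MS[tier]),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed and credentials configured, the result could be applied with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Deriving the threshold from the tenant's tier is one way to tie alarms to differing SLAs.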

Apply self-healing strategies to address tenant reliability

Use automation to apply changes to address reliability and stability issues before they impact the tenant or system.
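
What such automation looks like depends heavily on the architecture. The sketch below shows only the general shape: a policy maps a tenant's health snapshot to a named remediation, and a dispatcher runs the matching action. The `TenantHealth` fields, the thresholds, and the action names `throttle_tenant` and `recycle_tenant_resources` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TenantHealth:
    tenant_id: str
    requests_per_second: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def choose_remediation(health, rps_limit, error_rate_limit=0.05):
    """Pick a remediation name for one tenant; thresholds are illustrative."""
    if health.requests_per_second > rps_limit:
        return "throttle_tenant"
    if health.error_rate > error_rate_limit:
        return "recycle_tenant_resources"
    return "none"


def apply_remediations(snapshots, rps_limit, actions):
    """Run the registered action for each unhealthy tenant; report what ran."""
    applied = {}
    for health in snapshots:
        remediation = choose_remediation(health, rps_limit)
        if remediation != "none":
            actions[remediation](health.tenant_id)
            applied[health.tenant_id] = remediation
    return applied
```

The `actions` argument is a dictionary from remediation name to a callable that takes a tenant ID. In practice those callables might adjust a throttling limit or restart tenant-specific resources, and the function could be invoked from an alarm notification.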

Introduce detailed tenant insights to enhance health forensics

Publish insights that enhance the visibility into tenant activity

Introduce detailed reliability metrics instrumentation that surfaces insights into issues that tenants or tiers of tenants are experiencing.
Add metrics that provide a view of tenant latency, potential bottlenecks, and feature consumption that allow operational teams to easily correlate performance or error conditions with specific tenant workflows.
Augment tenant custom metrics with AWS metrics to create a holistic view of tenant health.
Amazon CloudWatch: Publishing Custom Metrics
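
One way to attach tenant context to custom metrics is to publish each data point with tenant dimensions. The sketch below builds a request for the CloudWatch `put_metric_data` API without sending it. The namespace, metric names, and the `TenantId` and `Tier` dimension names are assumptions for the example.

```python
from datetime import datetime, timezone


def build_tenant_metric(tenant_id, tier, name, value,
                        unit="Milliseconds", timestamp=None):
    """Build one metric datum tagged with tenant dimensions."""
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": "TenantId", "Value": tenant_id},
            {"Name": "Tier", "Value": tier},
        ],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def build_put_metric_data_request(data, namespace="SaaS/TenantHealth"):
    """Wrap metric data in the keyword arguments put_metric_data expects."""
    return {"Namespace": namespace, "MetricData": list(data)}
```

With boto3 available, the request could be sent with `boto3.client("cloudwatch").put_metric_data(**request)`. Because CloudWatch treats each unique combination of dimensions as a separate metric, a per-tenant dimension is worth weighing against the number of tenants in the environment.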

Include tenant reliability metrics as part of the operational experience

Enable operational users to easily detect and troubleshoot reliability issues using dashboards and analytics tools. This capability can be built with AWS services (such as Amazon OpenSearch Service and Amazon QuickSight) or with third-party tools that analyze cloud metrics.