REL 2: How do you proactively detect and maintain tenant health?
Managing the reliability of a SaaS environment requires operational tools that can detect issues that might impact the availability or experience of individual tenants. A resilient SaaS environment supports tenant-aware operations that enable proactive detection and resolution of tenant and system health issues.
Resources
Amazon CloudWatch Observability of your AWS resources and applications on AWS and on-premises
AWS re:Invent 2019: Intuit: Moving from monitoring to observability using Amazon OpenSearch (ANT330)
Analyzing Log Data with CloudWatch Logs Insights
How to better monitor your custom application metrics using Amazon CloudWatch Agent
AWS X-Ray Analyze and debug production, distributed applications
Best Practices:
- Add tenant context to application logs to reactively manage tenant health: Log files are enriched with tenant context and analyzed by operations teams to reactively identify and troubleshoot reliability issues. Tenant context is used to identify specific tenant activity that might be contributing to system or tenant stability or availability issues.
- Proactively identify tenant issues with policies and alarms: Combine rich tenant insights with policies to proactively surface tenant issues before they impact the stability or availability of the environment. These policies might invoke self-healing strategies for individual tenants and raise alerts and alarms.
- Introduce detailed tenant insights to enhance health forensics: Publish detailed tenant activity, consumption, performance, and error data to a centralized repository that can be used to analyze health issues impacting reliability. Use this data to identify challenging multi-tenant reliability events.
Improvement Plan
Add tenant context to application logs to reactively manage tenant health
- Introduce a wrapper around your logging framework that can acquire the tenant context and inject this context into each log message. Include any tenant attributes that can assist with analyzing the tenant activity.
- Use CloudWatch Logs or Amazon OpenSearch Service to investigate and analyze tenant issues, creating views of activity that are constrained to a specific tenant or view.
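The logging wrapper described above could be sketched as follows, using only the Python standard library. This is a minimal illustration, not a prescribed implementation: the names `set_tenant_context`, `TenantContextFilter`, `JsonFormatter`, and `get_tenant_logger`, and the `tenant_id` and `tenant_tier` fields, are hypothetical, and how the tenant context is actually acquired (for example, from a token on the incoming request) is left to the application.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the tenant context of the current request.
_tenant_ctx = contextvars.ContextVar("tenant_ctx", default=None)


def set_tenant_context(tenant_id, tier):
    """Record the tenant context, e.g. after authenticating a request."""
    _tenant_ctx.set({"tenant_id": tenant_id, "tier": tier})


class TenantContextFilter(logging.Filter):
    """Attach the current tenant attributes to every log record."""

    def filter(self, record):
        ctx = _tenant_ctx.get() or {}
        record.tenant_id = ctx.get("tenant_id", "unknown")
        record.tenant_tier = ctx.get("tier", "unknown")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so log tools can filter on fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "tenant_id": record.tenant_id,
            "tenant_tier": record.tenant_tier,
        })


def get_tenant_logger(name, stream=None):
    """Return a logger whose output always carries tenant context."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(stream)  # defaults to stderr
        handler.setFormatter(JsonFormatter())
        handler.addFilter(TenantContextFilter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    set_tenant_context("tenant-42", "premium")
    get_tenant_logger("orders").info("order placed")
```

Because each entry is structured JSON, a tenant-scoped view can then be built by filtering on the injected fields. In CloudWatch Logs Insights, for instance, a query along the lines of `fields @timestamp, message | filter tenant_id = "tenant-42" | sort @timestamp desc` would constrain results to one tenant, assuming the field names used in this sketch.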
Proactively identify tenant issues with policies and alarms
- Identify specific patterns of tenant consumption, activity, and SLA metrics that can be combined and used to proactively identify tenant health issues.
- Configure alerts and alarms that are triggered when tenants reach specific health states or performance thresholds that might be a precursor to a reliability issue.
- Use automation to apply changes to address reliability and stability issues before they impact the tenant or system.
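As one possible illustration of a tenant-aware alarm, the sketch below builds the parameters for a CloudWatch `put_metric_alarm` call whose threshold depends on the tenant's tier. The namespace `SaaS/TenantHealth`, the metric name `RequestLatency`, the tier thresholds, and the SNS topic ARN are all assumptions made for the example. The function only constructs the request, so it can be reviewed without AWS credentials.

```python
# Hypothetical per-tier latency thresholds (milliseconds), e.g. from SLAs.
TIER_LATENCY_MS = {"basic": 1000, "premium": 300}


def build_tenant_latency_alarm(tenant_id, tier, sns_topic_arn):
    """Build put_metric_alarm parameters for one tenant's latency."""
    return {
        "AlarmName": f"tenant-latency-{tenant_id}",
        "Namespace": "SaaS/TenantHealth",   # assumed custom namespace
        "MetricName": "RequestLatency",     # assumed custom metric
        "Dimensions": [
            {"Name": "TenantId", "Value": tenant_id},
            {"Name": "Tier", "Value": tier},
        ],
        "Statistic": "Average",
        "Period": 60,                # seconds per evaluation window
        "EvaluationPeriods": 3,      # alarm after 3 consecutive breaches
        "Threshold": float(TIER_LATENCY_MS[tier]),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # idle tenants stay healthy
        "AlarmActions": [sns_topic_arn],     # notify or start remediation
    }


if __name__ == "__main__":
    params = build_tenant_latency_alarm(
        "tenant-42",
        "premium",
        "arn:aws:sns:us-east-1:123456789012:tenant-health",  # placeholder
    )
    # With boto3 installed and credentials configured, the alarm could
    # be created with:
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
    print(params["AlarmName"], params["Threshold"])
```

The topic listed in `AlarmActions` is where automation can attach: a subscriber to that topic could, for example, throttle or scale resources for the affected tenant before the issue spreads.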
Introduce detailed tenant insights to enhance health forensics
- Introduce detailed reliability metrics instrumentation that surfaces insights into issues that tenants or tiers of tenants are experiencing.
- Add metrics that provide a view of tenant latency, potential bottlenecks, and feature consumption, allowing operational teams to easily correlate performance or error conditions with specific tenant workflows.
- Augment tenant custom metrics with AWS metrics to create a holistic view of tenant health.
Amazon CloudWatch Publishing Custom Metrics
- Include tenant reliability metrics as part of the operational experience.
- Give operational users tools to easily detect and troubleshoot reliability issues. This capability can be provided with AWS services (Amazon OpenSearch Service, Amazon QuickSight, etc.) or with third-party tools that are used to analyze cloud metrics.
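The tenant metric instrumentation described in this section might look like the following sketch, which builds a custom metric entry in the shape accepted by the CloudWatch `put_metric_data` API. The helper name `build_tenant_metric`, the namespace, the metric name, and the `TenantId`, `Tier`, and `Feature` dimensions are illustrative assumptions. Per-tenant dimensions increase the number of distinct metrics, so the choice of dimensions is a design decision for each environment.

```python
from datetime import datetime, timezone


def build_tenant_metric(tenant_id, tier, name, value,
                        unit="Milliseconds", feature=None):
    """Build one MetricData entry tagged with tenant dimensions."""
    dimensions = [
        {"Name": "TenantId", "Value": tenant_id},
        {"Name": "Tier", "Value": tier},
    ]
    if feature is not None:
        # Optional: tie the measurement to a specific tenant workflow.
        dimensions.append({"Name": "Feature", "Value": feature})
    return {
        "MetricName": name,
        "Dimensions": dimensions,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


if __name__ == "__main__":
    datum = build_tenant_metric(
        "tenant-42", "premium", "RequestLatency", 182, feature="checkout"
    )
    # With boto3 installed and credentials configured, the datum could
    # be published with:
    #   boto3.client("cloudwatch").put_metric_data(
    #       Namespace="SaaS/TenantHealth", MetricData=[datum])
    print(datum["MetricName"], datum["Value"])
```

Metrics published this way can sit on the same dashboards as the AWS service metrics for the underlying resources, which is one way to approach the holistic tenant health view described above.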