REL 2: How do you proactively detect and maintain tenant health?

Managing the reliability of a SaaS environment requires operational tools that can detect issues that might impact the availability or experience of individual tenants. A resilient SaaS environment supports tenant-aware operations that enable proactive detection and resolution of tenant and system health issues.

Resources

Amazon CloudWatch: Observability of your AWS resources and applications on AWS and on-premises
AWS re:Invent 2019: Intuit: Moving from monitoring to observability using Amazon OpenSearch (ANT330)
Analyzing Log Data with CloudWatch Logs Insights
How to better monitor your custom application metrics using Amazon CloudWatch Agent
AWS X-Ray: Analyze and debug production, distributed applications

Best Practices:

Improvement Plan

Add tenant context to application logs to reactively manage tenant health

  • Inject tenant context into application log files
  • Use log analytics tools and the injected tenant context to analyze tenant activity and consumption trends. Use these insights to troubleshoot stability and reliability issues that might be impacting tenants or tiers.

Proactively identify tenant issues with policies and alarms

  • Enable operations to configure tenant alerts and alarms
  • Apply self-healing strategies to address tenant reliability
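
As a rough sketch of tenant alerts and alarms, the function below builds the parameters for one CloudWatch alarm per tenant, with a threshold that depends on the tenant's tier. The namespace SaaS/Tenants, the metric name ErrorCount, the TenantId dimension, the tier names, and the threshold values are all assumptions chosen for illustration. The function only builds a dictionary; the keys are the keyword arguments accepted by the CloudWatch PutMetricAlarm API.

```python
# Hypothetical per-tier thresholds: errors tolerated per 5-minute period.
TIER_ERROR_THRESHOLDS = {"basic": 50, "standard": 20, "premium": 5}


def tenant_error_alarm(tenant_id, tier, sns_topic_arn):
    """Builds PutMetricAlarm parameters for a single tenant.

    Namespace, metric name, and dimension name are assumptions; they must
    match whatever the application actually publishes.
    """
    return {
        "AlarmName": f"tenant-{tenant_id}-errors",
        "Namespace": "SaaS/Tenants",
        "MetricName": "ErrorCount",
        "Dimensions": [{"Name": "TenantId", "Value": tenant_id}],
        "Statistic": "Sum",
        "Period": 300,               # seconds
        "EvaluationPeriods": 2,      # must breach two periods in a row
        "Threshold": float(TIER_ERROR_THRESHOLDS[tier]),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # an idle tenant is not unhealthy
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    params = tenant_error_alarm(
        "tenant-42", "premium",
        "arn:aws:sns:us-east-1:123456789012:tenant-alerts",  # placeholder ARN
    )
    print(params["AlarmName"], params["Threshold"])
    # With boto3 installed and credentials configured, the alarm could be
    # created with: boto3.client("cloudwatch").put_metric_alarm(**params)
```

The alarm action is where a self-healing strategy could be attached, for example an SNS topic that triggers automation scoped to the affected tenant. Keeping thresholds in a per-tier table lets operations tune sensitivity without touching code for each tenant.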

Introduce detailed tenant insights to enhance health forensics

  • Publish insights that enhance the visibility into tenant activity
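
One possible way to publish tenant insights is the CloudWatch embedded metric format, in which a structured log line doubles as a metric data point. The sketch below builds such an event with the tenant identifier and tier as dimensions. The namespace SaaS/Tenants, the dimension names TenantId and TenantTier, and the function name tenant_metric_event are assumptions for the example.

```python
import json
import time


def tenant_metric_event(tenant_id, tier, metric_name, value,
                        unit="Milliseconds", namespace="SaaS/Tenants"):
    """Builds one embedded-metric-format event carrying tenant context.

    Two dimension sets are declared, so the same value is recorded once
    per tenant and once per tier. Names here are illustrative assumptions.
    """
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["TenantId"], ["TenantTier"]],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        # Dimension values and the metric value live at the top level.
        "TenantId": tenant_id,
        "TenantTier": tier,
        metric_name: value,
    }


if __name__ == "__main__":
    event = tenant_metric_event("tenant-42", "premium", "CheckoutLatency", 187)
    # Writing the JSON line to a log stream that CloudWatch ingests is what
    # turns it into a metric; here it is only printed.
    print(json.dumps(event))
```

A per-tenant dimension gives the finest view for forensics but creates one metric per tenant. With a large tenant population it may be preferable to keep only the tier as a dimension and leave the tenant identifier as a plain log field.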