May brought CloudWatch Investigations, which have fundamentally changed someone’s on-call rotation. Instead of manual log correlation, this service uses AI to perform Automated Root Cause Analysis (RCA).
When an alarm triggers, Investigations automatically traces the error. It correlates metric spikes with concurrent events—like a specific Git commit, a Terraform apply, or an RDS parameter change. Instead of a dashboard showing “500 Errors,” you get a report saying: “The latency spike in Service A was caused by a configuration change in Service B that triggered a connection leak in RDS.”
Combined with CloudWatch RUM Session Replay, you can now visually reproduce the user’s journey leading up to a crash. It’s effectively a “Time Machine” for your infrastructure. If you’re still doing manual log-diving during incidents, May gave you back your weekends.