Spring Boot microservices run in an imperfect world where outages are always around the corner. Recording, triaging, tracking, and measuring incidents in this type of architecture is time-consuming and expensive. Our SRE/DevOps team designed a framework that uses Chaos Engineering to test resilience patterns such as circuit breakers, bulkheads, and health actuators. This talk presents our journey and the challenges we faced in designing the framework and automating a chaos gameday toolbox using Chaos Monkey for Spring Boot with Pulumi.
Approach and methodology
First, the talk describes the foundations of resilience patterns and their implementations in Spring architectures. Second, it shows how these patterns are used to manage high-severity incidents and to practice recording, triaging, tracking, and assigning business value to problems that affect critical systems. Third, it presents a framework, based on chaos engineering and the scientific method, for managing these incidents.

SRE & Professor, Universidad Nacional de Colombia
Track: Cloud Native Platforms