AWS automation cuts infrastructure discovery from weeks to 2-4 hours

AWS's five-layer AI-powered resilience framework maps dependencies in hours, generates targeted chaos experiments without specialist expertise, and embeds continuous validation into CI/CD pipelines.

Discovery of infrastructure dependencies went from a weeks-long manual slog to a 2-4 hour automated scan, according to a new AWS architecture guide released June 22.

That figure, 2-4 hours for the initial mapping of a single-account environment with thousands of resources, is the headline claim of the five-layer AI-powered resilience framework AWS just detailed. Subsequent runs process only the changes tracked by AWS Config, so the map stays current without a full re-scan.

The five layers: from discovery to continuous validation

Layer 1 (Discovery) combines two data sources. The next generation of AWS Resilience Hub natively discovers AWS services, internal endpoints, and third-party endpoints. A custom agent on Amazon Bedrock AgentCore extends that with code-level analysis: it scans repositories for hard-coded dependencies, connection strings, timeout configurations, and retry logic that infrastructure-level discovery alone misses. The agent runs in AgentCore Runtime, which provides MicroVM session isolation and supports sessions up to eight hours.
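To make the code-level half of Layer 1 concrete, here is a minimal sketch of the kind of repository scan described above. It is not the AWS agent: the regular expressions, the `PATTERNS` table, and the `scan_source` function are all illustrative assumptions, and a real agent would rely on much richer analysis than line-by-line pattern matching.

```python
import re

# Hypothetical patterns for illustration only; not taken from the AWS guide.
PATTERNS = {
    "endpoint": re.compile(r"https?://[^\s\"']+"),
    "connection_string": re.compile(
        r"\b(?:postgres(?:ql)?|mysql|redis|mongodb)://[^\s\"']+"
    ),
    "timeout": re.compile(r"\btimeout\w*\s*[=:]\s*\d+(?:\.\d+)?", re.IGNORECASE),
    "retry": re.compile(r"\b(?:max_)?retr(?:y|ies)\w*\s*[=:]\s*\d+", re.IGNORECASE),
}


def scan_source(text):
    """Return (line_number, kind, matched_text) for each suspected dependency."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((lineno, kind, match.group(0)))
    return findings


# Made-up application snippet used only to exercise the scanner.
SAMPLE = '''
DB_URL = "postgresql://orders-db.internal:5432/orders"
resp = requests.get("https://api.example.com/v1/rates", timeout=30)
MAX_RETRIES = 0
'''

if __name__ == "__main__":
    for finding in scan_source(SAMPLE):
        print(finding)
```

Findings like a retry count of zero next to a database connection string are the sort of signal that infrastructure-level discovery cannot see, which is why the guide pairs the two sources.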

Layer 2 (Test Generation) converts the discovered dependency map into executable AWS Fault Injection Service experiment templates. Each hypothesis gets a business impact score based on application tier definitions in Resilience Hub, architectural patterns (internet-facing load balancers, API Gateway endpoints), and resource tags. For example, the system detects when an application uses Amazon RDS Multi-AZ but lacks proper connection retry handling, then designs a database failover test that validates the actual recovery mechanism rather than injecting generic network disruption.
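The guide names the inputs to the business impact score but, as summarized here, not the formula. The sketch below shows one way the three inputs could combine. The point values in `TIER_POINTS`, the bonus sizes, and the `environment` tag key are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed point values per application tier; the tier labels are illustrative
# and the numbers are not from AWS.
TIER_POINTS = {
    "MissionCritical": 70,
    "Critical": 55,
    "Important": 35,
    "CoreServices": 25,
    "NonCritical": 10,
}


@dataclass
class Hypothesis:
    name: str
    tier: str
    internet_facing: bool = False  # e.g. public load balancer or API Gateway
    tags: dict = field(default_factory=dict)


def impact_score(h):
    """Score a failure hypothesis from 0 to 100 (hypothetical weighting)."""
    score = TIER_POINTS.get(h.tier, 5)  # unknown tiers get a small default
    if h.internet_facing:
        score += 20  # customer-visible blast radius
    if h.tags.get("environment") == "production":
        score += 10
    return score


if __name__ == "__main__":
    candidates = [
        Hypothesis("rds-failover", "MissionCritical", True,
                   {"environment": "production"}),
        Hypothesis("batch-worker-loss", "NonCritical"),
    ]
    for h in sorted(candidates, key=impact_score, reverse=True):
        print(h.name, impact_score(h))
```

A ranking like this would let the generator spend its experiment budget on the hypotheses with the largest potential business impact first.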

Layer 3 (Experimentation) runs those tests with progressive scope expansion, starting at 1% of resources and then expanding to 5%, 10%, and 25% as risk tolerance allows. Amazon CloudWatch alarms serve as stop conditions that halt experiments before they violate SLAs. AWS recommends setting alarm thresholds well below SLA limits: if your SLA allows a 1% error rate, configure stop conditions at 0.1%.
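A rough sketch of how the scope ladder and the stop-condition margin might be wired together follows. The dictionary mirrors the request shape of the Fault Injection Service experiment template API as I understand it, so check field names against the current API reference before use. The `chaos-ready` tag, the choice of an EC2 stop action, the ARNs, and the helper names are all hypothetical.

```python
SCOPE_STEPS = [1, 5, 10, 25]  # percent of resources, per the article


def stop_threshold(sla_error_rate_pct, safety_factor=10):
    """Alarm threshold set well below the SLA limit (1% SLA -> 0.1%)."""
    return sla_error_rate_pct / safety_factor


def next_scope(current, passed):
    """Advance one step only after a clean run; otherwise hold the scope."""
    if not passed:
        return current
    idx = SCOPE_STEPS.index(current)
    return SCOPE_STEPS[min(idx + 1, len(SCOPE_STEPS) - 1)]


def build_template(percent, alarm_arn, role_arn):
    """Build an experiment template dict (assumed shape, not sent anywhere)."""
    return {
        "description": f"Stop {percent}% of tagged instances",
        "roleArn": role_arn,
        # The threshold lives on the CloudWatch alarm itself; the template
        # only points at the alarm that should halt the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "fleet": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "stop": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "fleet"},
            }
        },
    }


if __name__ == "__main__":
    template = build_template(
        SCOPE_STEPS[0],
        "arn:aws:cloudwatch:us-east-1:111122223333:alarm:error-rate",  # placeholder
        "arn:aws:iam::111122223333:role/fis-experiment-role",  # placeholder
    )
    print(template["targets"]["fleet"]["selectionMode"])
    print(stop_threshold(1.0))
```

Holding the scope after a failed run, rather than advancing, is a design choice made for this sketch; a stricter policy could drop back to 1%.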

Layer 4 (Gap Analysis) correlates experiment outcomes with resilience policies. Each gap gets a priority score based on severity, likelihood, and business impact.
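The three factors behind the priority score are stated, but the way they combine is not. As a minimal sketch, assume each factor is rated on a 1-5 scale and the score is their product. Both the scale and the multiplication are assumptions, and the example gaps are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    description: str
    severity: int         # 1 (minor) to 5 (outage), assumed scale
    likelihood: int       # 1 (rare) to 5 (frequent), assumed scale
    business_impact: int  # 1 (internal) to 5 (revenue-critical), assumed scale

    @property
    def priority(self):
        # Hypothetical formula: a simple product of the three factors.
        return self.severity * self.likelihood * self.business_impact


def rank(gaps):
    """Return gaps ordered from highest to lowest priority."""
    return sorted(gaps, key=lambda g: g.priority, reverse=True)


if __name__ == "__main__":
    gaps = [
        Gap("No retry on database failover", 4, 3, 5),
        Gap("Single-AZ cache cluster", 3, 2, 3),
        Gap("Missing health check on internal service", 2, 4, 2),
    ]
    for gap in rank(gaps):
        print(gap.priority, gap.description)
```

A product pushes gaps that score high on all three factors to the top; a weighted sum would be a reasonable alternative if one factor should dominate.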

Layer 5 (Continuous Validation) bakes resilience testing into CI/CD pipelines in three tiers. Every commit triggers a lightweight policy-as-code check (using Open Policy Agent on Infrastructure as Code) that runs in seconds, catching missing health checks or single-AZ deployments before code reaches staging. Every deployment runs 3-5 critical experiments (database failover, AZ loss, circuit breaker activation), adding roughly 2-3 minutes per pipeline run. Full resilience assessments (15-20 experiments taking 15-45 minutes) run as a pre-production gate only on significant architectural changes.
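The guide uses Open Policy Agent for the per-commit check, and OPA policies are written in Rego. The sketch below is a plain-Python stand-in that shows the same logic, not OPA itself. The resource dictionary layout, the rule wording, and the `select_stage` event names are hypothetical.

```python
def check_resources(resources):
    """Flag single-AZ databases and target groups without health checks."""
    violations = []
    for res in resources:
        rtype = res["type"]
        name = res["name"]
        props = res.get("properties", {})
        if rtype == "database" and not props.get("multi_az", False):
            violations.append(f"{name}: single-AZ database")
        if rtype == "target_group" and "health_check" not in props:
            violations.append(f"{name}: missing health check")
    return violations


def select_stage(event):
    """Map a pipeline event to the validation tier described in the article."""
    stages = {
        "commit": "policy-check",                      # runs in seconds
        "deployment": "critical-experiments",          # 3-5 experiments
        "architectural_change": "full-assessment",     # 15-20 experiments
    }
    return stages[event]


if __name__ == "__main__":
    # Hypothetical, simplified view of parsed Infrastructure as Code.
    parsed_iac = [
        {"type": "database", "name": "orders-db",
         "properties": {"multi_az": False}},
        {"type": "target_group", "name": "web-tg", "properties": {}},
        {"type": "database", "name": "users-db",
         "properties": {"multi_az": True}},
    ]
    problems = check_resources(parsed_iac)
    for problem in problems:
        print("FAIL", problem)
    # A CI job would typically exit non-zero when problems is non-empty.
```

Keeping the per-commit tier to static checks like these is what lets it finish in seconds, while anything that injects real faults is reserved for the later tiers.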

Phased rollout removes the expertise barrier

AWS recommends a three-phase rollout. Pilot (1-2 weeks, 2-3 engineers): select a non-critical application, enable AWS Config, deploy the discovery agent on Bedrock AgentCore, run a baseline Resilience Hub assessment. Expansion (4-6 weeks, cross-functional): scale to 3-5 applications, configure automated test creation, run experiments at 1% scope during low traffic. Enterprise (8-12 weeks, dedicated team): multi-account hub-and-spoke architecture, tiered resilience policies, centralized dashboards.

Implementation effort: 4-6 hours of setup for the pilot, with a team of 2-3 engineers who know their AWS environment. That includes setting up AWS Resilience Hub, Fault Injection Service, Bedrock AgentCore, Systems Manager, Config, and CloudWatch. The guide claims the framework can reduce MTTR by 50% and event costs by up to 58%, citing the 2024 IBM Security Services Benchmark Report, though those numbers come from organizations with mature response capabilities, not specifically from this framework.

What this changes next

The next frontier, according to AWS, is shifting validation even earlier: scanning Infrastructure as Code and application code for resilience anti-patterns at the pull request stage, before a single resource is deployed. When your CI/CD pipeline can flag a missing circuit breaker or a single-AZ dependency during code review, prevention becomes truly proactive.


Source: Architecting AI-powered resilience framework on AWS
Domain: aws.amazon.com
