An AI agent that consumes 266,900 tokens with a 0% error rate isn't healthy — it's stuck in an infinite loop, and standard dashboards won't tell you that.
Amazon Bedrock AgentCore's new observability layer gives you three tools: CloudWatch metrics for system health, OpenTelemetry traces for step-by-step execution, and structured logs for root‑cause analysis. Each piece fills a hole that standard APM left open. Let me show you how they work on real failure patterns.
The 177-Span Loop Your Dashboards Miss
Quality failures (well-formed output that's factually wrong) and efficiency problems like infinite loops rarely trigger error alerts. In one trace from the AWS blog, a single agent execution ran 177 spans with an average latency of 85,590 ms, or roughly 86 seconds. A normal response completes in 1–5 seconds. The GenAI Observability Dashboard showed 266.9K total tokens and a pristine 0% error rate.
The CloudWatch Logs Insights query that caught it: filter TokenUsage > 10000 | sort TokenUsage desc. The top session had 177 spans. The trace waterfall revealed the root cause in the system prompt: "never give up" with no termination condition. The agent recalculated 25% of 100 repeatedly, getting values like 24.95% and 25.049%, never landing on exactly 25.00%.
Fix: add explicit termination conditions to your prompt (e.g., "after three identical attempts, stop and explain why"), set a maximum token limit per session (5,000–10,000 for conversational agents), and cap reasoning steps at 10–15.
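The token and step caps can also be enforced in code around the agent loop rather than left to the prompt alone. The following is a minimal sketch, not an AgentCore API: SessionBudget, BudgetExceeded, and record_step are hypothetical names, and the default limits are simply the upper ends of the ranges suggested above.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a session passes one of its configured limits."""


@dataclass
class SessionBudget:
    # Hypothetical defaults: upper ends of the ranges suggested in the text.
    max_tokens: int = 10_000
    max_steps: int = 15
    tokens_used: int = 0
    steps_taken: int = 0

    def record_step(self, tokens: int) -> None:
        """Account for one reasoning step; raise once a limit is exceeded."""
        self.tokens_used += tokens
        self.steps_taken += 1
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"token limit exceeded: {self.tokens_used} > {self.max_tokens}"
            )
        if self.steps_taken > self.max_steps:
            raise BudgetExceeded(
                f"step limit exceeded: {self.steps_taken} > {self.max_steps}"
            )
```

The calling loop would invoke record_step after each model response and, on BudgetExceeded, return an explanation to the user instead of continuing.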
86 Repeated Tool Calls and a Prompt That Never Gives Up
Loop detection failures look different. You'll see identical tool invocations in sequence — not just a few retries, but 86 repeated calls to calculate_percentage with near-identical inputs. The Logs Insights query filter Operation like /InvokeTool/ surfaces the pattern: {"value": 25, "total": 100} returning 24.954% again and again.
The agent had no logic to recognize it had already tried the same thing. The fix is straightforward: track tool invocations and reasoning steps, force termination after three identical repeated actions, and create a CloudWatch alarm on average token usage per session. Catching loops early prevents runaway costs.
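One way to track invocations and stop after three identical actions is sketched below. It assumes the agent loop can intercept each tool call before executing it. RepeatDetector and should_stop are hypothetical names, not part of any AWS SDK.

```python
import json


class RepeatDetector:
    """Flags a run of identical consecutive tool calls."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._last_key = None
        self._streak = 0

    def should_stop(self, tool_name: str, tool_input: dict) -> bool:
        """Record one call; return True once it has repeated max_repeats times in a row."""
        # Serialize with sorted keys so key order does not hide a repeat.
        key = (tool_name, json.dumps(tool_input, sort_keys=True))
        if key == self._last_key:
            self._streak += 1
        else:
            self._last_key = key
            self._streak = 1
        return self._streak >= self.max_repeats
```

This sketch only catches exact repeats. Near-identical inputs, such as values that differ in the third decimal place, would need to be normalized (for example, rounded) before the comparison.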
Incorrect tool selection — a third failure pattern — shows up in the agent's reasoning logs. An agent asked to calculate 25% of 100 selects web_search instead of calculator, then re‑searches with different terms. Clearer tool descriptions with explicit usage examples in the agent's configuration fix that. For example: { "name": "calculator", "description": "Use for mathematical calculations, including percentages. Example: calculating 25% of 100." }.
Five Status Codes That Pinpoint Tool Failure
Tool invocation failures generate real errors — 401, 403, 400, 404, 500 — and your CloudWatch dashboard will show elevated error rates. The challenge is speed: which tool, which error, and what to fix first.
Run filter StatusCode like /4[0-9][0-9]|5[0-9][0-9]/ | stats count(*) by ToolName, StatusCode. In the blog's example, Exception errors (45 occurrences) dominated, pointing to validation failures as the top root cause. A 401 or 403 usually means the Gateway service role attached to the agent's gateway lacks permissions. In that case, update the IAM policy to include the specific action, like lambda:InvokeFunction, on the target function ARN.
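For log records that have already been exported, the same grouping can be reproduced offline. This sketch assumes each record is a dict carrying the ToolName and StatusCode fields used in the query. The count_errors helper is hypothetical, and it matches whole three-digit codes, which is slightly stricter than the query's substring match.

```python
import re
from collections import Counter

# Same 4xx/5xx pattern as the Logs Insights filter, applied to the whole value.
ERROR_STATUS = re.compile(r"4[0-9][0-9]|5[0-9][0-9]")


def count_errors(records):
    """Count error records per (ToolName, StatusCode) pair."""
    counts = Counter()
    for record in records:
        status = str(record.get("StatusCode", ""))
        if ERROR_STATUS.fullmatch(status):
            counts[(record.get("ToolName", "unknown"), status)] += 1
    return counts
```

Calling count_errors(records).most_common() then lists the worst tool and status combinations first, which answers the question of what to fix first.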
For 400 errors, compare the agent's input against the tool's expected schema. The agent might pass {"customer_id": 12345, "amount": "100.00"} when the tool expects a string for customer_id and a number for amount. Fix by updating the tool's schema or by adding input validation in the agent's prompt. For 404 and 500 errors, check the tool's own logs — the problem may not be in the agent at all.
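The schema comparison for 400 errors can be sketched as a simple type check. The expected types below come from the example above (a string customer_id and a numeric amount). EXPECTED_TYPES and find_type_mismatches are hypothetical names, and a real tool would more likely validate against its published JSON schema.

```python
# Expected field types, taken from the example in the text.
EXPECTED_TYPES = {"customer_id": str, "amount": (int, float)}


def find_type_mismatches(payload, expected=EXPECTED_TYPES):
    """Return a list of human-readable problems; an empty list means the payload fits."""
    problems = []
    for field, expected_type in expected.items():
        if field not in payload:
            problems.append(f"{field}: missing")
            continue
        value = payload[field]
        allowed = expected_type if isinstance(expected_type, tuple) else (expected_type,)
        # bool is a subclass of int in Python, so reject it unless explicitly allowed.
        bool_mismatch = isinstance(value, bool) and bool not in allowed
        if bool_mismatch or not isinstance(value, allowed):
            wanted = " or ".join(t.__name__ for t in allowed)
            problems.append(f"{field}: expected {wanted}, got {type(value).__name__}")
    return problems
```

Run against the payload from the example, {"customer_id": 12345, "amount": "100.00"}, it reports both fields as mismatched.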
Set a CloudWatch alarm that fires when any tool's error rate exceeds 5%. Convert your diagnostic Logs Insights queries into persistent dashboard widgets so your team sees agent health without rerunning queries manually. To evaluate tool accuracy automatically at scale, use Bedrock AgentCore Evaluators to score agent sessions in real time.
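The 5% rule behind that alarm can be checked against exported records before the alarm itself is wired up. This sketch assumes dict records with ToolName and a numeric StatusCode, and it treats any code of 400 or above as an error. The tools_over_threshold helper is hypothetical, and the real alarm would be defined in CloudWatch rather than in application code.

```python
from collections import defaultdict


def tools_over_threshold(records, threshold=0.05):
    """Return {tool: error_rate} for tools whose error rate exceeds the threshold."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        tool = record["ToolName"]
        totals[tool] += 1
        # Assumption: StatusCode is numeric; 4xx and 5xx count as errors.
        if int(record["StatusCode"]) >= 400:
            errors[tool] += 1
    rates = {tool: errors[tool] / totals[tool] for tool in totals}
    return {tool: rate for tool, rate in rates.items() if rate > threshold}
```

A tool sitting at exactly 5% is not flagged, which matches the wording "exceeds 5%".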
Structured observability turns hours of guesswork into minutes of targeted investigation. The CloudWatch Logs Insights queries from this post are ready to paste into your production environment today.
Source: Debugging production agents with Amazon Bedrock AgentCore Observability
Domain: aws.amazon.com