ProGlove's Serverless Fleet: Observability Costs Nearly Doubled Their AWS Bill

aws.amazon.com · @systems_wire · 1 hour ago · Systems Engineering · 2 comments

At 1 million Lambda functions across thousands of accounts, ProGlove found that forwarding metrics and logs cost more than compute, and the pressure to cut costs also led them to redesign their event routing to eliminate SQS polling.

Tags: proglove, aws lambda, serverless, observability, aws cloudformation, distributed systems

Forwarding all observability data nearly doubled ProGlove’s cloud bill at peak scale. That single line from their postmortem on scaling to 1 million Lambda functions across thousands of AWS accounts is the kind of detail most architecture blogs gloss over. ProGlove, the company behind smart wearable barcode scanners for frontline workers, didn't just hit the number—they documented exactly what broke and what they rebuilt.

Observability Costs Nearly Doubled the Bill

At several dozen accounts, $3 per account per month for a third-party observability platform felt like pocket change. At thousands of accounts, that line item became a significant expense that demanded active management. ProGlove found that forwarding all CloudWatch logs and metrics cross-account cost more than Lambda compute or storage. The fix: differentiate high- vs. low-priority observability data and only move what matters. They brought per-account observability costs down to $0.70, and for inactive accounts they switched to monitoring only a handful of basic metrics, dropping that cost to near zero.
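
The high- vs. low-priority split can be pictured as a small filter that runs before anything leaves the account. The sketch below is illustrative only: the source does not say how ProGlove classifies data, so the `LogRecord` shape, the `HIGH_PRIORITY_LEVELS` set, and the rule that inactive accounts forward no logs are all assumptions.

```python
from dataclasses import dataclass

# Assumption: log level decides priority. The source only says
# "high- vs. low-priority", not how the tiers are defined.
HIGH_PRIORITY_LEVELS = {"ERROR", "FATAL"}


@dataclass(frozen=True)
class LogRecord:
    account_id: str
    level: str
    message: str


def should_forward(record: LogRecord, account_active: bool) -> bool:
    """Return True if the record is worth paying to move cross-account."""
    if not account_active:
        # Assumption: inactive accounts are covered by a few basic
        # metrics, so none of their logs are forwarded.
        return False
    return record.level.upper() in HIGH_PRIORITY_LEVELS


def split_records(records, active_accounts):
    """Partition records into (forward, keep_local) lists."""
    forward, keep_local = [], []
    for record in records:
        is_active = record.account_id in active_accounts
        if should_forward(record, is_active):
            forward.append(record)
        else:
            keep_local.append(record)
    return forward, keep_local
```

Everything in `keep_local` stays in the originating account, so only the `forward` list adds cross-account transfer and ingestion cost.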

Synchronized Schedules Caused a Self-DDoS

Every Lambda function was using the same rate(5 minutes) expression, aligned to the top of the minute across thousands of accounts. The result: a massive metric spike that overwhelmed internal APIs—a self-DDoS. ProGlove’s rule of thumb became "Never do the same thing at the same time everywhere." They built an internal library that enforces jitter, randomized batch offsets, and staggered updates across all scheduled functions.
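
One way to break the top-of-the-minute alignment is to give each function a stable offset inside the 5-minute window. The sketch below is not ProGlove's internal library, which the source does not show. It uses a deterministic hash-based offset instead of a random one, so the delay stays the same across invocations. The function name `schedule_offset` and the choice of key are assumptions.

```python
import hashlib

WINDOW_SECONDS = 300  # the 5-minute rate mentioned in the text


def schedule_offset(account_id: str, function_name: str,
                    window: int = WINDOW_SECONDS) -> int:
    """Map (account, function) to a stable delay in [0, window) seconds.

    Hashing spreads accounts roughly evenly over the window, so they
    no longer all fire at the same second.
    """
    key = f"{account_id}:{function_name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % window
```

A scheduled function could sleep for this many seconds before doing its work, or a deployment tool could translate the offset into a shifted schedule. Either way, load on the internal APIs is spread across the whole window instead of landing in a single burst.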

Removing SQS Slashed Polling Costs Without Sacrificing Reliability

Traditional serverless best practices put an SQS queue between EventBridge and Lambda for resilience. At scale, idle queues still accumulate polling costs, because Lambda keeps making requests even when no messages exist. ProGlove removed Amazon SQS from that path entirely. They replaced the buffer with metric-driven safety: monitoring AsyncEventsDropped and ConcurrentExecutions to stay within quotas without losing events. For dead-letter queues, they moved from per-account polling to a centralized DLQ where the AWS account ID serves as the tenant identifier, accepting weaker tenant isolation as a trade-off that has to be managed with strict discipline.
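
The metric-driven safety net can be sketched as a check over two numbers per account. `AsyncEventsDropped` and `ConcurrentExecutions` are real Lambda CloudWatch metrics, but everything else here is an assumption: the `MetricSnapshot` type, the 80% headroom threshold, and the alert names are hypothetical, and fetching the values from CloudWatch is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSnapshot:
    async_events_dropped: int   # sum of AsyncEventsDropped over the window
    concurrent_executions: int  # max of ConcurrentExecutions over the window


def evaluate(snapshot: MetricSnapshot, concurrency_quota: int,
             headroom: float = 0.8) -> list[str]:
    """Return alert names for one account; an empty list means healthy."""
    alerts = []
    if snapshot.async_events_dropped > 0:
        # Without a queue as a buffer, any dropped event is lost work.
        alerts.append("events_dropped")
    if snapshot.concurrent_executions >= concurrency_quota * headroom:
        # Warn before the quota is reached, not after throttling starts.
        alerts.append("concurrency_near_quota")
    return alerts
```

For example, `evaluate(MetricSnapshot(0, 100), concurrency_quota=1000)` returns an empty list, while a snapshot with dropped events or concurrency at 80% of the quota returns the matching alert names.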

From StackSet Bottlenecks to AWS Service Team Collaboration

AWS CloudFormation StackSets worked beautifully at 50 accounts. At 1 million Lambda functions, StackSets hit a performance ceiling and produced errors that compounded. ProGlove started building a custom deployment engine—until the AWS CloudFormation service team noticed and partnered directly with them to prioritize stability and performance improvements. That collaboration let them keep StackSets as the core mechanism while building a deployment tracking service on Amazon EventBridge and Step Functions as a single-pane-of-glass for retries.
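
The source does not describe the internals of the deployment tracking service, so the following is only a rough in-memory sketch of the bookkeeping such a tracker might do: record the latest outcome per account and decide which accounts still qualify for a retry. The class name, the status strings, and the retry budget of three attempts are all hypothetical. The source builds the real service on EventBridge and Step Functions, not on a Python class.

```python
from collections import defaultdict

MAX_ATTEMPTS = 3  # hypothetical retry budget, not from the source


class DeploymentTracker:
    """Tracks the latest deployment outcome per AWS account."""

    def __init__(self, max_attempts: int = MAX_ATTEMPTS):
        self.max_attempts = max_attempts
        self.status = {}                  # account_id -> "SUCCEEDED" | "FAILED"
        self.attempts = defaultdict(int)  # account_id -> attempts so far

    def record(self, account_id: str, succeeded: bool) -> None:
        """Store the result of one deployment attempt."""
        self.attempts[account_id] += 1
        self.status[account_id] = "SUCCEEDED" if succeeded else "FAILED"

    def accounts_to_retry(self) -> list:
        """Failed accounts that still have retry budget left."""
        return sorted(a for a, s in self.status.items()
                      if s == "FAILED" and self.attempts[a] < self.max_attempts)

    def exhausted(self) -> list:
        """Failed accounts that used up their budget and need a human."""
        return sorted(a for a, s in self.status.items()
                      if s == "FAILED" and self.attempts[a] >= self.max_attempts)
```

Keeping this state in one place is what gives operators a single view of which accounts are done, which are being retried, and which need manual attention.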

A mono-repo for 20 microservices enforces consistent tooling, security scanning, and runtime upgrades across all 1 million functions. The team now treats $0.70 per account as the observability baseline, which keeps the per-account cost 'almost-zero' at under $1 per month. Their conclusion: scaling efficiency faster than growth is the only sustainable path when you operate thousands of AWS accounts.


Source: Lessons learned from scaling to 1 million Lambda functions
Domain: aws.amazon.com
