Huntington Bank Redacts More Than 400 Million Documents in Months, Not Years, with AWS Pipelines

400 million documents, accumulated over nearly a decade, needed systematic redaction of Social Security numbers, account numbers, and addresses. Huntington Bank's initial timeline for this compliance initiative stretched into years. They finished in months by building a pipeline on Amazon Textract, AWS Step Functions, and Lambda that chewed through 10 million documents per day.

Why Sequential Processing Would Have Failed

Original estimates assumed a linear crawl through a decade's worth of on-premises files. Huntington's architects knew they needed both throughput and accuracy above 95%. AWS DataSync moved the 400M files from an SMB share to S3, encrypted in transit and at rest, with KMS managing the keys. That part was straightforward. The hard problem was detecting and redacting sensitive fields at scale without throttling or bottlenecking.

The Step Functions Distributed Map That Made 10M/Day Possible

Huntington used the Step Functions Map state in Distributed mode to launch concurrent Amazon Textract jobs. They organized the documents in S3 into a JSON collection and let Step Functions iterate over it with controlled concurrency. CloudWatch dashboards tracked throttle counts and success rates; the team adjusted concurrency limits on the fly to use as much of Textract's service quota as possible without exceeding it. When Textract returned detected fields with bounding box coordinates and confidence scores, the pipeline wrote that metadata to S3 for validation. A Wait state in the state machine kept each invocation from overlapping with the previous one.
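To make the Distributed Map idea concrete, here is a minimal sketch of what such a state machine definition could look like, built as a Python dictionary and printed as Amazon States Language JSON. It is not Huntington's actual definition: the bucket, manifest key, Lambda function name, concurrency value, and retry settings are all hypothetical, and the worker is assumed to be a Lambda function that calls Textract for one document.

```python
import json


def build_distributed_map(bucket: str, manifest_key: str,
                          worker_lambda: str, max_concurrency: int) -> dict:
    """Return a state machine definition containing one Distributed Map state.

    All argument values are placeholders chosen for illustration.
    """
    return {
        "StartAt": "ExtractAll",
        "States": {
            "ExtractAll": {
                "Type": "Map",
                # Upper bound on parallel child executions; this is the knob
                # a team could tune to stay within a downstream service quota.
                "MaxConcurrency": max_concurrency,
                # Read the list of documents from a JSON file in S3.
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:getObject",
                    "ReaderConfig": {"InputType": "JSON"},
                    "Parameters": {"Bucket": bucket, "Key": manifest_key},
                },
                "ItemProcessor": {
                    "ProcessorConfig": {
                        "Mode": "DISTRIBUTED",
                        "ExecutionType": "STANDARD",
                    },
                    "StartAt": "RunTextract",
                    "States": {
                        "RunTextract": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": worker_lambda,  # hypothetical
                                "Payload.$": "$",
                            },
                            # Back off and retry when the worker fails,
                            # for example after being throttled.
                            "Retry": [{
                                "ErrorEquals": ["States.TaskFailed"],
                                "IntervalSeconds": 5,
                                "MaxAttempts": 4,
                                "BackoffRate": 2.0,
                            }],
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


definition = build_distributed_map(
    bucket="example-documents-bucket",       # hypothetical
    manifest_key="manifests/batch-0001.json",  # hypothetical
    worker_lambda="example-textract-worker",   # hypothetical
    max_concurrency=100,                       # hypothetical
)
print(json.dumps(definition, indent=2))
```

Raising or lowering MaxConcurrency is the simplest way to trade throughput against throttling in a design like this; the retry block absorbs the throttles that still get through.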

Redaction That Verified Before Burning

Huntington didn't just redact blindly. They built a validation step that double-checked detected fields against expected patterns (regex for SSNs, account numbers) before applying the redaction. Open-source libraries like PyMuPDF handled the actual image and PDF masking. Step Functions again orchestrated this second pass, providing retry logic and error hooks. Redacted files landed in an S3 bucket monitored by AWS DataSync for the return trip to on-premises storage.
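The validate-then-redact step can be sketched as follows. This is an illustration under stated assumptions, not Huntington's code: the account-number pattern, the confidence threshold, and the flattened detection record (Text, Confidence, Page, BoundingBox) are hypothetical. The bounding box is assumed to use Textract's convention of ratios relative to page width and height, and the masking function assumes PyMuPDF is installed.

```python
import re

# Expected shapes for sensitive values. The account-number pattern is an
# assumption made for this example; real formats vary by institution.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
ACCOUNT_PATTERN = re.compile(r"^\d{8,17}$")

MIN_CONFIDENCE = 90.0  # hypothetical threshold on a 0-100 confidence score


def should_redact(text: str, confidence: float) -> bool:
    """Accept a detection only if it is confident and matches a known pattern."""
    if confidence < MIN_CONFIDENCE:
        return False
    candidate = text.strip()
    return bool(SSN_PATTERN.match(candidate) or ACCOUNT_PATTERN.match(candidate))


def to_page_rect(bbox: dict, page_width: float, page_height: float) -> tuple:
    """Convert a ratio-based bounding box to absolute (x0, y0, x1, y1)."""
    x0 = bbox["Left"] * page_width
    y0 = bbox["Top"] * page_height
    x1 = x0 + bbox["Width"] * page_width
    y1 = y0 + bbox["Height"] * page_height
    return (x0, y0, x1, y1)


def redact_pdf(src_path: str, dst_path: str, detections: list) -> None:
    """Black out validated detections in a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helpers above need no extras

    doc = fitz.open(src_path)
    for det in detections:
        if not should_redact(det["Text"], det["Confidence"]):
            continue
        page = doc[det["Page"] - 1]  # detection pages assumed 1-based
        rect = fitz.Rect(*to_page_rect(det["BoundingBox"],
                                       page.rect.width, page.rect.height))
        page.add_redact_annot(rect, fill=(0, 0, 0))
    for page in doc:
        page.apply_redactions()  # removes the underlying content permanently
    doc.save(dst_path)
    doc.close()


print(should_redact("123-45-6789", 99.1))
print(to_page_rect({"Left": 0.5, "Top": 0.25, "Width": 0.25, "Height": 0.5},
                   600, 800))
```

In this sketch a low-confidence or non-matching detection is simply skipped. A production pipeline would more likely route those cases to a review queue, since a skipped detection may still be sensitive.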

The Numbers That Matter

Processing throughput hit 10 million documents per day. Redaction accuracy exceeded 95%, meeting PCI DSS compliance requirements. The total cost came in at roughly 5% of the original estimate. Huntington plans to reuse this framework for merger-related document processing, where high-volume redaction is a recurring headache.
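A quick back-of-the-envelope check shows how the throughput figure lines up with the "months" timeline. The two constants come from the article; the assumption that the peak rate is sustained every day is a simplification, and the half-rate scenario is purely illustrative.

```python
import math

TOTAL_DOCUMENTS = 400_000_000   # from the article
PEAK_DOCS_PER_DAY = 10_000_000  # from the article


def days_to_process(total: int, per_day: float) -> int:
    """Whole days needed to process `total` documents at a steady daily rate."""
    return math.ceil(total / per_day)


# At sustained peak throughput.
print(days_to_process(TOTAL_DOCUMENTS, PEAK_DOCS_PER_DAY))      # 40

# At half the peak rate (illustrative scenario only).
print(days_to_process(TOTAL_DOCUMENTS, PEAK_DOCS_PER_DAY / 2))  # 80
```

Even with ramp-up, tuning, and the second redaction pass on top of the extraction pass, a daily rate in this range keeps the whole backlog within a span of months.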

Huntington's approach shows that a well-designed Step Functions workflow plus a purpose-built ML extraction service can turn a multi-year compliance nightmare into a project measured in months. For anyone sitting on a mountain of legacy documents, the lesson is clear: parallelize or perish.

Source: Huntington Bank: Redacting sensitive data from 400M+ documents with AWS
Domain: aws.amazon.com