A customer auditor asked two questions that made an ML lead's stomach drop: "Which exact model weights are running, and who reviewed the code that loads them?" We knew the model's name on Hugging Face. We had no idea which revision we were actually running, because our Dockerfile pulled "latest" from that repo every time we rebuilt the image. Worse, our loading code passed trust_remote_code=True, which meant a Python file written by a stranger, hosted on someone else's repo, ran automatically inside our container every time it started.
Nobody on the team could tell the auditor what was in that file, because nobody had ever opened it. We'd spent real effort locking down our Dockerfiles—pinned base images, no baked-in secrets, non-root users—and then quietly let an entirely different supply chain walk in through the model-loading code without a second glance.
The Unreviewed Supply Chain Step
Traditional container supply chain advice is about code dependencies: pin your packages, scan your base image, don't pull untrusted Docker Hub images blindly. That advice still holds, but the AI era adds an artifact type that doesn't fit cleanly into any of it: the model itself. A checkpoint downloaded from a model hub isn't just data the way a CSV or config file is.
Pickle-based weight files can construct arbitrary Python objects during deserialization, which means loading one can run code chosen by whoever produced the file. Even when a model ships in the safer safetensors format, loading libraries such as transformers support trust_remote_code, which means the model repo can bundle its own Python modeling code that runs automatically at load time. We'd been treating that flag the way people used to treat curl | bash: a thing everyone does because it's convenient and almost nobody stops to read first.
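To make the pickle point concrete, here is a minimal and deliberately harmless sketch. The class name NotJustData is made up for illustration, and eval stands in for whatever callable an attacker would actually choose; nothing here comes from a real checkpoint.

```python
import pickle


class NotJustData:
    """Hypothetical object whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # On load, pickle calls the returned callable with the returned args.
        # eval("6 * 7") is harmless; a hostile file could name any callable.
        return (eval, ("6 * 7",))


blob = pickle.dumps(NotJustData())

# Unpickling never rebuilds NotJustData; it simply runs eval("6 * 7").
loaded = pickle.loads(blob)
print(loaded)  # 42
```

The caller only asked to load bytes, yet the loader evaluated an expression picked by the author of those bytes. That is why a pickle-based checkpoint deserves the same suspicion as an executable.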
How We Got There: Demo Pressure and Unpinned Tags
Months earlier, ahead of a demo, we needed a model architecture not yet merged into mainline transformers. The only path was the model author's own custom code, loaded via trust_remote_code=True. The demo succeeded, and that line stayed in our loading code long after the deadline pressure subsided. No one scheduled time to revisit it because nothing made it urgent, until a customer's audit did.
Two more habits compounded the problem. First, we referenced the model by its repo name with no pinned revision, so the exact weights underlying that name could change upstream between builds without triggering anything resembling a code review. That is the same risk class as an unpinned floating package version, except that model repos have no equivalent of the CVE databases that track PyPI packages. Second, we baked the downloaded weights directly into the image during the build, meaning the only record of what we'd shipped was whatever happened to be on the hub the day CI last ran.
The Fix: Pinning, Vendoring, and a Manifest
The fix wasn't exotic, just overdue. We started pinning every model reference to a specific commit revision instead of a branch name:
from transformers import AutoModel

# before - floating reference, trusts remote code blindly
model = AutoModel.from_pretrained(
    "some-org/custom-extractor",
    trust_remote_code=True,
)

# after - pinned revision, no implicit remote execution
model = AutoModel.from_pretrained(
    "some-org/custom-extractor",
    revision="a1b2c3d4e5f6",  # placeholder; pin the full commit hash in practice
    trust_remote_code=False,
)
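A pin only helps if nobody can quietly swap it back to a branch name. One way to enforce that is a small guard in the loading path or in CI. This is a sketch, not something from our codebase: the helper names is_pinned_revision and require_pinned are hypothetical, and it assumes a pinned revision means a full 40-character hexadecimal commit hash, so the shortened hash in the snippet above is treated as a placeholder and rejected.

```python
import re

# Assumption: a "pinned" revision is a full 40-character lowercase hex commit hash.
_FULL_COMMIT = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision):
    """Return True only for a full commit hash, never a branch or tag name."""
    return isinstance(revision, str) and _FULL_COMMIT.fullmatch(revision) is not None


def require_pinned(repo_id, revision):
    """Raise before any download happens if the revision could float."""
    if not is_pinned_revision(revision):
        raise ValueError(
            f"{repo_id}: revision {revision!r} is not a full commit hash; "
            "refusing to load a floating reference"
        )
    return revision
```

Calling require_pinned("some-org/custom-extractor", "main") raises ValueError, while a full commit hash passes through unchanged and can be handed to from_pretrained as the revision argument.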
Where the custom architecture code was genuinely necessary, we vendored the specific file into our repository, gave it an actual code review, and imported it locally. That means we no longer receive upstream fixes automatically—someone is responsible for re-pulling and re-reviewing on our schedule. That's the right trade. A dependency that requires your attention for updates is safer than one that updates itself without your knowledge.
We also pinned our base image by digest instead of tag and moved off a community CUDA image we'd picked years earlier because it "just worked," with no idea who maintained it. Then we started writing a small manifest alongside every image build—nothing fancy, just a JSON file recording the base image digest, the pinned model revision, and a SHA256 of the weight files actually shipped. A lightweight manifest beats no record at all; full provenance tooling can wait until the team grows.
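A manifest like that takes only a few lines to produce. The following is a rough sketch under stated assumptions rather than our actual build script: the field names, the function names, and the idea of hashing every file under a single weights directory are all illustrative choices.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(base_image_digest: str, model_repo: str,
                   model_revision: str, weights_dir: Path) -> dict:
    """Record what went into the image: base digest, model pin, weight hashes."""
    files = sorted(p for p in weights_dir.rglob("*") if p.is_file())
    return {
        "base_image_digest": base_image_digest,
        "model_repo": model_repo,
        "model_revision": model_revision,
        # Keys are paths relative to weights_dir, values are SHA256 hex digests.
        "weights": {p.relative_to(weights_dir).as_posix(): sha256_file(p) for p in files},
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write stable, diff-friendly JSON next to the image build output."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

In a build step this would be called once after the weights are downloaded, with the manifest stored alongside the image or attached as a build artifact. Sorting the keys keeps the file stable between builds, so a changed hash shows up as a one-line diff.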
This level of rigor is only worth it for production systems or widely used tools. Pinning revisions and vendoring remote code is friction, and friction has a cost. The line I'd draw is whether the artifact touches production data, a customer, or a deployment boundary. Below that line, iterate quickly. Above it, the few hours of review are cheap compared to explaining to an auditor that nobody read the code that's been running in your containers for eight months.
We spent a lot of energy locking down Dockerfiles and almost none on the model-loading code sitting right next to them because one felt like infrastructure and the other felt like a research detail. That split doesn't hold up anymore. If a model checkpoint can execute code on load the same way a package can, it needs the same review gate a new dependency gets—which raises an awkward question most orgs haven't answered yet: whose job is that review, the security team's or the ML team's, and does either one currently think it's theirs?
Source: Nobody Reviewed the Model. They Just Reviewed the Code Around It
Domain: hackernoon.com