
Unsloth Shrinks 744B-Parameter GLM-5.2 to 239GB for Local Mac Inference

Unsloth's Dynamic 2-bit GGUF quant shrinks the 1.51TB model by 84%, letting the open model run on a Mac with 256GB of unified memory, or on a single 24GB GPU with offloading to system memory.


GLM-5.2, Z.ai's 744B-parameter open model with 40B active parameters and a 1M context window, now runs on a 256GB unified memory Mac thanks to Unsloth's Dynamic 2-bit GGUF quant that shrinks the model from 1.51TB to 239GB. That's an 84% size reduction with only a modest accuracy hit.

239GB from 1.51TB: The Quant Math

Unsloth's Dynamic GGUF scheme doesn't apply uniform precision. It upcasts important layers to 8-bit or 16-bit while quantizing the rest to lower precision. For GLM-5.2, the 2-bit quant (UD-IQ2_M) lands at 239GB and fits on any machine with at least 245GB of total memory (RAM + VRAM). The 1-bit quant drops to 217GB, an 86% reduction, and requires 223GB of memory. Both quants work with MoE offloading on a single 24GB GPU if you have 256GB of system RAM.
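The percentages and memory thresholds above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this article; treating 1.51TB as 1510GB (decimal units) and the helper names `reduction_pct` and `fits` are assumptions made for illustration.

```python
# Sizes in GB as quoted in the article; 1.51TB is assumed to mean 1510GB.
FULL_SIZE_GB = 1510

# Quoted file size and minimum total memory (RAM + VRAM) per quant.
QUANTS = {
    "dynamic 2-bit (UD-IQ2_M)": {"size_gb": 239, "min_memory_gb": 245},
    "dynamic 1-bit": {"size_gb": 217, "min_memory_gb": 223},
}


def reduction_pct(quant_gb: float, full_gb: float = FULL_SIZE_GB) -> float:
    """Percentage of the original size removed by quantization."""
    return (1 - quant_gb / full_gb) * 100


def fits(ram_gb: float, vram_gb: float, min_memory_gb: float) -> bool:
    """True if combined RAM + VRAM meets the quoted minimum."""
    return ram_gb + vram_gb >= min_memory_gb


for name, q in QUANTS.items():
    print(
        f"{name}: {reduction_pct(q['size_gb']):.0f}% smaller, "
        f"fits a 256GB Mac: {fits(256, 0, q['min_memory_gb'])}"
    )
```

The same `fits` check covers the GPU case: 256GB of system RAM plus a 24GB card gives 280GB of total memory, above both quoted minimums.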

Z.ai gave Unsloth day-zero access to the model, which explains the fast turnaround. The full unquantized model demands 1.51TB of disk space, which puts local inference out of reach on consumer hardware without aggressive quantization.

Accuracy vs. Size: Where You Pay the Price

Unsloth used KL divergence to measure quantization fidelity. On top-1 accuracy, the dynamic 1-bit quant scores 76.2% while being 86% smaller, and the 2-bit quant reaches 82% while being 84% smaller. For most tasks, these losses are tolerable. If you need near-lossless performance, the dynamic 4-bit (UD-Q4_K_XL) and 5-bit quants are described as "generally lossless" by Unsloth, though they require 372-570GB of memory.

Mean KLD follows a clean monotonic trend against disk space: the larger the quant, the closer its output distribution stays to the full model, and the 1-bit quant sits at the far end of that curve while still retaining usable quality. Unsloth recommends the 2-bit quant as the best balance of accessibility and accuracy for local deployment.
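For readers unfamiliar with the two metrics, here is a minimal sketch of how mean KL divergence and top-1 agreement can be computed from per-token logits. It is not Unsloth's evaluation harness, the toy logits are invented for illustration, and reading the top-1 figure as "how often the quant picks the same most-likely token as the reference" is an assumption.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats, with P as the reference distribution."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_kld(ref_logits, quant_logits):
    """Average KL divergence over all token positions."""
    values = [
        kl_divergence(softmax(r), softmax(q))
        for r, q in zip(ref_logits, quant_logits)
    ]
    return sum(values) / len(values)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def top1_agreement(ref_logits, quant_logits):
    """Fraction of positions where both models pick the same top token."""
    matches = sum(argmax(r) == argmax(q) for r, q in zip(ref_logits, quant_logits))
    return matches / len(ref_logits)


# Invented toy data: 3 token positions over a 3-token vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.2, 3.0]]
quant = [[1.8, 1.1, 0.2], [2.0, 1.9, 0.3], [0.9, 1.0, 2.7]]

print(f"mean KLD: {mean_kld(ref, quant):.4f}")
print(f"top-1 agreement: {top1_agreement(ref, quant):.2f}")
```

A lower mean KLD means the quantized model's full output distribution stays closer to the reference, which is a stricter signal than top-1 agreement alone.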

Running It: Local Setup with llama.cpp and Unsloth Studio

You can load GLM-5.2 directly in llama.cpp. The process is similar to ollama run but with explicit quantization selection. Unsloth recommends manual download via huggingface_hub to avoid timeout issues. The model supports three thinking modes: non-thinking, High, and Max. Use Max for complex agentic or coding tasks. Disable thinking with --chat-template-kwargs '{"enable_thinking":false}'.
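A manual download through huggingface_hub might look like the sketch below. `snapshot_download` and its `allow_patterns` argument are part of the real library; the helper names, the glob pattern, and any repository id you pass in are assumptions, so check Unsloth's page for the actual repository and file naming.

```python
def quant_patterns(quant_tag: str) -> list[str]:
    """Glob patterns that keep only the files for one quant (assumed naming)."""
    return [f"*{quant_tag}*"]


def download_quant(repo_id: str, quant_tag: str, local_dir: str) -> str:
    """Download one quant's files and return the local folder path."""
    # Imported lazily so quant_patterns works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=quant_patterns(quant_tag),
    )
```

Calling `download_quant(repo_id, "UD-IQ2_M", "models/GLM-5.2")` with the real repository id would fetch only the 2-bit files rather than every quant in the repository, which matters when each variant is hundreds of gigabytes.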

Unsloth Studio, an open-source web UI, adds auto-offloading, multi-GPU detection, tool calling, and code execution. It handles the GGUF download and configuration automatically. Inference parameters like temperature (1.0) and top_p (0.95-1.0) are pre-set for optimal SWE-Bench Pro performance.
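If you run llama.cpp directly instead of Unsloth Studio, you set these parameters yourself. The sketch below assembles a command line from the values quoted in this article. The flags `--model`, `--jinja`, `--temp`, `--top-p`, and `--chat-template-kwargs` exist in llama.cpp, but the model filename is hypothetical, and which binary (`llama-cli` or `llama-server`) accepts every flag depends on your build, so treat the defaults as assumptions.

```python
import json
import shlex


def build_llama_command(
    model_path: str,
    binary: str = "llama-cli",  # assumption: swap for llama-server if needed
    temperature: float = 1.0,   # value quoted in the article
    top_p: float = 0.95,        # low end of the quoted 0.95-1.0 range
    enable_thinking: bool = True,
) -> list[str]:
    """Build an argv list for a llama.cpp run with the quoted sampling settings."""
    args = [
        binary,
        "--model", model_path,
        "--jinja",  # use the model's chat template so template kwargs apply
        "--temp", str(temperature),
        "--top-p", str(top_p),
    ]
    if not enable_thinking:
        # Compact JSON matches the form quoted in the article.
        kwargs = json.dumps({"enable_thinking": False}, separators=(",", ":"))
        args += ["--chat-template-kwargs", kwargs]
    return args


# Hypothetical filename; use the path of the GGUF you actually downloaded.
cmd = build_llama_command("GLM-5.2-UD-IQ2_M.gguf", enable_thinking=False)
print(shlex.join(cmd))
```

Returning a list rather than a single string lets you pass the result straight to `subprocess.run` without shell quoting problems around the JSON argument.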

GLM-5.2 is the strongest open model on Artificial Analysis benchmarks, matching Claude 4.8 Opus, GPT-5.5, and Gemini 3.1 Pro. The 3-bit quants need 290-360GB and the 8-bit quant needs 810GB, pushing the envelope of what consumer hardware can accommodate. But the path from 1.51TB to 217GB is a concrete reminder that quantization, when done dynamically, can make state-of-the-art open models accessible on hardware you can buy today.


Source: Unsloth GLM-5.2 - How to Run Locally
Domain: unsloth.ai
