The proposed methodology combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces the computational and memory requirements of multimodal foundation models (MFMs). It improves model quality through fine-tuning for domain-specific adaptation, and it compresses the MFM with hierarchy-aware mixed-precision quantization and structural pruning of transformer blocks and MLP channels. Inference is further optimized through speculative decoding; model cascading, which routes queries through a small-to-large cascade and uses lightweight self-tests to decide when to escalate to a larger model (a sketch of this routing logic is given below); co-optimization of sequence length, visual resolution, and stride; and graph-level operator fusion. For efficient execution, the processing dataflow is tailored to the underlying hardware architecture and combined with memory-efficient attention to meet on-chip bandwidth and latency budgets. The effectiveness of the methodology is demonstrated on medical MFMs and on code-generation tasks, and extensions toward energy-efficient spiking MFMs are discussed.
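As a rough illustration of the model-cascading step, the following minimal Python sketch routes a query through a small-to-large cascade and escalates only when a lightweight self-test falls below a confidence threshold. All names (CascadeStage, self_test_confidence, the placeholder scoring heuristic, and the "mfm-small"/"mfm-large" stages) are illustrative assumptions, not the session's actual implementation.

```python
# Hypothetical sketch of small-to-large model cascading with a lightweight
# self-test deciding when to escalate; names and heuristics are illustrative.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CascadeStage:
    name: str                        # e.g. "mfm-small", "mfm-large" (assumed)
    generate: Callable[[str], str]   # query -> answer
    cost: float                      # relative inference cost of this stage


def self_test_confidence(query: str, answer: str) -> float:
    """Lightweight self-test placeholder: in practice this could be the small
    model scoring its own answer or a cheap heuristic such as token-level
    log-probabilities; here a trivial length-based stand-in is used."""
    return 0.0 if not answer.strip() else min(1.0, len(answer) / 200.0)


def cascade_route(query: str,
                  stages: List[CascadeStage],
                  threshold: float = 0.7) -> str:
    """Route a query through the cascade, escalating to the next (larger)
    stage whenever the self-test confidence falls below `threshold`."""
    answer: Optional[str] = None
    for stage in stages:
        answer = stage.generate(query)
        if self_test_confidence(query, answer) >= threshold:
            return answer            # cheaper stage passed its self-test
    return answer or ""              # largest model serves as the fallback


if __name__ == "__main__":
    stages = [
        CascadeStage("mfm-small", lambda q: "short draft answer", cost=1.0),
        CascadeStage("mfm-large", lambda q: "detailed, higher-quality answer", cost=8.0),
    ]
    print(cascade_route("Summarise this chest X-ray report.", stages))
```

The design intent is that most queries terminate at the cheap stage, so the average cost stays close to the small model's while hard queries still reach the large model.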
Source: Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models