arXiv 2025

Understanding and Harnessing Sparsity in Unified Multimodal Models

A component-level study of where unified multimodal models can be compressed, why generation modules are harder to prune, and how sparse MoE adaptation recovers generation quality.

Shwai He1,2, Chaorui Deng1, Ang Li2, Shen Yan1,†

1ByteDance Seed, 2University of Maryland, College Park

SparseUnifiedModel overview
Two-stage efficiency optimization: training-free component analysis followed by sparse MoE adaptation.

Motivation

Unified multimodal models do not have uniform redundancy.

Understanding and generation share one system but stress different components. We use training-free pruning as a probe to reveal which components are compressible, which are fragile, and where sparse activation can help.

Component sensitivity: Understanding modules show notable compressibility across task regimes, especially under generation tasks, while generation modules are more tightly coupled to output fidelity.
Task-dependent sparsity: Activation patterns vary across samples and tasks, so static pruning alone can miss useful dynamic structure.
Adaptation target: MoE adaptation turns observed dynamic activation into sparse routing for the generation module.

Evidence

Pruning exposes an asymmetric compression landscape.

Depth and width reduction show that unified models are not uniformly compressible across understanding and generation components.

Depth pruning results under generation tasks
Depth pruning indicates that understanding components can tolerate larger depth reductions under generation tasks.
Width reduction for understanding components under understanding tasks
Understanding-task evaluation separates generally redundant capacity from task-specific sensitivity.
Width reduction for understanding components under generation tasks
Generation-task evaluation further reveals compressibility in understanding components.
Dilemma of compressing generation components
Generation components are more fragile: moderate compression can cause visible quality degradation.

Method

Dynamic sparsity motivates MoE adaptation.

Training-free analysis identifies static redundancy; sparse MoE adaptation uses sample-dependent activation to recover generation quality with fewer active parameters.

Training-Free Component Analysis

Depth pruning and width reduction probe how much each component can be compressed across understanding and generation regimes.
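As a concrete illustration of the depth-pruning probe, the sketch below scores each layer by the cosine similarity between its input and output hidden states (a layer that barely changes its input is a pruning candidate) and drops the most redundant layers. This is a minimal, hypothetical sketch of the general technique, not the paper's exact scoring rule; all function names and thresholds here are illustrative.

```python
import torch


def layer_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Cosine similarity between a layer's input and output hidden states.

    Similarity near 1 means the layer barely transforms its input,
    marking it as a candidate for training-free depth pruning.
    """
    sim = torch.nn.functional.cosine_similarity(
        hidden_in.flatten(1), hidden_out.flatten(1), dim=1
    )
    return sim.mean().item()


def depth_prune(layers: list, scores: list, n_drop: int) -> list:
    """Keep the layers with the LOWEST redundancy scores.

    Layers whose input and output are most similar (highest score)
    are the ones removed; original layer order is preserved.
    """
    keep = sorted(range(len(layers)), key=lambda i: scores[i])[: len(layers) - n_drop]
    return [layers[i] for i in sorted(keep)]
```

In practice such scores would be collected by running a small calibration set through the model and recording per-layer hidden states for both understanding and generation inputs, which is what makes the probe training-free.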

Sparse MoE Adaptation

The generation module is partitioned into experts and sparsely activated, enabling dynamic routing while preserving generation quality.
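The partition-and-route idea can be sketched as follows: a feed-forward block's hidden width is split into equal expert slices, and a learned router sends each token to only the top-k experts, so roughly k/E of the FFN parameters are active per token. This is a generic top-k MoE sketch under assumed sizes; the class name, dimensions, and gating details are illustrative rather than the released models' implementation.

```python
import torch
import torch.nn as nn


class SparseFFNMoE(nn.Module):
    """Sketch: one FFN split into n_experts slices, top-k routed per token."""

    def __init__(self, d_model: int = 64, d_ff: int = 256,
                 n_experts: int = 8, k: int = 4):
        super().__init__()
        assert d_ff % n_experts == 0, "expert slices must divide the FFN width"
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        d_slice = d_ff // n_experts  # each expert owns a slice of the width
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_slice),
                nn.GELU(),
                nn.Linear(d_slice, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-k experts.
        gates = self.router(x).softmax(dim=-1)        # (T, E)
        topv, topi = gates.topk(self.k, dim=-1)       # (T, k)
        topv = topv / topv.sum(-1, keepdim=True)      # renormalize kept gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topi[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * self.experts[e](x[mask])
        return out
```

With 16 experts and 8 active (as in the released checkpoints), a token touches half the expert parameters per forward pass, which is where the inference-time saving comes from.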

Activation sparsity patterns across layers
Activation statistics reveal sample-dependent sparsity patterns that motivate expert partitioning.
Comparison between slimness and sparse adaptation
MoE-based sparse adaptation restores generation quality beyond training-free slimming alone.

Resources

MoE adaptation checkpoints are available on Hugging Face.

The released checkpoints activate half of the generation experts at inference time.

Model | Experts | Hugging Face
BAGEL-MoE-7B-GEN-16to8 | 16 total, 8 active | LLM-Drop/BAGEL-MoE-7B-GEN-16to8
BAGEL-MoE-7B-GEN-32to16 | 32 total, 16 active | LLM-Drop/BAGEL-MoE-7B-GEN-32to16

Citation

Cite this work.

If this project helps your research, please cite the paper.

@misc{he2025understandingharnessingsparsityunified,
  title={Understanding and Harnessing Sparsity in Unified Multimodal Models},
  author={Shwai He and Chaorui Deng and Ang Li and Shen Yan},
  year={2025},
  eprint={2512.02351},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.02351},
}