arXiv 2025

Understanding and Harnessing Sparsity in Unified Multimodal Models

A component-level study of where unified multimodal models can be compressed, why generation modules are harder to prune, and how sparse MoE adaptation recovers generation quality.

Shwai He1,2, Chaorui Deng1, Ang Li2, Shen Yan1,†

1ByteDance Seed, 2University of Maryland, College Park

SparseUnifiedModel overview
Two-stage efficiency optimization: training-free component analysis followed by sparse MoE adaptation.

Motivation

Unified multimodal models do not have uniform redundancy.

Understanding and generation share one system but stress different components. We use training-free pruning as a probe to reveal which components are compressible, which are fragile, and where sparse activation can help.

Component sensitivity: Understanding modules show notable compressibility across task regimes, especially under generation tasks, while generation modules are more tightly coupled to output fidelity.
Task-dependent sparsity: Activation patterns vary across samples and tasks, so static pruning alone can miss useful dynamic structure.
Adaptation target: MoE adaptation turns observed dynamic activation into sparse routing for the generation module.

Evidence

Pruning exposes an asymmetric compression landscape.

Depth and width reduction show that unified models are not uniformly compressible across understanding and generation components.

Depth pruning results under generation tasks
Depth pruning indicates that understanding components can tolerate larger depth reductions under generation tasks.
Width reduction for understanding components under understanding tasks
Understanding-task evaluation separates generally redundant capacity from task-specific sensitivity.
Width reduction for understanding components under generation tasks
Generation-task evaluation further reveals compressibility in understanding components.
Dilemma of compressing generation components
Generation components are more fragile: moderate compression can cause visible quality degradation.

Method

Dynamic sparsity motivates MoE adaptation.

Training-free analysis identifies static redundancy; sparse MoE adaptation uses sample-dependent activation to recover generation quality with fewer active parameters.

Training-Free Component Analysis

Depth pruning and width reduction probe how much each component can be compressed across understanding and generation regimes.
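As a concrete illustration of the depth-pruning probe, the sketch below scores each layer by the cosine similarity between its input and output hidden states (a layer that barely changes its input is a pruning candidate) and drops the most redundant layers. This is a minimal, hypothetical sketch of the general technique, not the paper's exact scoring rule; all function names and thresholds here are illustrative.

```python
import torch


def layer_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Cosine similarity between a layer's input and output hidden states.

    Similarity near 1 means the layer barely transforms its input,
    marking it as a candidate for training-free depth pruning.
    """
    sim = torch.nn.functional.cosine_similarity(
        hidden_in.flatten(1), hidden_out.flatten(1), dim=1
    )
    return sim.mean().item()


def depth_prune(layers: list, scores: list, n_drop: int) -> list:
    """Keep the layers with the LOWEST redundancy scores.

    Layers whose input and output are most similar (highest score)
    are the ones removed; original layer order is preserved.
    """
    keep = sorted(range(len(layers)), key=lambda i: scores[i])[: len(layers) - n_drop]
    return [layers[i] for i in sorted(keep)]
```

In practice such scores would be collected by running a small calibration set through the model and recording per-layer hidden states for both understanding and generation inputs, which is what makes the probe training-free.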

Sparse MoE Adaptation

The generation module is partitioned into experts and sparsely activated, enabling dynamic routing while preserving generation quality.
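The partition-and-route idea can be sketched as follows: a feed-forward block's hidden width is split into equal expert slices, and a learned router sends each token to only the top-k experts, so roughly k/E of the FFN parameters are active per token. This is a generic top-k MoE sketch under assumed sizes; the class name, dimensions, and gating details are illustrative rather than the released models' implementation.

```python
import torch
import torch.nn as nn


class SparseFFNMoE(nn.Module):
    """Sketch: one FFN split into n_experts slices, top-k routed per token."""

    def __init__(self, d_model: int = 64, d_ff: int = 256,
                 n_experts: int = 8, k: int = 4):
        super().__init__()
        assert d_ff % n_experts == 0, "expert slices must divide the FFN width"
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        d_slice = d_ff // n_experts  # each expert owns a slice of the width
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_slice),
                nn.GELU(),
                nn.Linear(d_slice, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-k experts.
        gates = self.router(x).softmax(dim=-1)        # (T, E)
        topv, topi = gates.topk(self.k, dim=-1)       # (T, k)
        topv = topv / topv.sum(-1, keepdim=True)      # renormalize kept gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topi[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * self.experts[e](x[mask])
        return out
```

With 16 experts and 8 active (as in the released checkpoints), a token touches half the expert parameters per forward pass, which is where the inference-time saving comes from.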

Activation sparsity patterns across layers
Activation statistics reveal sample-dependent sparsity patterns that motivate expert partitioning.
Comparison between slimness and sparse adaptation
MoE-based sparse adaptation restores generation quality beyond training-free slimming alone.

Resources

MoE adaptation checkpoints are available on Hugging Face.

The released checkpoints activate half of the generation experts at inference time.

Model | Experts | Hugging Face
BAGEL-MoE-7B-GEN-16to8 | 16 total, 8 active | LLM-Drop/BAGEL-MoE-7B-GEN-16to8
BAGEL-MoE-7B-GEN-32to16 | 32 total, 16 active | LLM-Drop/BAGEL-MoE-7B-GEN-32to16

Citation

Cite this work.

If this project helps your research, please cite the paper.

@misc{he2025understandingharnessingsparsityunified,
  title={Understanding and Harnessing Sparsity in Unified Multimodal Models},
  author={Shwai He and Chaorui Deng and Ang Li and Shen Yan},
  year={2025},
  eprint={2512.02351},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.02351},
}