BAGEL: A New Paradigm for Unified Multimodal AI
Experience BAGEL (7B active parameters), an open-source foundational model from ByteDance's Seed team that delivers exceptional performance in multimodal understanding, text-to-image generation, and advanced image editing.
What is BAGEL?
BAGEL is a cutting-edge open-source multimodal foundational model developed by ByteDance's Seed team, featuring 7 billion active parameters (14 billion total). It's meticulously trained on vast, interleaved multimodal data including language, images, videos, and web content.
At its core is a Mixture-of-Transformer-Experts (MoT) architecture, designed to maximize the model's capacity to learn from diverse multimodal information. Uniquely, BAGEL employs two separate visual encoders to capture pixel-level and semantic-level image features, and it is trained with a "next-group-of-tokens prediction" paradigm.
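The dual-encoder idea can be pictured with a toy sketch. This is illustrative only: BAGEL's real encoders are a VAE (pixel-level) and a ViT (semantic-level), while the functions below are stand-ins that simply show both feature streams feeding one token sequence.

```python
# Illustrative sketch only: the real BAGEL encoders are a VAE (pixel-level)
# and a ViT (semantic-level); these toy functions are stand-ins.

def pixel_encoder(image):
    # Stand-in for the VAE: one "token" per pixel value (pixel-level detail).
    return [("pix", v) for v in image]

def semantic_encoder(image):
    # Stand-in for the ViT: a single pooled "token" (semantic summary).
    return [("sem", sum(image) / len(image))]

def encode_image(image):
    # Dual encoding: both feature streams reach the transformer, giving it
    # pixel-level and semantic-level views of the same image.
    return pixel_encoder(image) + semantic_encoder(image)

tokens = encode_image([0.1, 0.5, 0.9])
print(len(tokens))  # 4 tokens: 3 pixel-level + 1 semantic-level
```

In the real model, both token streams are high-dimensional embeddings rather than scalars, but the structural point is the same: the transformer sees detail and semantics side by side.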
This powerful combination enables BAGEL not only to understand and generate content across modalities but also to perform complex tasks like free-form visual manipulation, multi-view synthesis, and even world navigation—capabilities that collectively constitute "world modeling" functions beyond traditional image editing.
Unleash Powerful Potential
Exceptional Understanding
BAGEL outperforms top open-source VLMs like Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards such as MME, MMBench, MMMU, MM-Vet, and MathVista.
High-Quality Image Generation
Text-to-image quality is comparable to powerful specialized generators like SD3. BAGEL can create impressive and novel visual content from text prompts.
Advanced Image Editing
Going beyond basic editing, BAGEL demonstrates excellent qualitative results in classic image editing scenarios and extends to free-form visual manipulation, surpassing other open-source models.
World Modeling Capabilities
Explore tasks like multi-view synthesis, future frame prediction, 3D manipulation, and world navigation, showcasing BAGEL's emergent "world modeling" abilities.
Mixture-of-Transformer-Experts Architecture
Employs a Mixture-of-Transformer-Experts (MoT) architecture to maximize learning from diverse multimodal data and enable specialized processing.
Fully Open Source
BAGEL is licensed under Apache 2.0, promoting transparency, collaboration, and community-driven development in multimodal AI.
Leading Benchmark Performance
BAGEL consistently outperforms existing open-source models on standard understanding and generation benchmarks and demonstrates strong performance in image editing.
Visual Understanding
Top-Tier
MME: 2388, MMBench: 85.0, MMMU: 55.3, MM-Vet: 67.2, MathVista: 73.1
Text-to-Image Generation
Highly Competitive
GenEval: 0.88 (BAGEL+CoT), WISE: 0.70 (BAGEL+CoT)
Image Editing
Advanced Level
IntelligentBench: 55.3 (BAGEL+CoT), GEdit-Bench (O): 6.52
Scores reflect BAGEL's performance, with some metrics showing results for BAGEL+CoT (Chain-of-Thought). For detailed benchmark tables, please refer to our GitHub repository.
Emergent Properties & Deep Insights
As the volume of multimodal data in BAGEL's pre-training increases, we observe continuous performance improvements in understanding, generation, and editing tasks. This is not just numerical growth, but an emergence of complex capabilities.
Different capabilities manifest at various stages of training: multimodal understanding and generation abilities appear early, followed by basic editing capabilities. More complex, intelligent editing abilities emerge later in the training process.
This staged progression reveals an emergent pattern: advanced multimodal reasoning builds on solid foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of rich visual-semantic context.
🚀 Quick Start with BAGEL
Ready to dive deeper? Follow these simple steps to get BAGEL running on your system.
1 Environment Setup
git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
2 Download Pre-trained Model
from huggingface_hub import snapshot_download

save_dir = "/path/to/save/BAGEL-7B-MoT"  # Please customize this path
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
3 Inference & WebUI
Explore BAGEL's powerful features:
- Open the inference.ipynb file to start experiencing BAGEL in a Jupyter Notebook.
- For a user-friendly graphical interface, install Gradio and launch the WebUI:
pip install gradio
python app.py
Join the BAGEL Community
Frequently Asked Questions (FAQ)
What is BAGEL?
BAGEL is an open-source multimodal foundational model developed by ByteDance's Seed team, with 7 billion active parameters (14 billion total). It is designed for unified multimodal understanding, high-quality text-to-image generation, and advanced image editing tasks, trained on large-scale interleaved multimodal data.
How is BAGEL different from other VLMs?
BAGEL's unique aspects include:
- Exceptional Performance: Outperforms many leading open-source VLMs on standard benchmarks.
- Unified Capabilities: Combines strong understanding, generation, and complex editing abilities in a single model.
- Advanced Editing: Extends to free-form visual manipulation and "world modeling" tasks beyond typical editing.
- MoT Architecture: Utilizes a Mixture-of-Transformer-Experts architecture to enhance learning capabilities.
- Dual Encoders: Simultaneously captures pixel-level and semantic-level features of images.
What are BAGEL's key inference hyperparameters?
Key hyperparameters include:
- cfg_text_scale: Controls adherence to the text prompt (typical values: 4.0–8.0).
- cfg_image_scale: Controls how much detail from the input image is preserved (typical values: 1.0–2.0).
- cfg_interval: Proportion of denoising steps where CFG is applied (typical values: [0.4, 1.0]).
- timestep_shift: Adjusts the distribution of denoising steps (affects layout and details).
- num_timesteps: Total number of denoising steps (typical value: 50).
- cfg_renorm_min: Minimum value for CFG-Renorm (typical value: 0; set to 1.0 to disable).
- cfg_renorm_type: CFG-Renorm method (global, channel, text_channel). If edited images come out blurry, try 'global' and lower cfg_renorm_min or the CFG scales.
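The typical values above can be collected into one configuration dict, shown below as an illustrative sketch. The parameter names follow the list above, but how they are passed to the model is defined in inference.ipynb, and the timestep_shift value here is an assumed placeholder since the text gives no typical value for it.

```python
# Illustrative defaults assembled from the typical values listed above;
# see inference.ipynb for how these are actually passed to the model.
inference_hparams = {
    "cfg_text_scale": 4.0,        # adherence to the text prompt (typical: 4.0-8.0)
    "cfg_image_scale": 1.5,       # how much input-image detail is kept (typical: 1.0-2.0)
    "cfg_interval": [0.4, 1.0],   # fraction of denoising steps with CFG applied
    "timestep_shift": 3.0,        # shifts the denoising-step distribution (assumed value)
    "num_timesteps": 50,          # total denoising steps (typical: 50)
    "cfg_renorm_min": 0.0,        # set to 1.0 to disable CFG-Renorm
    "cfg_renorm_type": "global",  # one of: global, channel, text_channel
}
```

Starting from these values and adjusting one parameter at a time (e.g. raising cfg_text_scale for stricter prompt adherence) is a reasonable way to explore the trade-offs described above.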
For more details, please refer to the GitHub documentation.
How do I get started with BAGEL?
Please follow the Quick Start guide above. This includes cloning the GitHub repository, setting up the Python environment, downloading the pre-trained model, and then using the provided inference.ipynb notebook or the Gradio WebUI (app.py).
Is BAGEL open source?
Yes, BAGEL is licensed under the Apache 2.0 License, making it free to use and modify.
Where can I find BAGEL's pre-trained models?
The pre-trained model (BAGEL-7B-MoT) is available for download from the Hugging Face Hub under the repository ID ByteDance-Seed/BAGEL-7B-MoT. Instructions are provided in the Quick Start guide.
How can I contribute code or report issues?
We welcome all kinds of contributions and feedback!
- To report poor performance or "bad cases," please use GitHub Issue #11.
- Join our Discord server (link will be available on GitHub) for discussions.
- For code contributions, please refer to the contribution guidelines in the main GitHub repository.