
AIMv2

Abstract

AIMv2 is a novel method for pre-training large-scale vision encoders. It extends autoregressive pretraining to a multimodal setting (image and text).

The method pairs the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens.

AIMv2 is a family of open vision models pretrained to autoregressively generate both image patches and text tokens. During pretraining, AIMv2 uses a causal multimodal decoder that first regresses image patches and then decodes text tokens in an autoregressive manner.

Approach

Pretraining

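  • an image $x$ is partitioned into $I$ non-overlapping patches $x_{i}, i\in[1, I]$, forming a sequence of tokens.
  • a text sequence is broken down into $T$ subwords $x_{t}, t\in[I+1,I+T]$.
  • the image tokens and text tokens are concatenated. Either order (image+text or text+image) would work, but image+text is chosen: under the causal mask, text tokens can attend to all preceding image patches, so text prediction is conditioned more strongly on the visual input, which favors a stronger vision encoder. Pixel reconstruction also happens first, in the image segment, so the image is not reconstructed by leaning on the text prompt.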
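
The three steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the helper names `patchify` and `build_sequence`, the `("img", ...)` / `("txt", ...)` tagging, the toy 4x4 single-channel image, and the subword ids are all assumptions made for the example.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch
    tiles, returned in raster order as flat lists of pixel values."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(tile)
    return patches


def build_sequence(image, text_tokens, patch):
    """Return the image-first multimodal sequence as (kind, payload) pairs."""
    image_part = [("img", p) for p in patchify(image, patch)]
    text_part = [("txt", t) for t in text_tokens]
    return image_part + text_part


# Toy 4x4 image and three hypothetical subword ids: I = 4 patches, T = 3 tokens.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
sequence = build_sequence(image, [101, 7, 42], patch=2)
```

Here the first four entries of `sequence` are image patches and the last three are text tokens, which is the image+text ordering chosen above.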

The likelihood of the combined sequence $S$ thus factorizes as:

$$P(S)=\prod_{j=1}^{I+T}P(S_{j}\mid S_{<j})$$

so the model is trained to autoregressively predict the next token in the sequence.

The pretraining setup:

  • a dedicated vision encoder processes the raw image patches
  • the resulting image features are passed to a multimodal decoder alongside the embedded text tokens
  • the decoder then performs next-token prediction on the combined sequence
  • the vision encoder uses prefix self-attention; the multimodal decoder uses causal self-attention

The loss function is designed separately for image and text domains:

$$
\begin{align}
L_{\text{img}} &= \frac{1}{I}\sum_{i=1}^{I} \Vert \hat{x}_{i}(x_{<i};\theta)-x_{i}\Vert_{2}^{2} \quad \ell_{2}\text{ pixel-level regression loss}\\
L_{\text{text}} &= -\frac{1}{T}\sum_{t=I+1}^{I+T}\log P(x_{t}\mid x_{<t};\theta) \quad \text{cross-entropy}
\end{align}
$$

The overall objective is to minimize $L_{\text{text}}+\alpha \cdot L_{\text{img}}$ with respect to the model parameters $\theta$. The image patches are normalized following He et al.
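
To make the objective concrete, the two losses and their weighted sum can be written out in plain Python. This is only a sketch under stated assumptions: the function names are hypothetical, predictions are passed in directly rather than produced by a model, and the value `alpha = 0.4` is an arbitrary illustration, not a value taken from the source.

```python
import math


def image_loss(predicted, target):
    """Mean over patches of the squared L2 distance between predicted and
    target patch vectors."""
    total = 0.0
    for pred_patch, true_patch in zip(predicted, target):
        total += sum((p - t) ** 2 for p, t in zip(pred_patch, true_patch))
    return total / len(target)


def text_loss(logits, token_ids):
    """Mean negative log-likelihood of the target tokens under a softmax."""
    total = 0.0
    for row, token in zip(logits, token_ids):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total -= row[token] - log_norm
    return total / len(token_ids)


def total_loss(predicted, target, logits, token_ids, alpha):
    """Combined objective: L_text + alpha * L_img."""
    return text_loss(logits, token_ids) + alpha * image_loss(predicted, target)


# Two toy patches of dimension 2, and two text positions over a 2-word vocabulary.
predicted = [[1.0, 2.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 1.0]]
logits = [[0.0, 0.0], [0.0, 0.0]]   # uniform predictions
token_ids = [0, 1]
alpha = 0.4                          # hypothetical weight, for illustration only

loss = total_loss(predicted, target, logits, token_ids, alpha)
```

With uniform logits over two words the text loss is $\log 2$, and the image loss is $(4 + 1)/2 = 2.5$, so the combined value is $\log 2 + 0.4 \cdot 2.5$.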

Separate linear layers map the final hidden states of the multimodal decoder to the output dimension of each modality: the patch dimension for vision and the vocabulary size for language.

Architecture

The vision encoder is a ViT.

Prefix Attention: the prefix length is sampled at random as $M\sim \mathcal{U}\{1,2,\dots,I-1\}$. The pixel loss is computed exclusively for non-prefix patches, defined as $\{x_{i}\mid i>M\}$.

  • used in the vision encoder
  • facilitates the use of bidirectional attention during inference without additional tuning
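
The prefix sampling and the resulting attention pattern can be sketched as follows. This is an illustrative reading of prefix attention, not the authors' implementation: it assumes the first $M$ patches attend to each other bidirectionally while the remaining patches attend causally, and the function names are hypothetical.

```python
import random


def sample_prefix_length(num_patches, rng):
    """Draw M uniformly from {1, ..., I - 1}."""
    return rng.randint(1, num_patches - 1)  # randint is inclusive on both ends


def prefix_attention_mask(num_patches, prefix_len):
    """mask[i][j] is True when patch i may attend to patch j (0-based).

    Patches inside the prefix are visible to every position, so the prefix
    attends to itself bidirectionally; the remaining patches are causal.
    """
    return [[j < prefix_len or j <= i for j in range(num_patches)]
            for i in range(num_patches)]


def loss_positions(num_patches, prefix_len):
    """0-based indices of the non-prefix patches, i.e. 1-based i > M."""
    return list(range(prefix_len, num_patches))


rng = random.Random(0)
num_patches = 5
prefix_len = sample_prefix_length(num_patches, rng)
mask = prefix_attention_mask(num_patches, prefix_len)
```

Setting `prefix_len` equal to `num_patches` makes every entry of the mask `True`, which is the fully bidirectional pattern used at inference.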

SwiGLU and RMSNorm: SwiGLU is used as the FFN, and all normalization layers are replaced with RMSNorm, in both the vision encoder and the multimodal decoder.
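
For reference, both components can be written in a few lines of plain Python. This is a sketch of the standard definitions, not the AIMv2 code: the weight names `w_gate`, `w_up`, `w_down`, the bias-free layers, and the `eps` value are assumptions made for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: W_down (SiLU(W_gate x) * (W_up x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy usage with identity weights so the result is easy to check by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
ffn_out = swiglu_ffn([0.0, 1.0], identity, identity, identity)
```

Unlike LayerNorm, RMSNorm does not subtract the mean; it only rescales, so `normed` has a root mean square of approximately 1.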

Multimodal Decoder

  • Image features and raw text tokens are each linearly projected and embedded into $\mathbb{R}^{d_{\text{dec}}}$.
  • the decoder employs causal attention in its self-attention operations
  • the outputs are processed through two separate linear heads to predict the next token in each modality
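
The causal attention used by the decoder can be illustrated with a small mask over the concatenated sequence. This is a minimal sketch with hypothetical sizes (3 image positions followed by 2 text positions); it only shows the attention pattern, not the decoder or its two heads.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


num_patches, num_text = 3, 2
mask = causal_mask(num_patches + num_text)

# Every text position (rows num_patches and later) sees all image positions.
text_rows_see_image = all(
    all(mask[i][j] for j in range(num_patches))
    for i in range(num_patches, num_patches + num_text)
)

# No image position sees any text position, since text comes later.
image_rows_see_text = any(
    mask[i][j]
    for i in range(num_patches)
    for j in range(num_patches, num_patches + num_text)
)
```

With the image-first ordering, `text_rows_see_image` is `True` and `image_rows_see_text` is `False`: text prediction is conditioned on the whole image, while patch regression never depends on the text.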

Post-Training

High-resolution Adaptation