Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating channel multi-group quantization to enlarge codebook capacity and a post rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most \(512\) tokens while sustaining competitive reconstruction quality (rFID = \(1.06\)). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves a superior efficiency-reconstruction trade-off.
Large language models have revolutionized artificial intelligence by utilizing discrete token sequences for next-token prediction. To integrate vision with LLMs, visual tokenizers play a critical role in bridging continuous image spaces and the discrete input format that LLMs expect. VQ-VAEs serve as foundational components for this integration by discretizing image latent spaces, enabling alignment between visual and linguistic modalities in vision-LLM architectures. Despite advancements in reconstruction quality, modern VQ-VAEs face a fundamental challenge: a trade-off between training efficiency and compression ratio. Current approaches fall into two distinct categories. (1) High-compression but high-cost methods (e.g., MaskBit, \(\leq 256\) tokens) demand substantial computational resources, requiring over \(3,000\) GPU hours on A100 clusters, which limits accessibility to well-resourced institutions. (2) Efficient but low-compression methods (e.g., TokenBridge, 4096 tokens; CODA, 2560 tokens) leverage pre-trained VAEs for rapid quantization but fail to achieve the short token lengths necessary for downstream generative tasks.
This work addresses the unmet need for a VQ-VAE framework that concurrently achieves high compression ratios and efficient training. We uncover an inherent relationship between VAEs and VQ-VAEs: under specific conditions, a pre-trained VAE can be systematically transformed into a VQ-VAE with minimal computational overhead. Unlike previous attempts such as TokenBridge and CODA, which compromise on token length, our ReVQ framework leverages pre-trained VAEs to facilitate fast VQ-VAE training while maintaining high compression performance. By integrating channel multi-group quantization to expand codebook capacity and a post rectifier to alleviate quantization errors, ReVQ compresses ImageNet images into at most \(512\) tokens while sustaining competitive reconstruction quality (rFID = \(1.06\)). ReVQ completes full training on a single NVIDIA 4090 in approximately \(22\) hours, in contrast to comparable methods that require \(4.5\) days on \(32\) A100 GPUs.
Multi-group strategies introduce \(B\) independent codebooks \( \mathcal{C}_i \), quantizing as \begin{align} z_q^i = q(z_e^i, \mathcal{C}_i), ~i=1,\cdots,B. \end{align} If the latent is divided along the spatial axis, then \(B=S\). This increases the effective degrees of freedom to \(N \times B\), mitigating training difficulties. However, our research finds spatial splitting suboptimal for visual data: kernel density statistics, computed via linear dimensionality reduction on feature maps partitioned along the spatial versus channel dimensions (see the figure above), reveal highly correlated \(z_e^i\) distributions under spatial splitting, versus relatively independent distributions under channel splitting. To fully utilize the flexibility of the multi-group strategy, we propose a channel multi-group strategy: define \(z_e^i = [Z'_e]_{(i,\cdot)}\) and apply multi-group quantization. When the token length \(B\) differs from the feature dimension \(D\), we perform a secondary spatial splitting after the initial channel-wise division, resulting in feature vectors of dimension \(d = (H \times W \times D)/B\). A sketch of this channel-wise grouping is given below.
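As a concrete illustration of the channel multi-group strategy (a minimal sketch under an assumed tensor layout and naming, not the exact implementation), the code below flattens a latent of shape (batch, \(D\), \(H\), \(W\)) channel-first into \(B\) groups of dimension \(d = (H \times W \times D)/B\) and quantizes each group against its own codebook \(\mathcal{C}_i\).

```python
import torch
import torch.nn as nn


class ChannelMultiGroupQuantizer(nn.Module):
    """Channel multi-group quantization (illustrative sketch).

    A latent of shape (batch, D, H, W) is split channel-first into B groups
    of dimension d = (H * W * D) / B; each group has its own codebook C_i
    with N codes.
    """

    def __init__(self, num_groups: int, codebook_size: int, group_dim: int):
        super().__init__()
        # B independent codebooks, stored as a single (B, N, d) parameter
        self.codebooks = nn.Parameter(
            0.02 * torch.randn(num_groups, codebook_size, group_dim)
        )
        self.num_groups = num_groups

    def forward(self, z_e: torch.Tensor):
        b = z_e.shape[0]
        # channel-first split: (batch, D, H, W) -> (batch, B, d)
        z = z_e.reshape(b, self.num_groups, -1)
        # nearest-neighbour lookup within each group's own codebook
        dists = torch.cdist(z.transpose(0, 1), self.codebooks)  # (B, batch, N)
        idx = dists.argmin(dim=-1)                               # (B, batch)
        z_q = torch.stack(
            [self.codebooks[i][idx[i]] for i in range(self.num_groups)], dim=1
        )                                                        # (batch, B, d)
        # the selected codes are returned directly; since the VAE is frozen,
        # gradients from a latent-space loss can update the codebooks
        return z_q.reshape_as(z_e), idx
```

The selected codebook vectors are returned as-is, so they can later be optimized by a loss defined directly in the latent space.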
We propose the Non-Activation Reset strategy. Specifically, during each training epoch, for the codebook \( \mathcal{C} \), we record the activation count \( t_i \) of each code \( c_i \). At the end of the epoch, we sort the indices of the \( N \) codes in ascending order of their \( t_i \) values, obtaining \( I = \{i_1, i_2, \cdots, i_N\} \). When there are \( r \) unactivated codes (i.e., the first \( r \) indices in \( I \) have \( t_i = 0 \)), we perform the following reset operation: \begin{align} \label{equ:reset} c_{i_u} \leftarrow c_{i_{N + 1 - u}} + \epsilon, ~ u=1, \cdots, r, \end{align} where \( \epsilon \) is a small random perturbation to avoid overlapping between codes after the reset. Intuitively, this operation resets unactivated codes to the vicinity of highly activated codes, sharing the burden of frequently activated codes and promoting a more uniform activation frequency across the codebook. We find that methods balancing codebook activation frequencies effectively prevent codebook collapse. Our reset strategy requires no additional loss functions or computational steps during training; only a single reset operation at the end of each epoch is needed, making it a plug-and-play module in code (see the sketch below).
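A minimal sketch of the reset in Eq. (\ref{equ:reset}), assuming per-epoch activation counts are tracked in a tensor alongside the codebook; the perturbation scale `eps_scale` is an assumed hyperparameter.

```python
import torch


@torch.no_grad()
def non_activation_reset(codebook: torch.Tensor, counts: torch.Tensor,
                         eps_scale: float = 1e-3) -> None:
    """Apply the Non-Activation Reset once at the end of an epoch (sketch).

    codebook: (N, d) tensor of codes c_i; counts: (N,) activation counts t_i.
    """
    order = counts.argsort()            # indices sorted by ascending t_i
    r = int((counts == 0).sum())        # number of unactivated codes
    if r == 0:
        return
    dead = order[:r]                    # i_1, ..., i_r       (t_i = 0)
    busy = order.flip(0)[:r]            # i_N, ..., i_{N+1-r} (most activated)
    # c_{i_u} <- c_{i_{N+1-u}} + eps,  u = 1, ..., r
    codebook[dead] = codebook[busy] + eps_scale * torch.randn_like(codebook[busy])
```

The activation counts would then be zeroed at the start of the next epoch.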
We introduce the Quantize-then-Rectify (ReVQ) framework in this section. The proposed method posits that a rectifier \(g\) should be constructed on top of the quantized features \( Z_q \) produced by the quantizer \(q\). The reconstructed quantized features under ReVQ are thus given by: \begin{align} \label{equ:revq} Z_e' = g\left(q(Z_e, \mathcal{C})\right). \end{align}
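The composition in Eq. (\ref{equ:revq}) reduces to a two-line forward pass. The minimal sketch below, with assumed class and attribute names, wires together the channel multi-group quantizer sketched earlier and any rectifier with matching input/output dimensions; the frozen VAE encoder and decoder sit outside this module.

```python
import torch.nn as nn


class ReVQ(nn.Module):
    """Quantize-then-Rectify: Z_e' = g(q(Z_e, C)) (illustrative sketch)."""

    def __init__(self, quantizer: nn.Module, rectifier: nn.Module):
        super().__init__()
        self.quantizer = quantizer  # q(., C): nearest-neighbour codebook lookup
        self.rectifier = rectifier  # g: corrects the quantization error

    def forward(self, z_e):
        z_q, idx = self.quantizer(z_e)  # quantize
        z_e_hat = self.rectifier(z_q)   # then rectify
        return z_e_hat, idx
```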
Rectifier Design. DC-AE is a highly practical study that proposes a high-compression VAE architecture capable of compressing images into \(2048\)-dimensional vectors. This model employs a specially designed residual structure for image reconstruction and incorporates EfficientViT blocks in its deeper stages. In our ReVQ framework, since we do not upsample or downsample the latent variables, we directly utilize an EfficientViT block as the rectifier model \(g\), which keeps the input and output dimensions identical.
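For readers who prefer code, the block below is only a stand-in for the rectifier \(g\): a plain pre-norm attention-plus-MLP block whose input and output shapes are identical, mirroring the absence of up/downsampling. It does not reproduce the EfficientViT internals used in practice, and `dim`, `num_heads`, and `mlp_ratio` are assumed hyperparameters.

```python
import torch.nn as nn


class RectifierBlock(nn.Module):
    """Shape-preserving ViT-style block standing in for the rectifier g."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        # x: (batch, tokens, dim); the latent is assumed to be reshaped into
        # token form before entering the block, and the output shape matches
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```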
Training Loss. Conventional VQ-VAE training typically involves a combination of loss functions, such as perceptual loss, PatchGAN loss, and standard \(l_2\)/\(l_1\) losses. In our ReVQ framework, however, to avoid heavy computational loads, we treat the VAE as a black box without computing its gradients. Consequently, we only apply an \(l_2\) loss in the latent space of \(Z_e\) for training. The final optimization objective is: \begin{align} \label{equ:loss} \min_{\theta_g, \mathcal{C}} L_{\text{ReVQ}} = \left\Vert Z_e - g\left(q(Z_e, \mathcal{C})\right) \right\Vert^2_2, \end{align} where \(\theta_g\) denotes the parameters of the rectifier model and \(\mathcal{C}\) represents all codebook parameters.
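The objective in Eq. (\ref{equ:loss}) translates into a very small training step. The sketch below assumes the latents \(Z_e\) are pre-computed by the frozen VAE encoder under `torch.no_grad()`, and that the optimizer holds only the rectifier parameters \(\theta_g\) and the codebooks \(\mathcal{C}\); function and variable names are illustrative.

```python
import torch.nn.functional as F


def revq_training_step(revq, z_e, optimizer):
    """One step of min || Z_e - g(q(Z_e, C)) ||_2^2 (illustrative sketch)."""
    z_e_hat, _ = revq(z_e)            # quantize, then rectify
    loss = F.mse_loss(z_e_hat, z_e)   # plain l2 loss in the latent space
    optimizer.zero_grad()
    loss.backward()                   # gradients reach theta_g and the
    optimizer.step()                  # selected codebook entries only
    return loss.item()
```

In this sketch, because the quantizer returns the selected codebook vectors directly and no trainable encoder precedes it, the codebooks receive gradients through the rectifier without a straight-through estimator.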
We conduct a comparative analysis of our ReVQ model against leading VQ-VAEs on ImageNet. Evaluations are conducted on the validation set, employing four standard metrics: PSNR, SSIM, LPIPS, and rFID. Two salient observations emerge from our results. First, the model with a token length of 512 demonstrates superior performance across all metrics, surpassing both its "Fine Tuning" and "Frozen" counterparts. Additionally, the configuration with a token length of 256 and a codebook size of 262,144 achieves notable outcomes, surpassing all other 256-token-length models except MaskBit. Second, our model exhibits a significant advantage in training efficiency. Compared with the publicly available training durations of existing approaches, ReVQ reduces the total GPU hours by \(40\times \sim 150\times\).
We also compare the reconstruction quality of ReVQ with other VQ-VAEs. The red-boxed regions highlight ReVQ’s superior ability to preserve fine-grained details, particularly in areas involving complex textures and facial features.
Fig. 6 demonstrates the superiority of the multi-group strategy over the single-group strategy in quantization. We randomly initialize several 2D data points, each represented by 2 tokens. For the single-group strategy, due to its inherent symmetry constraint, the reconstructed data points are forced to be symmetric about the line \(y = x\), leading to a quantization error of \(0.7\). In contrast, the multi-group strategy, with its higher degree of freedom, can better adapt to the true data distribution, achieving a minimum quantization error of \(0.2\). We also quantitatively compare space-based and channel-based splitting. Table 4 reports the rFID values for splitting along the spatial and channel dimensions; under both 512-token and 256-token lengths, channel splitting consistently outperforms spatial splitting.
We first visualize the dynamic process of codebook changes under this strategy in Fig. 7. We randomly initialize several 2D data points, each represented by 1 token. Without the reset strategy, the codebook is heavily influenced by the initialization, resulting in only a few codes being used (e.g., only 2 codes in this case) and a quantization error of \(2.8\). With the reset strategy, inactive codes are reset to data-dense regions during training, as shown by the orange dashed arrows in the figure. This ensures all codes are used, reducing the quantization error to \(0.4\). To demonstrate the effectiveness of this strategy more thoroughly, we conduct quantitative experiments on 10% of the ImageNet dataset, as shown in Fig. 8. The results show that without the reset strategy, codebook utilization decreases rapidly as the codebook size increases, with only 65.3% of the codes utilized. In contrast, with the reset strategy, codebook utilization remains above 97%, without significant decline, as the codebook size increases.
We initiate our analysis by evaluating the impact of the rectifier module on model performance in Fig. 9. We conduct training on the ImageNet dataset using different token lengths and their corresponding codebook sizes, with consistent training strategies and an identical rectifier design. The use of the rectifier consistently reduces reconstruction loss across all token lengths. Notably, the improvement is more pronounced when the baseline model is weaker.
We further examine how different architectural designs of the rectifier affect model performance. In particular, we investigate three rectifier designs employing ViT, CNN, and MLP backbones. All other training settings are kept identical. Empirical evidence indicates that the ViT rectifier consistently surpasses its CNN and MLP counterparts across both configurations.
Additionally, conventional VQ-VAEs adopt symmetrical architectural designs, which raises a natural question: why not use an additional encoder before the quantizer? In Table 6, we explore adding an encoder that matches the rectifier's architecture before the quantizer in an attempt to improve reconstruction performance. We find that this greatly increases training difficulty, causing a significant rise in rFID. Thus, we ultimately choose not to add an extra encoder before the quantizer.