Recent advances in training large multimodal models have been driven by efforts to eliminate modeling constraints and unify architectures across domains. Despite these strides, many existing models still rely on separately trained components such as modality-specific encoders and decoders.
In a new paper JetFormer: An Autoregressive Generative Model of Raw Images and Text, a Google DeepMind research team introduces JetFormer, a groundbreaking autoregressive, decoder-only Transformer designed to model raw data directly. The model maximizes the likelihood of raw text and image data without depending on any pre-trained components, and can both understand and generate the two modalities seamlessly.
The team summarizes the key innovations in JetFormer as follows:
- Leveraging Normalizing Flows for Image Representation: The pivotal insight behind JetFormer is its use of a powerful normalizing flow (termed a “jet”) to encode images into a latent representation suitable for autoregressive modeling. Autoregression directly over raw pixels has long been impractical because pixel data is extremely high-dimensional and dominated by low-level detail. JetFormer’s flow addresses this by providing a lossless, invertible representation that integrates seamlessly with the multimodal model; at inference, the flow’s invertibility makes image decoding straightforward (a minimal coupling-layer sketch illustrating this invertibility follows this list).
- Guiding the Model to High-Level Information: To steer the model toward essential high-level information, the researchers employ two strategies:
  - Progressive Gaussian Noise Augmentation: During training, Gaussian noise is added to the images and its strength is gradually annealed to zero, encouraging the model to prioritize overarching, global features early in the learning process (see the noise-schedule sketch below).
  - Managing Redundancy in Image Data: JetFormer can selectively exclude redundant dimensions of natural images from the autoregressive model; the team also explores Principal Component Analysis (PCA) as a way to reduce dimensionality without sacrificing critical information (see the PCA sketch below).
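To make the “jet” idea concrete, here is a minimal numpy sketch of a single RealNVP-style affine coupling layer, the standard invertible building block behind such flows. It is an illustrative toy under assumed names and shapes, not JetFormer’s actual architecture (the paper’s flow is a much deeper stack of coupling layers with learned conditioning networks); the tiny linear conditioner in particular is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """Minimal RealNVP-style affine coupling layer (illustrative only).

    One half of the input conditions an affine transform of the other
    half. The transform is exactly invertible, so encoding an image
    into this latent space is lossless.
    """

    def __init__(self, dim):
        half = dim // 2
        # Toy linear "conditioner"; a real flow uses deep networks here.
        self.w_scale = 0.01 * rng.standard_normal((half, half))
        self.w_shift = 0.01 * rng.standard_normal((half, half))

    def forward(self, x):
        x1, x2 = np.split(x, 2, axis=-1)
        log_s = x1 @ self.w_scale           # per-dimension log-scale
        t = x1 @ self.w_shift               # per-dimension shift
        z2 = x2 * np.exp(log_s) + t         # affine transform of x2
        log_det = log_s.sum(axis=-1)        # exact log|det Jacobian|
        return np.concatenate([x1, z2], axis=-1), log_det

    def inverse(self, z):
        z1, z2 = np.split(z, 2, axis=-1)
        log_s = z1 @ self.w_scale           # recompute from unchanged half
        t = z1 @ self.w_shift
        x2 = (z2 - t) * np.exp(-log_s)      # undo the affine transform
        return np.concatenate([z1, x2], axis=-1)

# Round-trip check: encoding then decoding recovers the input exactly.
layer = AffineCoupling(dim=8)
x = rng.standard_normal((4, 8))             # e.g. flattened image patches
z, log_det = layer.forward(x)
assert np.allclose(layer.inverse(z), x)
```

The exact log-determinant returned by forward is what lets the flow’s change of variables be folded into the model’s likelihood, so the flow and the Transformer can be trained end-to-end on raw pixels.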
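The noise curriculum can likewise be sketched in a few lines. The cosine-decay schedule and all names below are assumptions for illustration; the paper’s exact schedule and noise magnitude may differ.

```python
import numpy as np

def noise_scale(step, total_steps, max_sigma=3.0):
    """Hypothetical cosine decay from max_sigma down to zero."""
    progress = min(step / total_steps, 1.0)
    return max_sigma * 0.5 * (1.0 + np.cos(np.pi * progress))

def augment(images, step, total_steps, rng):
    """Add curriculum Gaussian noise to a batch of images."""
    sigma = noise_scale(step, total_steps)
    return images + sigma * rng.standard_normal(images.shape)

rng = np.random.default_rng(0)
batch = rng.uniform(size=(2, 32, 32, 3))    # toy image batch in [0, 1]
noisy_early = augment(batch, step=0, total_steps=10_000, rng=rng)     # heavy noise
noisy_late = augment(batch, step=9_999, total_steps=10_000, rng=rng)  # almost none
```

Early in training the heavy noise masks fine texture, so only global structure is learnable; as the noise anneals away, the model can progressively attend to finer detail.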
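Finally, a sketch of the PCA option: projecting flattened patches onto their top-k principal components drops the most redundant pixel dimensions while retaining an approximate reconstruction. Function names and shapes are illustrative, not the paper’s implementation.

```python
import numpy as np

def pca_reduce(patches, k):
    """Project flattened image patches onto their top-k principal components."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    # Right-singular vectors are the principal directions of the data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                      # (k, dim)
    return centered @ components.T, components, mean

def pca_restore(codes, components, mean):
    """Approximate reconstruction from the retained components."""
    return codes @ components + mean

rng = np.random.default_rng(0)
patches = rng.standard_normal((256, 48))     # toy stand-in for 4x4x3 patches
codes, comps, mu = pca_reduce(patches, k=16)
recon = pca_restore(codes, comps, mu)
print("mean squared reconstruction error:", np.mean((patches - recon) ** 2))
```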
The team evaluated JetFormer on two challenging tasks: class-conditional image generation on ImageNet and web-scale multimodal generation. The results show that, when trained on large-scale data, JetFormer is competitive with less flexible models on both image and text generation, and its fully end-to-end training further underscores its flexibility and effectiveness.
JetFormer represents a significant leap in simplifying multimodal architectures by unifying the modeling of text and images. Its innovative use of normalizing flows and its emphasis on high-level features mark a new era in end-to-end generative modeling. This research lays the groundwork for further exploration of unified multimodal systems, paving the way for more integrated and efficient approaches to AI model development.
The paper JetFormer: An Autoregressive Generative Model of Raw Images and Text is on arXiv.
Author: Hecate He | Editor: Chain Zhang