Recent advances in training large multimodal models have been driven by efforts to eliminate modeling constraints and unify architectures across domains. Despite these strides, many existing models still rely on separately trained components such as modality-specific encoders and decoders.
In a new paper JetFormer: An Autoregressive Generative Model of Raw Images and Text, a Google DeepMind research team introduces JetFormer, a groundbreaking autoregressive, decoder-only Transformer designed to model raw data directly. The model maximizes the likelihood of raw text and image data without depending on any pre-trained components, and can both understand and generate the two modalities seamlessly.
The team summarizes the key innovations in JetFormer as follows:
- Leveraging Normalizing Flows for Image Representation: The pivotal insight behind JetFormer is its use of a powerful normalizing flow (termed a “jet”) to encode images into a latent representation suitable for autoregressive modeling. Autoregression directly over raw pixels has long been impractical because pixel data is extremely high-dimensional and dominated by low-level detail. JetFormer’s flow addresses this by providing a lossless, invertible representation that integrates seamlessly with the multimodal model; at inference, the flow’s invertibility makes image decoding straightforward (a minimal coupling-layer sketch illustrating this invertibility follows this list).
- Guiding the Model to High-Level Information: To steer the model toward essential high-level information, the researchers employ two strategies:
  - Progressive Gaussian Noise Augmentation: During training, Gaussian noise is added to the images and its strength is gradually annealed to zero, encouraging the model to prioritize overarching, global features early in the learning process (see the noise-schedule sketch below).
  - Managing Redundancy in Image Data: JetFormer can selectively exclude redundant dimensions of natural images from the autoregressive model; the team also explores Principal Component Analysis (PCA) as a way to reduce dimensionality without sacrificing critical information (see the PCA sketch below).
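To make the “jet” idea concrete, here is a minimal numpy sketch of a single RealNVP-style affine coupling layer, the standard invertible building block behind such flows. It is an illustrative toy under assumed names and shapes, not JetFormer’s actual architecture (the paper’s flow is a much deeper stack of coupling layers with learned conditioning networks); the tiny linear conditioner in particular is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """Minimal RealNVP-style affine coupling layer (illustrative only).

    One half of the input conditions an affine transform of the other
    half. The transform is exactly invertible, so encoding an image
    into this latent space is lossless.
    """

    def __init__(self, dim):
        half = dim // 2
        # Toy linear "conditioner"; a real flow uses deep networks here.
        self.w_scale = 0.01 * rng.standard_normal((half, half))
        self.w_shift = 0.01 * rng.standard_normal((half, half))

    def forward(self, x):
        x1, x2 = np.split(x, 2, axis=-1)
        log_s = x1 @ self.w_scale           # per-dimension log-scale
        t = x1 @ self.w_shift               # per-dimension shift
        z2 = x2 * np.exp(log_s) + t         # affine transform of x2
        log_det = log_s.sum(axis=-1)        # exact log|det Jacobian|
        return np.concatenate([x1, z2], axis=-1), log_det

    def inverse(self, z):
        z1, z2 = np.split(z, 2, axis=-1)
        log_s = z1 @ self.w_scale           # recompute from unchanged half
        t = z1 @ self.w_shift
        x2 = (z2 - t) * np.exp(-log_s)      # undo the affine transform
        return np.concatenate([z1, x2], axis=-1)

# Round-trip check: encoding then decoding recovers the input exactly.
layer = AffineCoupling(dim=8)
x = rng.standard_normal((4, 8))             # e.g. flattened image patches
z, log_det = layer.forward(x)
assert np.allclose(layer.inverse(z), x)
```

The exact log-determinant returned by forward is what lets the flow’s change of variables be folded into the model’s likelihood, so the flow and the Transformer can be trained end-to-end on raw pixels.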
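The noise curriculum can likewise be sketched in a few lines. The cosine-decay schedule and all names below are assumptions for illustration; the paper’s exact schedule and noise magnitude may differ.

```python
import numpy as np

def noise_scale(step, total_steps, max_sigma=3.0):
    """Hypothetical cosine decay from max_sigma down to zero."""
    progress = min(step / total_steps, 1.0)
    return max_sigma * 0.5 * (1.0 + np.cos(np.pi * progress))

def augment(images, step, total_steps, rng):
    """Add curriculum Gaussian noise to a batch of images."""
    sigma = noise_scale(step, total_steps)
    return images + sigma * rng.standard_normal(images.shape)

rng = np.random.default_rng(0)
batch = rng.uniform(size=(2, 32, 32, 3))    # toy image batch in [0, 1]
noisy_early = augment(batch, step=0, total_steps=10_000, rng=rng)     # heavy noise
noisy_late = augment(batch, step=9_999, total_steps=10_000, rng=rng)  # almost none
```

Early in training the heavy noise masks fine texture, so only global structure is learnable; as the noise anneals away, the model can progressively attend to finer detail.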
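Finally, a sketch of the PCA option: projecting flattened patches onto their top-k principal components drops the most redundant pixel dimensions while retaining an approximate reconstruction. Function names and shapes are illustrative, not the paper’s implementation.

```python
import numpy as np

def pca_reduce(patches, k):
    """Project flattened image patches onto their top-k principal components."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    # Right-singular vectors are the principal directions of the data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                      # (k, dim)
    return centered @ components.T, components, mean

def pca_restore(codes, components, mean):
    """Approximate reconstruction from the retained components."""
    return codes @ components + mean

rng = np.random.default_rng(0)
patches = rng.standard_normal((256, 48))     # toy stand-in for 4x4x3 patches
codes, comps, mu = pca_reduce(patches, k=16)
recon = pca_restore(codes, comps, mu)
print("mean squared reconstruction error:", np.mean((patches - recon) ** 2))
```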
The team evaluated JetFormer on two challenging tasks: class-conditional image generation on ImageNet and web-scale multimodal generation. The results show that, when trained on large-scale data, JetFormer is competitive with less flexible models on both image and text generation, and its fully end-to-end training further underscores its flexibility and effectiveness.
JetFormer represents a significant leap in simplifying multimodal architectures by unifying the modeling of text and images. Its innovative use of normalizing flows and its emphasis on high-level features mark a new era in end-to-end generative modeling. This research lays the groundwork for further exploration of unified multimodal systems, paving the way for more integrated and efficient approaches to AI model development.
The paper JetFormer: An Autoregressive Generative Model of Raw Images and Text is on arXiv.
Author: Hecate He | Editor: Chain Zhang