Computer vision

ImageBind: a new way to ‘link’ AI across the senses

Introducing ImageBind, the first AI model capable of binding data from six modalities at once, without the need for explicit supervision. The six modalities are images and video, audio, text, depth, thermal, and inertial measurement unit (IMU) data. By recognizing the relationships between them, this breakthrough helps advance AI by enabling machines to analyze many different forms of information together.
Explore the demo to see ImageBind's capabilities across image, audio and text modalities.

Multimodal AI

One embedding to bind them all

For humans, a single image can ‘bind’ together an entire sensory experience. ImageBind aims for the same effect by learning a single embedding space that binds multiple sensory inputs together, without the need for explicit supervision. It can even upgrade existing AI models to support input from any of the six modalities, enabling audio-based search, cross-modal search, multimodal arithmetic, and cross-modal generation.
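To make the idea of a shared embedding space more concrete, here is a minimal sketch of cross-modal search and multimodal arithmetic. It does not call ImageBind itself: the three-dimensional vectors, the file names, and the helpers `normalize`, `cosine`, and `search` are all hypothetical stand-ins, written by hand for illustration. The assumption is only that every modality is mapped to a vector in the same space, so that vectors can be compared and added directly.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def search(query, index):
    """Return the keys of `index` ranked by similarity to `query`, best first."""
    return sorted(index, key=lambda k: cosine(query, index[k]), reverse=True)

# Hand-written toy embeddings (NOT real model outputs).
image_index = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "beach_photo.jpg": [0.0, 0.1, 0.9],
    "dog_on_beach.jpg": [0.6, 0.1, 0.6],
}
audio_bark = [1.0, 0.0, 0.1]   # hypothetical embedding of a barking sound
image_beach = [0.0, 0.0, 1.0]  # hypothetical embedding of a beach image

# Cross-modal search: an audio query retrieves images.
ranked_by_audio = search(audio_bark, image_index)

# Multimodal arithmetic: add an audio and an image embedding, then search.
combined = [a + b for a, b in zip(normalize(audio_bark), normalize(image_beach))]
ranked_by_sum = search(combined, image_index)

print(ranked_by_audio[0])
print(ranked_by_sum[0])
```

In this toy setup the barking sound retrieves the dog photo, and the sum of "barking" and "beach" retrieves the dog-on-a-beach photo. With a real model the vectors would come from modality-specific encoders rather than being typed in by hand.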


Emergent recognition performance

Enabling zero-shot and few-shot recognition

The open source ImageBind model sets a new state of the art (SOTA) on emergent zero-shot recognition tasks across modalities, outperforming even prior specialist models trained specifically for those modalities.
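Zero-shot recognition in a shared embedding space can be sketched as follows: embed the candidate class names as text, embed the sample (here, an audio clip), and pick the class whose text embedding is most similar. This is an illustrative sketch, not ImageBind's evaluation code. The vectors, the label names, the `temperature` value, and the helper `zero_shot_classify` are assumptions made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=0.1):
    """Turn similarity scores into probabilities (temperature is an assumed value)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(sample, label_embeddings):
    """Score a sample embedding against text embeddings of each class name."""
    labels = list(label_embeddings)
    sims = [cosine(sample, label_embeddings[label]) for label in labels]
    return dict(zip(labels, softmax(sims)))

# Hand-written toy embeddings (NOT real model outputs).
label_embeddings = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain": [0.1, 0.9, 0.1],
    "engine": [0.0, 0.2, 0.9],
}
audio_clip = [0.8, 0.2, 0.1]  # hypothetical embedding of an unlabeled audio clip

probs = zero_shot_classify(audio_clip, label_embeddings)
best = max(probs, key=probs.get)
print(best)
```

No audio classifier is trained here: the class set is defined purely by the text labels, so swapping in a different list of labels changes the task without retraining.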
