The AI industry is witnessing a shift toward generative AI models with longer context windows, which promise better understanding and continuity in data processing. Addressing the compute demands that come with those longer windows, Or Dagan, product lead at AI startup AI21 Labs, introduces Jamba, a new generative model.
Unlike most of its counterparts, Jamba offers a context window of 256,000 tokens and can fit up to 140,000 tokens of context on a single GPU with sufficient memory, such as an 80GB NVIDIA A100. Those 140,000 tokens correspond to roughly 210 pages of text, making Jamba a formidable tool for long-document, text-based tasks.
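For a rough sense of scale, here is the back-of-the-envelope arithmetic behind that pages figure, assuming roughly 0.75 English words per token and about 500 words per page (common rules of thumb, not AI21's own figures):

```python
# Rough tokens-to-pages conversion (assumed ratios, not AI21 figures):
# ~0.75 English words per token, ~500 words per printed page.
context_tokens = 140_000
words = context_tokens * 0.75   # ≈ 105,000 words
pages = words / 500             # ≈ 210 pages
print(f"{words:,.0f} words ≈ {pages:,.0f} pages")
```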
What sets Jamba apart is its innovative architecture, which combines two model types: transformers and state space models (SSMs). While transformers excel in complex reasoning tasks with their attention mechanism, SSMs offer computational efficiency and the ability to handle long sequences of data.
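To see why that combination matters, here is a toy NumPy sketch of the scaling difference: attention builds an n×n interaction matrix, so its cost grows quadratically with sequence length, while an SSM carries a fixed-size state through the sequence, so its cost grows linearly. Both functions are illustrative stand-ins, not Jamba's actual layers:

```python
import numpy as np

def toy_attention(x):
    """Self-attention mixes every position with every other one:
    the (n, n) score matrix makes cost and memory grow as O(n^2)."""
    scores = x @ x.T                                 # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x

def toy_ssm(x, a=0.9, b=0.1):
    """A state space layer carries a fixed-size hidden state through
    the sequence, so cost grows only as O(n)."""
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t, x_t in enumerate(x):                      # one state update per token
        h = a * h + b * x_t
        out[t] = h
    return out

x = np.random.randn(1024, 64)                        # 1,024 tokens, 64-dim embeddings
attn_out, ssm_out = toy_attention(x), toy_ssm(x)
```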
Key Features of Jamba
- First production-grade Mamba-based model, built on a novel SSM-Transformer hybrid architecture
- 3X throughput on long contexts compared to Mixtral 8x7B
- Democratizes access to a massive 256K context window
- The only model in its size class that fits up to 140K context on a single GPU
- Released with open weights under Apache 2.0
- Available on Hugging Face and coming soon to the NVIDIA API catalog
Jamba leverages an open source SSM model called Mamba, integrated into its core architecture. Dagan claims that Jamba achieves three times the throughput on long contexts compared to transformer-based models of similar sizes, demonstrating the effectiveness of the SSM approach.
Although Jamba is released under the Apache 2.0 license, AI21 Labs emphasizes that it is a research release not intended for commercial use, citing risks such as toxic or biased output. The company plans to offer a fine-tuned version with enhanced safety features in the near future.
AI21’s Jamba architecture takes a blocks-and-layers approach that lets it integrate the two architectures. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.
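Here is a schematic PyTorch sketch of that interleaving, based only on the structure described above. `JambaStyleBlock` and its stand-in mixers (`nn.MultiheadAttention` for the attention layers, an `nn.GRU` standing in for Mamba's selective-scan SSM) are placeholders, not AI21's implementation:

```python
import torch
import torch.nn as nn

class JambaStyleBlock(nn.Module):
    """One block: an attention OR Mamba-style mixer, then an MLP,
    each wrapped in a residual connection with pre-normalization."""
    def __init__(self, dim: int, use_attention: bool):
        super().__init__()
        self.use_attention = use_attention
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder mixers: real Jamba uses attention and Mamba's
        # selective-scan SSM, not these stand-ins.
        if use_attention:
            self.mixer = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        else:
            self.mixer = nn.GRU(dim, dim, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        if self.use_attention:
            h, _ = self.mixer(h, h, h, need_weights=False)
        else:
            h, _ = self.mixer(h)
        x = x + h                                    # residual around the mixer
        return x + self.mlp(self.norm2(x))           # residual around the MLP

# One attention layer per eight total layers, as described above.
dim, n_layers = 512, 8
layers = nn.ModuleList(
    JambaStyleBlock(dim, use_attention=(i % 8 == 7)) for i in range(n_layers)
)
x = torch.randn(2, 16, dim)                          # (batch, tokens, dim)
for layer in layers:
    x = layer(x)
```

The key point is structural: most layers do cheap sequential mixing, while a sparse sprinkling of attention layers supplies global, content-based routing.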
Dagan remains optimistic about Jamba’s potential, highlighting its scalability and efficiency on a single GPU. He believes that with further refinements, Jamba will continue to deliver superior performance, showcasing the promise of the SSM architecture for the future of generative AI models.
You can start working with Jamba on Hugging Face. As a base model, Jamba is intended to serve as a foundation for fine-tuning, training, and developing custom solutions; guardrails should be added for responsible and safe use. An instruct version will soon be available in beta via the AI21 Platform. To share what you’re working on, give feedback, or ask questions, join the conversation on Discord.
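As a starting point, here is a minimal sketch of loading the released base model with the Hugging Face transformers library. The `ai21labs/Jamba-v0.1` repo id is from the release; the dtype and device settings are assumptions, so check the model card for current hardware and version requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id as published at release; bfloat16 and automatic device
# placement are assumed here, not mandated by AI21.
model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("In the roadmap for hybrid SSM models,",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```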