X.ai, Elon Musk’s AI startup, has unveiled its latest generative AI model, Grok-1.5. Set to power the social network X’s Grok chatbot in the near future, Grok-1.5 promises notable improvements over its predecessor, Grok-1, as indicated by benchmark results and specifications published by X.
Grok-1.5 boasts “improved reasoning,” particularly in coding and math-related tasks, surpassing Grok-1’s performance on various benchmarks. For instance, it more than doubles Grok-1’s score on the MATH benchmark and significantly outperforms it on HumanEval, a test of code generation and problem-solving ability.
However, the real test lies in practical usage, as commonly-used AI benchmarks may not fully capture how individuals interact with models in everyday scenarios.
Features of Grok-1.5
Capabilities and Reasoning
One of the most notable improvements in Grok-1.5 is its performance in coding and math-related tasks. In X.ai’s published tests, Grok-1.5 achieved a 50.6% score on the MATH benchmark and a 90% score on the GSM8K benchmark, two math benchmarks covering a range of grade-school to high-school competition problems. It also scored 74.1% on the HumanEval benchmark, which evaluates code generation and problem-solving abilities.
Long Context Understanding
A new feature in Grok-1.5 is a context window of up to 128K tokens, a 16-fold increase over the previous context length. This expanded capacity lets the model draw on information from substantially longer documents.
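To give a rough sense of what a 128K-token window buys in practice, the sketch below estimates token counts with the common ~4-characters-per-token heuristic for English prose. This heuristic is an assumption for illustration, not Grok’s actual tokenizer, and the 8K figure for the previous window is simply 128K divided by the stated 16x increase.

```python
# Order-of-magnitude illustration of a 128K-token context window.
# CHARS_PER_TOKEN is a rough heuristic for English text, not Grok's
# real tokenizer; treat all numbers here as approximations.

CHARS_PER_TOKEN = 4  # common heuristic for English prose

def approx_tokens(text: str) -> int:
    """Estimate the token count of a string."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(text: str, context_tokens: int = 128_000) -> bool:
    """Check whether a document plausibly fits in the context window."""
    return approx_tokens(text) <= context_tokens

# A 16x jump from an 8K to a 128K window moves the budget from roughly
# a long article (~32,000 characters) to a short book (~512,000).
old_budget_chars = 8_000 * CHARS_PER_TOKEN
new_budget_chars = 128_000 * CHARS_PER_TOKEN
```

By this estimate, a document that overflowed the old window by an order of magnitude can now be passed to the model whole, rather than chunked and summarized.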
According to X.ai, Grok-1.5 can now handle longer and more complex prompts while still maintaining its instruction-following capability, thanks to its expanded context window.
In the Needle In A Haystack (NIAH) evaluation, Grok-1.5 demonstrated powerful retrieval capabilities for embedded text within contexts of up to 128K tokens in length, achieving perfect retrieval results.
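The NIAH evaluation idea itself is simple: plant a small fact (the “needle”) at varying depths inside filler text of varying lengths, then check whether the model can retrieve it. The sketch below shows that harness in miniature; `query_model` is a hypothetical stand-in for an API call, and the scoring is a simplification of published NIAH setups.

```python
import random

def build_haystack(needle: str, filler_sentences: list,
                   n_sentences: int, depth: float) -> str:
    """Embed `needle` at a relative position `depth` (0.0-1.0) in filler."""
    body = [random.choice(filler_sentences) for _ in range(n_sentences)]
    body.insert(int(depth * n_sentences), needle)
    return " ".join(body)

def niah_score(query_model, needle: str, answer: str,
               filler: list, lengths: list, depths: list) -> float:
    """Fraction of (length, depth) cells where the model finds the answer.

    `query_model(prompt) -> str` is a hypothetical stand-in for a real
    model call; perfect retrieval corresponds to a score of 1.0.
    """
    hits = 0
    for n in lengths:
        for d in depths:
            prompt = (build_haystack(needle, filler, n, d)
                      + "\nWhat is the magic number mentioned above?")
            if answer in query_model(prompt):
                hits += 1
    return hits / (len(lengths) * len(depths))
```

A “perfect retrieval” result like the one X.ai reports means every cell of the length-by-depth grid scores a hit, up to the full 128K-token context.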
Grok-1.5 Infra
Cutting-edge large language model (LLM) research on massive GPU clusters demands robust and flexible infrastructure. Grok-1.5 is built on a custom distributed training framework based on JAX, Rust, and Kubernetes.
According to X.ai, this training stack lets its team prototype ideas and train new architectures at scale with minimal effort. A major challenge of training LLMs on large compute clusters is maximizing the reliability and uptime of the training job.
Historically, X.ai’s Grok models have stood out for their willingness to engage with topics typically avoided by other models, such as conspiracy theories and controversial political ideas. They also exhibit a “rebellious streak,” responding in blunt language when asked.
It remains unclear whether Grok-1.5 changes anything in these respects, as X.ai’s blog post does not address them. The announcement comes after X.ai open-sourced Grok-1, albeit without the code needed to fine-tune or further train it.
X.ai says Grok-1.5 will soon be available to early testers, whose feedback will help improve the model, and that several new features will be introduced over the coming days as Grok-1.5 gradually rolls out to a wider audience.