OpenAI has announced the rollout of its new flagship model, GPT-4o, which introduces significant advancements in multimodal AI capabilities, combining text, vision, and now audio functionalities.
This latest iteration takes its “o” from “omni,” reflecting its ability to process text, vision, and audio within a single model, and is set to enhance the user experience across OpenAI’s suite of applications, including the popular ChatGPT.
Multimodal Capabilities and Enhancements in GPT-4o
GPT-4o extends beyond its predecessor GPT-4 Turbo by integrating audio processing, enhancing its utility in real-time interactions. Users can now engage with ChatGPT in a more dynamic manner, such as interrupting the AI mid-response and receiving adjustments based on the conversational context and emotional cues.
This feature promises a more assistant-like experience, making AI interactions feel more natural and intuitive.
The model’s vision capabilities are also enhanced; for example, it can analyze images to answer context-specific questions quickly, ranging from identifying elements in software code to recognizing brands and items in photos. This functionality extends to practical applications like translating menus from other languages or potentially providing live explanations of events, such as sports games.
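As a rough illustration of how such an image query might be expressed against OpenAI’s API, the sketch below assembles a GPT-4o Chat Completions request body that pairs a text prompt with an image URL. The helper `build_vision_request` and the example values are hypothetical, introduced only for this sketch; the payload shape follows the publicly documented Chat Completions message format.

```python
# Sketch: assembling a multimodal GPT-4o request body by hand.
# build_vision_request and the example URL are illustrative, not
# part of the official SDK; only the payload shape follows the API.

def build_vision_request(prompt: str, image_url: str) -> dict:
    """Return a Chat Completions request body mixing text and an image."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

request = build_vision_request(
    "Translate this menu into English.",
    "https://example.com/menu.jpg",
)
print(request["model"])  # gpt-4o
```

With the official `openai` Python package installed and an API key configured, a body like this could be passed to `client.chat.completions.create(**request)`.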
GPT-4o Model Evaluations
As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.
![Improved Reasoning - GPT-4o sets a new high score of 87.2% on 5-shot MMLU (general knowledge questions). (Note: Llama3 400b is still training)](https://i0.wp.com/nosisnews.com/wp-content/uploads/2024/05/image-62.png?resize=755%2C617&ssl=1)
![Audio ASR performance - GPT-4o dramatically improves speech recognition performance over Whisper-v3 across all languages, particularly for lower-resourced languages.](https://i0.wp.com/nosisnews.com/wp-content/uploads/2024/05/image-63.png?resize=819%2C622&ssl=1)
![Audio translation performance - GPT-4o sets a new state-of-the-art on speech translation and outperforms Whisper-v3 on the MLS benchmark.](https://i0.wp.com/nosisnews.com/wp-content/uploads/2024/05/image-64.png?resize=802%2C561&ssl=1)
![M3Exam - The M3Exam benchmark is both a multilingual and vision evaluation, consisting of multiple choice questions from other countries’ standardized tests that sometimes include figures and diagrams. GPT-4o is stronger than GPT-4 on this benchmark across all languages. (We omit vision results for Swahili and Javanese, as there are only 5 or fewer vision questions for these languages.)](https://i0.wp.com/nosisnews.com/wp-content/uploads/2024/05/image-65.png?resize=822%2C575&ssl=1)
![Vision understanding evals - GPT-4o achieves state-of-the-art performance on visual perception benchmarks.](https://i0.wp.com/nosisnews.com/wp-content/uploads/2024/05/image-66.png?resize=799%2C509&ssl=1)
User Interface Enhancements
Looking ahead, GPT-4o aims to simplify how people interact with technology, minimizing the need to focus on the interface itself and allowing users to concentrate on the interaction with the AI. In terms of accessibility, GPT-4o supports an increased number of languages and brings improvements in processing speed, cost-efficiency, and scalability to OpenAI’s API.
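For developers, a minimal sketch of what a text-only GPT-4o call through the API might look like is shown below. The endpoint and header names follow OpenAI’s public Chat Completions REST API; the prompt is a placeholder, and the actual network call is left commented out, since sending it requires a valid API key.

```python
import json
import os
import urllib.request

# Build a text-only Chat Completions request for GPT-4o.
# The prompt is a placeholder; sending requires a real OPENAI_API_KEY.
body = json.dumps({
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "Summarize GPT-4o in one sentence."}
    ],
}).encode("utf-8")

req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=body,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
    },
    method="POST",
)

# with urllib.request.urlopen(req) as resp:  # not executed in this sketch
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])

print(req.full_url)  # https://api.openai.com/v1/chat/completions
```

In practice most applications would use the official `openai` client library rather than raw HTTP, but the request shape is the same either way.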
When using GPT-4o, ChatGPT Free users will now have access to features such as:
- Experience GPT-4 level intelligence
- Get responses from both the model and the web
- Analyze data and create charts
- Chat about photos you take
- Upload files for assistance summarizing, writing, or analyzing
- Discover and use GPTs and the GPT Store
- Build a more helpful experience with Memory
Recognizing the potential risks of misuse with such powerful capabilities, OpenAI plans a cautious rollout of GPT-4o’s audio features, initially limiting access to a select group of trusted partners. This phased approach reflects OpenAI’s commitment to safety and responsible AI usage.
GPT-4o Model Safety and Limitations
GPT-4o has incorporated enhanced safety features right from the design stage. Techniques such as filtering training data and refining the model’s behavior post-training help ensure that the AI operates safely across different modalities. Additionally, new safety systems specifically designed for voice outputs provide necessary guardrails, addressing the unique challenges posed by audio interactions.
The model has been thoroughly evaluated using OpenAI’s Preparedness Framework to assess potential risks in areas such as cybersecurity, chemical, biological, radiological, and nuclear (CBRN) threats, persuasion capabilities, and model autonomy. GPT-4o has consistently been rated as posing no more than a Medium risk across these categories. This evaluation process included a mix of automated and human-led tests, applied to both pre-mitigation and post-mitigation versions of the model, to gauge its safety accurately.
OpenAI has engaged over 70 external experts from various fields, including social psychology, bias and fairness, and misinformation, to perform an extensive red-teaming exercise on GPT-4o. This process helped identify and address risks introduced or amplified by the model’s new capabilities. The insights from these activities are instrumental in shaping ongoing safety interventions, ensuring that GPT-4o remains a secure AI tool as it evolves.
Platform Updates
The model is accessible on the free tier of ChatGPT and to subscribers of ChatGPT Plus and Team plans, which offer higher message limits. OpenAI is also refreshing the ChatGPT UI and has launched a desktop version for macOS, enhancing accessibility and user experience, with a Windows version expected later.
Moreover, OpenAI continues to expand its ecosystem, providing access to the GPT Store and enhancing memory capabilities in ChatGPT, allowing the AI to remember user preferences for more personalized interactions.
You can now have voice conversations with ChatGPT directly from your computer, starting with the Voice Mode that has been available in ChatGPT since launch; GPT-4o’s new audio and video capabilities will arrive in a future update.
![OpenAI launches GPT-4o and more tools to ChatGPT free users](https://i0.wp.com/nosisnews.com/wp-content/uploads/2024/05/image-61.png?resize=1024%2C640&ssl=1)
This series of updates from OpenAI not only showcases significant technological advancements but also illustrates a careful consideration of ethical implications, aiming to enhance user experience while maintaining rigorous standards of safety and accessibility.