This notebook is from the OpenAI Cookbook, enhanced with Portkey observability and features.
Currently, the API supports `{text, image}` inputs only, with `{text}` outputs, the same modalities as `gpt-4-turbo`. Additional modalities, including audio, will be introduced soon.
This guide will help you get started with using GPT-4o for text, image, and video understanding.
We'll use both `system` and `user` messages for our first request, and we'll receive a response from the `assistant` role.
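
Below is a minimal sketch of that first request using the `openai` Python SDK; the prompt is illustrative, and the Portkey gateway configuration used elsewhere in this notebook is omitted for brevity.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # The system message sets the assistant's behavior
        {"role": "system", "content": "You are a helpful assistant."},
        # The user message carries the actual question
        {"role": "user", "content": "Hello! Could you solve 2 + 2?"},
    ],
)

# The reply comes back under the assistant role
print(completion.choices[0].message.content)
```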
Video processing requires `ffmpeg`, so make sure to install it beforehand. Depending on your OS, you may need to run `brew install ffmpeg` or `sudo apt install ffmpeg`.
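
GPT-4o does not take video files directly; a common approach is to sample frames from the video and pass them to the model as image inputs. Here is a hedged sketch of that step using `opencv-python`, where the filename `keynote.mp4` and the sampling interval are illustrative assumptions.

```python
import base64
import cv2

def sample_frames(path: str, every_n: int = 50) -> list[str]:
    """Read a video and return every n-th frame as a base64-encoded JPEG."""
    video = cv2.VideoCapture(path)
    frames, index = [], 0
    while True:
        ok, frame = video.read()
        if not ok:  # end of the video
            break
        if index % every_n == 0:
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer).decode("utf-8"))
        index += 1
    video.release()
    return frames

base64_frames = sample_frames("keynote.mp4")  # hypothetical filename
```

Each encoded frame can then be included as an image content part in the chat request, e.g. `{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame}"}}`.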
`{audio}` input for GPT-4o isn't currently available but will be coming soon! For now, we use our existing `whisper-1` model to process the audio.
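
A hedged sketch of that transcription step with the `openai` SDK is shown below, assuming the audio track has already been extracted to a local file (the filename `keynote_audio.mp3` is illustrative).

```python
from openai import OpenAI

client = OpenAI()

# Transcribe the extracted audio track with whisper-1
with open("keynote_audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# The transcript text can then be passed to GPT-4o as ordinary text input
print(transcription.text)
```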
Comparing the three answers, the most accurate one is generated by using both the audio and the visual frames from the video. Sam Altman did not actually discuss raising the windows or turning on the radio during the keynote; he referenced an improved capability for the model to execute multiple functions in a single request while those examples were shown on the screen behind him.