Google flashes everyone — new Gemini Flash 1.5 takes on GPT-4o
Google has launched a new member of the Gemini family of artificial intelligence models. Sitting between the on-device Nano and cloud-based Pro, Gemini Flash is designed for chat, complex tasks that require a fast response and handling images, video and speech.
Unveiled at the annual Google I/O developer event, Gemini Flash 1.5 is a native multimodal model similar to OpenAI’s recently unveiled GPT-4o and was built for speed, making it useful for real-time conversations.
The new model is currently available globally for developers to use in their own applications, so we could see a number of third-party live chat apps built using Gemini Flash 1.5 soon.
We also saw an upgrade to Gemini Pro 1.5, the model first released earlier this year, along with the news that it will now power the Gemini Advanced premium chatbot.
What makes Gemini Flash 1.5 different?
Gemini Flash 1.5 sits just above Nano and just below Pro in the size hierarchy. What makes it different, not just to its siblings but to other AI models, is the combination of speed and agility.
In addition to being fast and impressive in its ability to understand text, images, video and speech, Flash 1.5 is cheap, at least compared to Pro, which is 20 times more expensive.
“We know from user feedback that some applications need lower latency and a lower cost to serve,” said Google DeepMind CEO Demis Hassabis. “This inspired us to keep innovating,” he added, unveiling Flash as a “model that’s lighter-weight than 1.5 Pro, and designed to be fast and efficient to serve at scale.”
A good comparison, at least in terms of speed, is with OpenAI’s recently announced GPT-4o model. GPT-4o is likewise very fast, natively multimodal and designed for real-time interaction. That said, Gemini Flash 1.5 seems to be a less capable model in terms of reasoning.
What about the massive context window?
Like other Gemini family models, Flash 1.5 comes with a massive one million token context window and the promise of actually being able to utilize it in full. In comparison, GPT-4o has a 128,000 token context window and Claude 3 is at 200,000 tokens.
What makes a large context window so important is that it lets a model hold a massive amount of information in memory within a single conversation. This is vital when it comes to analyzing non-text content, as an image is worth 1,000 words and a video even more.
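To put those figures side by side, here is a minimal Python sketch. The token counts are the ones quoted above; the names `CONTEXT_WINDOWS`, `window_ratio` and `fits` are hypothetical helpers made up for illustration and are not part of any vendor's API.

```python
# Context window sizes in tokens, as quoted in this article.
CONTEXT_WINDOWS = {
    "Gemini Flash 1.5": 1_000_000,
    "Claude 3": 200_000,
    "GPT-4o": 128_000,
}


def window_ratio(model: str, baseline: str) -> float:
    """Return how many times larger model's window is than baseline's."""
    return CONTEXT_WINDOWS[model] / CONTEXT_WINDOWS[baseline]


def fits(prompt_tokens: int, model: str) -> bool:
    """Return True if a prompt of this size fits in the model's window."""
    return prompt_tokens <= CONTEXT_WINDOWS[model]


if __name__ == "__main__":
    for other in ("GPT-4o", "Claude 3"):
        ratio = window_ratio("Gemini Flash 1.5", other)
        print(f"Gemini Flash 1.5 vs {other}: {ratio:.1f}x the context window")
```

On these numbers, Flash 1.5 holds roughly eight times as many tokens as GPT-4o and five times as many as Claude 3, so a 500,000-token input would fit in a single Flash 1.5 conversation but not in either of the other two.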
Flash 1.5 was also trained by its big brother, Gemini Pro 1.5. Hassabis said this was done “through a process called ‘distillation,’ where the most essential knowledge and skills from a larger model are transferred to a smaller, more efficient model.”
“1.5 Flash excels at summarization, chat applications, image and video captioning, data extraction from long documents and tables, and more,” as a result of this process, he said.
As these models, including the faster but smaller ones like Flash, gain the ability to understand more than just text, that increased context window becomes even more important.