Google’s Gemini AI demo is too good to be true (literally)

Google Gemini by Google DeepMind.

This week, Google’s December Feature Drop for Pixel devices was accompanied by a wealth of content showcasing the brand’s latest multimodal AI, Gemini. In a series of videos highlighting the software’s standout features, the brightest minds from Google HQ wowed us with bold claims and presentations.

Enter a slow fade-in from black mixed with “generic-slow-build-piano-music.mp3” as Google CEO Sundar Pichai and Google DeepMind CEO Demis Hassabis talk about their drive and passion for the Gemini project like it was an American Idol audition vignette for the brainiest boyband imaginable.

The dawn of a Gemini era

What followed was a series of talking head moments with the Googleplex’s top brass hyping up the moment as if they’d just figured out how to hardcode the return of the Messiah. Unlike most AI models, Gemini is capable of deciphering a mix of text, code, and media all at once to understand the wider context of what it is tasked with achieving.

For example, in one demonstration, Gemini was pre-prompted via text to act as a kitchen assistant, and then a voice recording was added to the prompt in which a user asked for instructions on how to begin making a veggie omelet using the available ingredients. Finally, a picture of those ingredients was included before Gemini was asked to generate a response.

In the video above, you'll see that Gemini’s reply adopted the role of the kitchen assistant, observed the available ingredients, and then delivered the first step of veggie omelet preparation in the form of a voice note. Impressive stuff, to be sure, especially when the user was able to show Gemini an updated picture of their omelet in progress and ask how it was going, with the AI picking up that the dish was ready to be cooked on the other side.

That's AI-volution, baby

Of course, Gemini’s capabilities extend far beyond the simple frying of an egg. To showcase Gemini’s potential in full, Google had prepared a near-six-minute “hands-on” demonstration with the AI.

Within those six minutes, Gemini was able to play a round of the shell game, solve a dot-to-dot picture, and react in real-time (sometimes unprompted) to a drawing as it was created. All of this while engaging in back-and-forth audio conversation.

Presenting a fluid conversation between user and AI, some advanced image recognition techniques, and a fair amount of personality, the demo showcased the AI of our dreams — quite an apt statement given that the interactions being shown were mostly fiction.

The Gemini lie

Following the release of the “hands-on” video, Bloomberg Opinion columnist Parmy Olson was quick to point out the small print included in the video description, reading: “For the purposes of this demo, latency has been reduced and Gemini outputs have been shortened for brevity.”

However, according to Bloomberg, when asked for comment Google expanded on this disclaimer, revealing that the demo didn’t happen in real time, nor did it involve spoken prompts.

In fact, a Google spokesperson revealed that the video was made using “still image frames from the footage, and prompting via text.” In essence, the same kind of interaction shown in the earlier veggie omelet example.

While ultimately still impressive, the edited video does somewhat misrepresent the experience of using Gemini — painting it as a far more capable tool than it currently is.

Outlook

It’s a shame Google attempted to hot dog and grandstand about the Gemini experience in this way, as it casts a needless shadow of doubt over its new AI model. Google was caught flat-footed by the explosion of AI over the last year and has been burning the midnight oil to play catch-up with competitors like OpenAI and Microsoft. Misleading users isn’t going to help that situation.

But how egregious was this video in reality? I took to Google Bard, which Google recommends visiting if you want to take the Gemini model for an early test run, to try some of the examples seen in the “hands-on” video for myself. The results were… well, not promising.

Google Bard image recognition fail, with Gemini

While slightly frustrated that Google reduced itself to Ubisoft Store levels of ‘bullshotting’ to make its new multimodal AI more captivating, I’m still impressed with what Google had to show, even if much of what Gemini is pictured doing is what Bard is already touted as being capable of (though, admittedly, currently fails to achieve).

Google still has some way to go before catching up with the pack. And while I feel the company is more than capable of doing so, I don't think misleading videos will help in any way.