Google’s annual developer conference, Google I/O, kicked off in the early hours of May 15th, Taiwan time. This year the focus was on upgraded AI capabilities, with the term “AI” mentioned a total of 122 times over the course of the event.
One of the key updates is the deeper integration of Gemini, Google’s “multimodal” model, into search and its assistant products. Starting this year, Google Search can take videos as queries, and it gains an AI Overview feature that uses AI to summarize search results. The intelligent assistant prototype, Astra, can recognize objects and actions in video in real time and respond to questions about them. Google also introduced a new, lighter member of its large language model family, Gemini 1.5 Flash, along with Veo, a model that generates video from text.
Google DeepMind’s leader, Demis Hassabis, made his first appearance at Google I/O, highlighting the latest developments in AI.
AI Revolution 1: Search Engine! Searching videos and understanding complex commands!
Google’s search engine, which has long relied on text and image queries, now supports video search. Users can shoot a video and ask a question by voice or text, and the search engine will automatically analyze the footage and return relevant answers. In one demonstration, a user having trouble with a vinyl record player recorded a video and asked why the needle kept drifting; Google searched automatically and returned a summary through the AI Overview feature.
AI Overview: Understanding complex commands
AI Overview, a technology Google first previewed last year, summarizes and organizes search results at the top of the results page. With the new “multi-step reasoning” capability of the Gemini model, AI Overview can handle complex queries that bundle multiple requirements and details, without the user having to run several separate searches. For example, if a user wants to find the best yoga or Pilates studio in Boston, along with its new-member discounts and how long the walk is from a given location, AI Overview can answer all of it in a single query.
AI Revolution 2: Astra Assistant! Real-time video analysis and on-the-fly responses
Demis Hassabis, the head of Google DeepMind, took the I/O stage for the first time to show off Google’s “future AI assistant,” named Astra, which Google says can understand the dynamic, complex world much as a human does. Astra is also multimodal, including real-time analysis of video: it can quickly parse and respond to a changing scene, and it even has memory. In a demonstration, a user walked around while filming with their phone. Standing near a window, they asked Astra, “Where do you think I am right now?”; they then drew over code on a computer screen and asked where it could be improved. Near the end of the video they asked, “Do you remember where I put my glasses?” Astra went back over the frames from the previous few minutes, located the one containing the glasses, and answered: “Next to an apple.”
AI Revolution 3: Google Photos Search! AI helps find and organize photos
Google introduced the Ask Photos with Gemini feature in Google Photos, which uses image analysis to categorize objects in photos and assign relevant keyword tags. Users can quickly find photos with specific objects, such as their car’s license plate, or create albums related to their daughter’s swimming lessons. When asked, “When did my daughter learn the backstroke?” Gemini can quickly find relevant photos and provide the date as an answer.
AI Revolution 4: Android! Gemini enhances all experiences, including conversations and videos
Android is expected to become the best platform for experiencing Google’s AI. Gemini sits inside the Android operating system, ready to help in a variety of ways. During the keynote, Google showed it generating memes inside a chat conversation and answering questions about the rules of the sport in a video being watched. Gemini Advanced, the assistant’s premium tier, can take in an 80-page PDF and answer questions about it on the spot.
Because Gemini can process an enormous number of tokens at once, it can read an entire economics textbook within seconds and return a summary or answers to specific questions about it.
AI Revolution 5: Gemini Update! New model Flash, lighter and capable of processing a million tokens at once
The large language models behind all of these new features are the foundation of the announcements. Gemini, Google’s core AI model, now has two headline capabilities: it is “multimodal” and it handles “massive context,” processing up to a million tokens of text, images, and video at once. The newest member of the Gemini family, Gemini 1.5 Flash, is lighter and more efficient than Gemini 1.5 Pro while offering a comparable level of capability. Its context window also holds a million tokens, enough to analyze a document of roughly 1,500 pages or more than 30,000 lines of code in a single request. The lightweight model was built through “knowledge distillation,” making it better suited to developers who need speed and low cost.
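For readers who write software, calling the new Flash model looks much like calling any other Gemini model through Google’s generative AI SDK. The Python sketch below is only illustrative: it assumes the google-generativeai package and an API key from Google AI Studio, and the file name and prompt are hypothetical.

import google.generativeai as genai

# Assumption: an API key obtained from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# Gemini 1.5 Flash: the lightweight, low-cost model described above.
model = genai.GenerativeModel("gemini-1.5-flash")

# Hypothetical long document; the million-token window lets it fit in one request.
with open("economics_textbook.txt", encoding="utf-8") as f:
    book = f.read()

response = model.generate_content(
    ["Summarize the key arguments of this book in five bullet points.", book]
)
print(response.text)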
Gemini 1.5 Pro Update
Gemini 1.5 Pro, which was only announced in February, will also be upgraded later this year, with its context window doubling to 2 million tokens. That means it can take in two hours of video, 22 hours of audio, more than 60,000 lines of code, or over 1.4 million words of text in a single request.
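In practice, a developer can check whether a large input fits inside that window before sending it. The sketch below uses the same google-generativeai SDK as above and is only illustrative; the file name is hypothetical, and the 2,000,000-token limit is the expanded figure announced for Gemini 1.5 Pro.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key from Google AI Studio
model = genai.GenerativeModel("gemini-1.5-pro")

# Hypothetical dump of a large codebase or document collection.
with open("large_codebase_dump.txt", encoding="utf-8") as f:
    contents = f.read()

# Ask the API how many tokens the input uses before sending the full request.
token_count = model.count_tokens(contents).total_tokens
CONTEXT_WINDOW = 2_000_000  # upgraded limit announced for Gemini 1.5 Pro

if token_count <= CONTEXT_WINDOW:
    print(f"{token_count} tokens: fits in a single request")
else:
    print(f"{token_count} tokens: too large, split the input")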
AI Revolution 6: Veo Video Model! Generating videos from text prompts
For video generation, Google introduced Veo, a competitor to OpenAI’s Sora. Veo can generate high-quality 1080p videos longer than a minute from natural-language text prompts. It also understands cinematography and visual-effects terminology, and can work techniques such as time-lapse photography into what it creates.
OpenAI’s Sora, on the other hand, can generate complex scenes with multiple characters, specific action types, and many details. It not only understands various objects mentioned in the prompts but also knows how these objects exist in the real world, creating realistic and impressive scenes.
In addition, just a day before Google I/O, OpenAI announced the GPT-4o model, which combines GPT-4’s intelligence with powerful audio and video processing capabilities. It provides users with an interactive experience that closely resembles human interaction. GPT-4o can provide real-time translation during conversations, allowing smooth communication between people speaking different languages. It can also tell bedtime stories with a lively and expressive voice, as well as teach people how to solve simple math problems using a human-like tone.
According to OpenAI, GPT-4o can “read” the user’s facial expressions and tone, knowing how to respond and quickly switch between different tones, from a mechanical voice to an energetic singing voice.
With the two AI heavyweights unveiling their latest technology within two days of each other, this AI revolution will keep reshaping how people live.
Editor: Lin Meixin