Developing Multimodal Generative AI Applications

Premium · Intermediate · Course

Learn multimodal AI by building applications that combine text, speech, images, and video using Whisper, DALL·E, Sora, Llama, Granite, Mixtral, Flask, and Gradio through hands-on labs and real projects.

Language

  • English

Topic

  • Artificial Intelligence

Industries

  • Information Technology

Skills You Will Learn

  • Application Development, Flask, Generative AI, Gradio, Mixtral, Multimodal AI

Offered By

  • IBM Skills Network

Estimated Effort

  • 7 hours

Platform

  • Skills Network

Last Update

  • March 30, 2026

About this Course

Unlock the power of multimodal AI and learn how modern systems combine text, images, speech, and video to create intelligent applications. This course teaches the foundational concepts behind multimodal GenAI applications, the challenges of integrating diverse data types, and the techniques used to build advanced, interactive systems. You’ll develop core skills in transcription, text-to-speech, image generation, video synthesis, and multimodal reasoning.

After completing this course, you will be able to:

  • Gain the job-ready skills you need to build multimodal generative AI applications in just a few hours
  • Understand the fundamental concepts and challenges in multimodal AI, including the integration of text, speech, images, and video
  • Build multimodal AI applications using state-of-the-art models and frameworks such as IBM Granite, Meta’s Llama, OpenAI Whisper, DALL·E, and Sora
  • Develop multimodal AI solutions, including chatbots and image/video generation models, using IBM watsonx.ai, Hugging Face, Flask, and Gradio
  • Apply multimodal search, retrieval, and question answering techniques to solve practical problems
  • Design and deploy full-stack multimodal systems that combine audio, vision, and language models

Through hands-on labs, you’ll work with Generative AI models like IBM Granite, OpenAI Whisper, DALL·E, Sora, Meta’s Llama, Mixtral, and vision-language architectures to apply multimodal AI in practical scenarios. You’ll build tools such as captioning systems, video-from-text generators, and AI-powered assistants that can process and respond across multiple data streams.

The course includes full-stack projects using Python, Flask, and Gradio, where you’ll design and deploy complete multimodal AI applications. By the end, you’ll have the technical skills needed to create next-generation AI systems used in search engines, chatbots, creative tools, and enterprise applications.
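To give a sense of the full-stack scaffolding these projects use, here is a minimal Flask sketch. The `/caption` endpoint and the stubbed `generate_caption` helper are illustrative assumptions, not code from the course labs; in a lab, the stub would be replaced by a real vision-language model call.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_caption(image_url: str) -> str:
    # Stub: in a real lab this would call a vision-language model
    # (e.g. via watsonx.ai); here we just return a placeholder.
    return f"A caption for {image_url}"

@app.route("/caption", methods=["POST"])
def caption():
    # Accept a JSON body like {"image_url": "..."} and return a caption.
    data = request.get_json(force=True)
    return jsonify({"caption": generate_caption(data["image_url"])})

# Start locally with: flask --app app run
```

The same pattern (one route per model capability, JSON in and out) extends naturally to speech and video endpoints.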

The following skills are required to succeed in this course:

  • Prior knowledge of Python programming, Flask, Gradio, and LangChain

Course Syllabus

Module 1: Foundations of Multimodal AI
  • Module Summary and Learning Objectives
  • Welcome to the Course
    • Video: Course Introduction
    • Reading: Course Overview
    • Plugin/Reading: Helpful Tips for Course Completion
    • RAG and Agentic AI Professional Certificate Overview
  • Introduction to Multimodal AI: Text and Speech Processing
    • Video: Introduction to Multimodal AI
    • Reading: What Is Multimodal Generative AI and Why Does It Matter?
    • Reading: What Is Computer Vision?
    • Video: Text-to-Speech Technologies
    • Video: Speech-to-Text Technologies
    • Reading: Text Processing, Speech Processing, and TTS
    • Lab: Use Mixtral and gTTS to Create Your Personal Storyteller
    • Reading: Challenges in Multimodal AI Integration
    • Lab: Build a Meeting Assistant with Whisper, LangChain, & Gradio
    • Practice Quiz: Introduction to Multimodal AI: Text and Speech Processing
  • Module Summary and Evaluation
    • Reading: Summary and Highlights
    • Cheat Sheet: Foundations of Multimodal AI
    • Graded Quiz: Foundations of Multimodal AI
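
The Module 1 labs chain a text model with speech synthesis. As a minimal sketch of that storyteller pipeline shape, the snippet below stubs out both calls (the function names and fake backends are illustrative; the lab itself uses Mixtral for text and gTTS for audio):

```python
def tell_story(prompt, generate_text, synthesize_speech):
    """Storyteller pipeline: LLM text generation followed by TTS.

    `generate_text` and `synthesize_speech` stand in for real calls
    (e.g. a Mixtral completion and gTTS) so the data flow is clear:
    prompt -> story text -> audio bytes.
    """
    story = generate_text(prompt)
    return synthesize_speech(story)

# Stub backends so the pipeline runs without any model or network access:
fake_llm = lambda p: f"Once upon a time, {p}."
fake_tts = lambda text: text.encode("utf-8")  # a real TTS returns audio bytes

audio = tell_story("a robot learned to paint", fake_llm, fake_tts)
```

Swapping the stubs for real model clients changes nothing about the pipeline's structure, which is the point of the lab.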

Module 2: Integrating Visual and Video Modalities
  • Module Summary and Learning Objectives
  • Image Generation and Captioning
    • Video: Understanding Image Captioning with Llama 3.2
    • Reading: Introduction to Text-to-Video and Image-to-Video Technologies
    • Demo Video: Text-to-Video/Image-to-Video Models
    • Lab: DALL·E Image Generation Guide for Beginners
    • Reading: Strengths, Limitations, and Practical Applications of Multimodal Vision Models in Real World Scenarios
    • Lab: Build an Image Captioning System with watsonx and Meta’s Llama
    • Practice Quiz: Image Generation and Captioning
  • Module Summary and Evaluation
    • Reading: Summary and Highlights
    • Cheat Sheet: Integrating Visual and Video Modalities
    • Graded Quiz: Integrating Visual and Video Modalities
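
Image captioning in Module 2 revolves around sending an image and a text instruction to a vision-language model in one request. The sketch below shows the widely used chat-completions shape with a base64 data URL; the exact schema for watsonx.ai or Llama endpoints may differ, so treat the field names as assumptions:

```python
import base64

def build_caption_request(image_bytes, instruction="Describe this image."):
    """Package an image plus instruction as one multimodal chat message.

    Follows the common chat-completions convention where image content
    is embedded as a base64 data URL alongside the text instruction.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }]
    }

req = build_caption_request(b"\x89PNG...")  # placeholder bytes, not a real PNG
```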

Module 3: Advanced Multimodal Applications
  • Module Summary and Learning Objectives
  • Build Advanced Multimodal Applications
    • Introduction to Multimodal Retrieval-Augmented Generation
    • Lab: Build a Style Finder Using Multimodal Retrieval and Search
    • Video: Multimodal Chatbots and QA Systems
    • Lab: Build Your First GenAI-Powered Image-Based Web Application
    • Practice Quiz: Build Advanced Multimodal Applications
  • Module Summary and Evaluation
    • Reading: Summary and Highlights
    • Cheat Sheet: Advanced Multimodal Applications
    • Graded Quiz: Advanced Multimodal Applications
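
The multimodal retrieval labs in Module 3 rest on one idea: embed queries and catalog items into a shared vector space, then rank by cosine similarity. A minimal sketch, where toy 3-dimensional vectors stand in for real CLIP-style embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, catalog, top_k=1):
    """Rank (name, embedding) catalog items by similarity to the query."""
    ranked = sorted(catalog, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy embeddings; a real system would use one shared text/image encoder
# so that a text query can match image items directly.
catalog = [("red dress", [0.9, 0.1, 0.0]),
           ("blue jeans", [0.1, 0.9, 0.0]),
           ("red scarf", [0.8, 0.2, 0.1])]
best = retrieve([1.0, 0.0, 0.0], catalog)  # query vector for "red outfit"
```

In the style-finder lab, the same ranking step sits behind an image upload rather than a hand-written query vector.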

Course Wrap-Up
  • Course Wrap-Up
  • Reading: Congratulations and Next Steps
  • Thanks from the Course Team