Developing Multimodal Generative AI Applications

Premium · Intermediate · Course

Learn multimodal AI by building applications that combine text, speech, images, and video using Whisper, DALL·E, Sora, Llama, Granite, Mixtral, Flask, and Gradio through hands-on labs and real projects.

Language

  • English

Topic

  • Artificial Intelligence

Industries

  • Information Technology

Skills You Will Learn

  • Application Development, Flask, Generative AI, Gradio, Mixtral, Multimodal AI

Offered By

  • IBM Skills Network

Estimated Effort

  • 7 hours

Platform

  • Skills Network

Last Update

  • March 30, 2026

About this Course

Unlock the power of multimodal AI and learn how modern systems combine text, images, speech, and video to create intelligent applications. This course teaches the foundational concepts behind multimodal GenAI applications, the challenges of integrating diverse data types, and the techniques used to build advanced, interactive systems. You’ll develop core skills in transcription, text-to-speech, image generation, video synthesis, and multimodal reasoning.

After completing this course, you will be able to:

  • Gain the job-ready skills you need to build multimodal generative AI applications in just a few hours
  • Understand the fundamental concepts and challenges in multimodal AI, including the integration of text, speech, images, and video
  • Build multimodal AI applications using state-of-the-art models and frameworks such as IBM Granite, Meta’s Llama, OpenAI Whisper, DALL·E, and Sora
  • Develop multimodal AI solutions, including chatbots and image/video generation models, using IBM watsonx.ai, Hugging Face, Flask, and Gradio
  • Apply multimodal search, retrieval, and question answering techniques to solve practical problems
  • Design and deploy full-stack multimodal systems that combine audio, vision, and language models

Through hands-on labs, you’ll work with Generative AI models like IBM Granite, OpenAI Whisper, DALL·E, Sora, Meta’s Llama, Mixtral, and vision-language architectures to apply multimodal AI in practical scenarios. You’ll build tools such as captioning systems, video-from-text generators, and AI-powered assistants that can process and respond across multiple data streams.

The course includes full-stack projects using Python, Flask, and Gradio, where you’ll design and deploy complete multimodal AI applications. By the end, you’ll have the technical skills needed to create next-generation AI systems used in search engines, chatbots, creative tools, and enterprise applications.
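To give a sense of the full-stack scaffolding these projects use, here is a minimal Flask sketch. The `/caption` endpoint and the stubbed `generate_caption` helper are illustrative assumptions, not code from the course labs; in a lab, the stub would be replaced by a real vision-language model call.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_caption(image_url: str) -> str:
    # Stub: in a real lab this would call a vision-language model
    # (e.g. via watsonx.ai); here we just return a placeholder.
    return f"A caption for {image_url}"

@app.route("/caption", methods=["POST"])
def caption():
    # Accept a JSON body like {"image_url": "..."} and return a caption.
    data = request.get_json(force=True)
    return jsonify({"caption": generate_caption(data["image_url"])})

# Start locally with: flask --app app run
```

The same pattern (one route per model capability, JSON in and out) extends naturally to speech and video endpoints.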

The following skills are required to succeed in this course:

  • Prior knowledge of Python programming, Flask, Gradio, and LangChain

Course Syllabus

Module 1: Foundations of Multimodal AI
  • Module Summary and Learning Objectives
  • Welcome to the Course
    • Video: Course Introduction
    • Reading: Course Overview
    • Plugin/Reading: Helpful Tips for Course Completion
    • RAG and Agentic AI Professional Certificate Overview
  • Introduction to Multimodal AI: Text and Speech Processing
    • Video: Introduction to Multimodal AI
    • Reading: What Is Multimodal Generative AI and Why Does It Matter?
    • Reading: What Is Computer Vision?
    • Video: Text-to-Speech Technologies
    • Video: Speech-to-Text Technologies
    • Reading: Text Processing, Speech Processing, and TTS
    • Lab: Use Mixtral and gTTS to Create Your Personal Storyteller
    • Reading: Challenges in Multimodal AI Integration
    • Lab: Build a Meeting Assistant with Whisper, LangChain, & Gradio
    • Practice Quiz: Introduction to Multimodal AI: Text and Speech Processing
  • Module Summary and Evaluation
    • Reading: Summary and Highlights
    • Cheat Sheet: Foundations of Multimodal AI
    • Graded Quiz: Foundations of Multimodal AI
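
The Module 1 labs chain a text model with speech synthesis. As a minimal sketch of that storyteller pipeline shape, the snippet below stubs out both calls (the function names and fake backends are illustrative; the lab itself uses Mixtral for text and gTTS for audio):

```python
def tell_story(prompt, generate_text, synthesize_speech):
    """Storyteller pipeline: LLM text generation followed by TTS.

    `generate_text` and `synthesize_speech` stand in for real calls
    (e.g. a Mixtral completion and gTTS) so the data flow is clear:
    prompt -> story text -> audio bytes.
    """
    story = generate_text(prompt)
    return synthesize_speech(story)

# Stub backends so the pipeline runs without any model or network access:
fake_llm = lambda p: f"Once upon a time, {p}."
fake_tts = lambda text: text.encode("utf-8")  # a real TTS returns audio bytes

audio = tell_story("a robot learned to paint", fake_llm, fake_tts)
```

Swapping the stubs for real model clients changes nothing about the pipeline's structure, which is the point of the lab.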

Module 2: Integrating Visual and Video Modalities
  • Module Summary and Learning Objectives
  • Image Generation and Captioning
    • Video: Understanding Image Captioning with Llama 3.2
    • Reading: Introduction to Text-to-Video and Image-to-Video Technologies
    • Demo Video: Text-to-Video/Image-to-Video Models
    • Lab: DALL·E Image Generation Guide for Beginners
    • Reading: Strengths, Limitations, and Practical Applications of Multimodal Vision Models in Real World Scenarios
    • Lab: Build an Image Captioning System with watsonx and Meta’s Llama
    • Practice Quiz: Image Generation and Captioning
  • Module Summary and Evaluation
    • Reading: Summary and Highlights
    • Cheat Sheet: Integrating Visual and Video Modalities
    • Graded Quiz: Integrating Visual and Video Modalities
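
Image captioning in Module 2 revolves around sending an image and a text instruction to a vision-language model in one request. The sketch below shows the widely used chat-completions shape with a base64 data URL; the exact schema for watsonx.ai or Llama endpoints may differ, so treat the field names as assumptions:

```python
import base64

def build_caption_request(image_bytes, instruction="Describe this image."):
    """Package an image plus instruction as one multimodal chat message.

    Follows the common chat-completions convention where image content
    is embedded as a base64 data URL alongside the text instruction.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }]
    }

req = build_caption_request(b"\x89PNG...")  # placeholder bytes, not a real PNG
```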

Module 3: Advanced Multimodal Applications
  • Module Summary and Learning Objectives
  • Build Advanced Multimodal Applications
    • Introduction to Multimodal Retrieval-Augmented Generation
    • Lab: Build a Style Finder Using Multimodal Retrieval and Search
    • Video: Multimodal Chatbots and QA Systems
    • Lab: Build Your First GenAI-Powered Image-Based Web Application
    • Practice Quiz: Build Advanced Multimodal Applications
  • Module Summary and Evaluation
    • Reading: Summary and Highlights
    • Cheat Sheet: Advanced Multimodal Applications
    • Graded Quiz: Advanced Multimodal Applications
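
The multimodal retrieval labs in Module 3 rest on one idea: embed queries and catalog items into a shared vector space, then rank by cosine similarity. A minimal sketch, where toy 3-dimensional vectors stand in for real CLIP-style embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, catalog, top_k=1):
    """Rank (name, embedding) catalog items by similarity to the query."""
    ranked = sorted(catalog, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy embeddings; a real system would use one shared text/image encoder
# so that a text query can match image items directly.
catalog = [("red dress", [0.9, 0.1, 0.0]),
           ("blue jeans", [0.1, 0.9, 0.0]),
           ("red scarf", [0.8, 0.2, 0.1])]
best = retrieve([1.0, 0.0, 0.0], catalog)  # query vector for "red outfit"
```

In the style-finder lab, the same ranking step sits behind an image upload rather than a hand-written query vector.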

Course Wrap-Up
  • Course Wrap-Up
  • Reading: Congratulations and Next Steps
  • Thanks from the Course Team