Apache® Spark™ is a fast, flexible, and developer-friendly open-source platform for large-scale SQL, batch processing, stream processing, and machine learning. Users can leverage its open-source ecosystem, speed, ease of use, and analytic capabilities to handle Big Data in innovative ways.

In this course, you will learn the fundamentals of Machine Learning (ML) and Generative AI (GenAI). You will cover the ML model lifecycle, and explore supervised and unsupervised learning. You will practice working with ML models for classification, regression, and clustering.

You will explore concepts and gain hands-on skills for using Spark in data engineering and machine learning applications. You'll learn about Spark Structured Streaming, including data sources, output modes, and operations. You will also explore graph theory and see how GraphFrames builds on Spark DataFrames and supports popular graph algorithms.

Learn how to use Spark for extract, transform, and load (ETL) processes, and practice your skills in the "ETL for Machine Learning Pipelines" lab.

Next, discover why Spark is favored by machine learning practitioners. You'll learn to create pipelines and implement features for data extraction, selection, and transformation on structured datasets. Explore classification and regression with Spark, understand supervised and unsupervised learning, and apply clustering techniques, including the k-means algorithm using Spark MLlib. Reinforce your knowledge through targeted hands-on labs and a final project that applies Spark to a real-world inspired problem.

What you will learn:

After completing this course, you will be able to:

Describe the fundamentals of Machine Learning and Generative AI
Differentiate between supervised and unsupervised machine learning
Implement ML algorithms for classification, regression, and clustering using Python
Explain the features, benefits, limitations, and applications of Apache Spark Structured Streaming
Define graph theory and explain how GraphFrames benefits developers
Describe how developers can apply extract, transform, and load (ETL) processes using Spark
Explain how Spark ML supports machine learning development
Apply Spark ML for regression and classification
Explain how Spark ML uses clustering
Demonstrate hands-on working knowledge of using Spark for ETL processes

Course Syllabus

Module 1 – Spark for Data Engineering

Spark Structured Streaming
GraphFrames on Apache Spark
ETL Workloads
ETL for ML Pipelines

Module 2 – Spark ML for Machine Learning

Spark ML Fundamentals
Spark ML Regression and Classification
Spark ML Clustering

Module 3 – Final Project

Setup & Practice Assignment
Project Overview
Final Assignment Project
Final Quiz

General Information

This course is self-paced.
This platform works best with current versions of Chrome, Edge, Firefox, Internet Explorer, or Safari.

Recommended Skills Prior to Taking this Course

Before starting this course, ensure you have:

Foundational Spark knowledge and skills, such as those gained from the IBM course titled "Spark and Hadoop Fundamentals for Big Data Analytics."
Working knowledge of the Python programming language.

Machine Learning with Apache Spark
