Machine Learning with Apache Spark
Premium
Intermediate Course
This course introduces the fundamentals of Machine Learning (ML) with Apache Spark, covering Spark Structured Streaming, ETL for ML Pipelines, and Spark ML. By the end of the course, you’ll have hands-on experience applying Spark skills to ETL and ML workflows.

Language
- English
Topic
- Big Data
Skills You Will Learn
- Unsupervised Learning, Machine Learning, Graph Theory, Data Engineering, Apache Spark, Batch Processing
Offered By
- IBMSkillsNetwork
Estimated Effort
- 15 Hours
Platform
- SkillsNetwork
Last Update
- December 18, 2024
About this Course
Apache® Spark™ is a fast, flexible, and developer-friendly open-source platform for large-scale SQL, batch processing, stream processing, and machine learning. Users can leverage its open-source ecosystem, speed, ease of use, and analytic capabilities to handle Big Data in innovative ways.
In this course, you will learn the fundamentals of Machine Learning (ML) and Generative AI (GenAI). You will cover the ML model lifecycle, and explore supervised and unsupervised learning. You will practice working with ML models for classification, regression, and clustering.
You will explore concepts and gain hands-on skills for using Spark in data engineering and machine learning applications. You’ll learn about Spark Structured Streaming, including data sources, output modes, and operations. Additionally, you will explore graph theory and see how GraphFrames extends Spark DataFrames with support for popular graph algorithms.
Learn how to use Spark for extract, transform, and load (ETL) processes, and practice your skills in the "ETL for Machine Learning Pipelines" lab.
Next, discover why Spark is favored by machine learning practitioners. You'll learn to create pipelines and implement features for data extraction, selection, and transformation on structured datasets. Explore classification and regression with Spark, understand supervised and unsupervised learning, and apply clustering techniques, including the k-means algorithm using Spark MLlib. Reinforce your knowledge through targeted hands-on labs and a final project that applies Spark to a real-world inspired problem.
What you will learn:
After completing this course, you will be able to:
- Describe the fundamentals of Machine Learning and Generative AI
- Differentiate between supervised and unsupervised machine learning
- Implement ML algorithms for classification, regression, and clustering using Python
- Explain the features, benefits, limitations, and applications of Apache Spark Structured Streaming
- Define graph theory and explain how GraphFrames benefits developers
- Describe how developers can apply extract, transform, and load (ETL) processes using Spark
- Explain how Spark ML supports machine learning development
- Apply Spark ML for regression and classification
- Explain how Spark ML uses clustering
- Demonstrate hands-on working knowledge of using Spark for ETL processes
Course Syllabus
Module 1 – Spark for Data Engineering
- Spark Structured Streaming
- GraphFrames on Apache Spark
- ETL Workloads
- ETL for ML Pipelines
Module 2 – Spark ML for Machine Learning
- Spark ML Fundamentals
- Spark ML Regression and Classification
- Spark ML Clustering
Module 3 – Final Project
- Setup & Practice Assignment
- Project Overview
- Final Assignment Project
- Final Quiz
General Information
- This course is self-paced.
- This platform works best with current versions of Chrome, Edge, Firefox, Internet Explorer, or Safari.
Recommended Skills Prior to Taking this Course
Before starting this course, ensure you have:
- Foundational Spark knowledge and skills, such as those gained from the IBM course titled "Spark and Hadoop Fundamentals for Big Data Analytics."
- Working knowledge of the Python programming language.
