Spark and Hadoop for Big Data Analytics
Premium
Intermediate Course

This course provides foundational knowledge and analytical skills for big data practitioners using popular big data tools, including Hadoop and Spark. You will learn and practice your big data skills through hands-on activities.

Language
- English
Topic
- Big Data
Skills You Will Learn
- Big Data, SQL, Apache Hadoop, Apache Spark, Apache Hive, Spark Streaming
Offered By
- IBMSkillsNetwork
Estimated Effort
- 18 Hours
Platform
- SkillsNetwork
Last Update
- February 7, 2025
About this Course
Organizations seek skilled, forward-thinking Big Data practitioners who can apply both business and technical expertise to unstructured data such as tweets, posts, images, audio, video, sensor data, and satellite imagery, gaining insights into the behaviors and preferences of prospects, clients, competitors, and others.
This course introduces fundamental Big Data concepts and practices. You will gain an understanding of Big Data's characteristics, features, benefits, and limitations, and explore various Big Data processing tools. The course delves into how Hadoop, Hive, and Spark can assist organizations in overcoming Big Data challenges and leveraging its potential.
Hadoop, an open-source framework, facilitates distributed processing of extensive datasets across computer clusters using straightforward programming models. Each node within the cluster provides local computation and storage, optimizing dataset processing efficiency. Hive, a data warehousing software, offers an SQL-like interface for efficient querying and manipulation of large datasets across diverse databases and file systems integrated with Hadoop.
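To give a flavor of Hive's SQL-like interface, here is a minimal sketch (not taken from the course labs) that issues a HiveQL-style query through PySpark's Hive support; the session setup, the `sales` table, and its columns are illustrative assumptions rather than course content.

```python
from pyspark.sql import SparkSession

# Minimal sketch of Hive-style, SQL-like querying, issued here through
# PySpark's Hive support. Assumes a Spark build with Hive support enabled
# and a pre-existing "sales" table; the table and columns are hypothetical.
spark = (SparkSession.builder
         .appName("hive-query-sketch")
         .enableHiveSupport()
         .getOrCreate())

top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
    LIMIT 10
""")
top_regions.show()
```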
Apache Spark, an open-source processing engine, emphasizes speed, ease of use, and advanced analytics, and has transformed how organizations process and analyze big data.
Throughout the course, you will learn to harness Spark to deliver dependable insights. The curriculum covers Apache Spark's components in depth, highlighting the Resilient Distributed Datasets (RDDs) that enable parallel processing across the nodes of a Spark cluster.
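As a rough illustration of the RDD model described above (not part of the course materials), the following sketch distributes a collection across partitions and aggregates it in parallel; the local master setting and the partition count are arbitrary choices for the example.

```python
from pyspark.sql import SparkSession

# Minimal RDD sketch: the data is split across partitions, and the map and
# reduce steps run in parallel on the cluster (or on local cores with local[*]).
spark = (SparkSession.builder
         .master("local[*]")
         .appName("rdd-sketch")
         .getOrCreate())
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1_000_001), numSlices=8)  # 8 partitions
sum_of_squares = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)

print(sum_of_squares)
spark.stop()
```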
Practical skills you will learn in this course include data analysis with Hadoop MapReduce, PySpark, and Spark SQL, as well as building streaming analytics applications with Spark Streaming.
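For a sense of what such a streaming application can look like, here is a minimal word-count sketch in the classic Spark Streaming (DStream) style; it is illustrative rather than course material, and the socket source, host, and port are placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Minimal Spark Streaming sketch using the classic DStream API (newer Spark
# releases favor Structured Streaming): a running word count over text
# arriving on a socket. The host and port are placeholders for illustration.
sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```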
What you will learn:
After completing this course, you will be able to:
- Describe Big Data, its impact, processing methods and tools, and use cases.
- Explain Hadoop architecture, ecosystem, practices, and applications, including the Hadoop Distributed File System (HDFS), HBase, Spark, and MapReduce.
- Explain Spark programming basics, including parallel programming, DataFrames, Datasets, and SparkSQL.
- Explain how Spark uses RDDs, creates Datasets, and uses Catalyst and Tungsten to optimize SparkSQL.
- Work with Apache Spark development and runtime environment options.
Course Syllabus
Module 1: What is Big Data?
- Module Introduction and Learning Objectives
- What is Big Data?
- Impact of Big Data
- Parallel Processing, Scaling, and Data Parallelism
- Big Data Tools and Ecosystem
- Open Source and Big Data
- Beyond the Hype
- Big Data Use Cases
- Summary & Highlights: Introduction to Big Data
- Practice Quiz: Introduction to Big Data
- Module 1 Glossary: What is Big Data?
- Graded Quiz: What is Big Data?
Module 2: Introduction to the Hadoop Ecosystem
- Module Introduction and Learning Objectives
- Introduction to Hadoop
- Intro to MapReduce
- Hadoop Ecosystem
- HDFS
- Hive
- Hands-on Lab: Getting Started with Hive
- HBase
- Hands-on Lab: Hadoop MapReduce
- Summary & Highlights: Introduction to Hadoop
- Practice Quiz: Introduction to Hadoop
- Cheat Sheet: Introduction to the Hadoop Ecosystem
- Module 2 Glossary: Introduction to the Hadoop Ecosystem
- Graded Quiz: Introduction to the Hadoop Ecosystem
Module 3: Apache Spark
- Module Introduction and Learning Objectives
- Why use Apache Spark?
- Functional Programming Basics
- Parallel Programming using Resilient Distributed Datasets
- Scale out / Data Parallelism in Apache Spark
- Dataframes and SparkSQL
- Hands-on Lab: Getting Started with Spark using Python
- Summary & Highlights: Introduction to Apache Spark
- Practice Quiz: Introduction to Apache Spark
- Cheat Sheet: Apache Spark
- Module 3 Glossary: Apache Spark
- Graded Quiz: Apache Spark
Module 4: DataFrames and SparkSQL
- Module Introduction and Learning Objectives
- RDDs in Parallel Programming and Spark
- DataFrames and Datasets
- Catalyst and Tungsten
- ETL with DataFrames
- Hands-on Lab: Introduction to DataFrames
- Real-world usage of SparkSQL
- Common Transformations and Optimization Techniques in Spark
- Hands-on Lab: Introduction to SparkSQL
- Summary & Highlights: Introduction to DataFrames & SparkSQL
- Practice Quiz: Introduction to DataFrames & SparkSQL
- Cheat Sheet: DataFrames & SparkSQL
- Module 4 Glossary: DataFrames & SparkSQL
- Graded Quiz: DataFrames & SparkSQL
Module 5: Development and Runtime Environment Options
- Module Introduction and Learning Objectives
- Apache Spark Architecture
- Overview of Apache Spark Cluster Modes
- How to Run an Apache Spark Application
- Hands-on Lab: Submit Apache Spark Applications
- Summary & Highlights: Spark Architecture
- Practice Quiz: Spark Architecture
- Overview of Apache Spark Environment Options
- Using Apache Spark on IBM Cloud
- How to Set Up Your Own Spark Environment (Optional)
- Setting Apache Spark Configuration
- Running Spark on Kubernetes
- Hands-on Lab: Apache Spark on Kubernetes
- Summary & Highlights: Spark Runtime Environments
- Practice Quiz: Spark Runtime Environments
- Cheat Sheet: Development and Runtime Environment Options
- Module 5 Glossary: Development and Runtime Environment Options
- Graded Quiz: Development and Runtime Environment Options
Module 6: Monitoring & Tuning
- Module Introduction and Learning Objectives
- The Apache Spark User Interface
- Monitoring Application Progress
- Debugging Apache Spark Application Issues
- Understanding Memory Resources
- Understanding Processor Resources
- Hands-on Lab: Monitoring and Performance Tuning
- Summary and Highlights: Introduction to Monitoring & Tuning
- Practice Quiz: Introduction to Monitoring & Tuning
- Cheat Sheet: Monitoring & Tuning
- Module 6 Glossary: Monitoring and Tuning
- Graded Quiz: Monitoring & Tuning
Module 7: Final Project and Assessment
- Module Introduction and Learning Objectives
- Final Project: Data Processing using Spark
- Final Exam Instructions
- Final Exam
- Course Rating
- Badges Frequently Asked Questions
- Claim badge here
- Introduction to Big Data with Spark and Hadoop Glossary
- Congratulations and Next Steps
- Team and Acknowledgements
- Copyrights and Trademarks
General Information
- This course is self-paced.
- This platform works best with current versions of Chrome, Edge, Firefox, Internet Explorer, or Safari.
Recommended Skills Prior to Taking this Course
- Computer and data literacy.
- Familiarity with SQL and databases.
