Introduction to Big Data with Spark and Hadoop

This course is part of multiple programs.

IBM Data Engineering Professional Certificate
IBM Data Architecture Professional Certificate
NoSQL, Big Data, and Spark Foundations Specialization
Instructors: Aije Egwaikhide, Romeo Kienzler, Rav Ahuja

Instructor ratings

We asked all learners to give feedback on our instructors based on the quality of their teaching style.

4.3 (111 ratings)
Aije Egwaikhide, IBM: 6 courses, 728,972 learners
Romeo Kienzler, IBM: 10 courses, 767,985 learners
Rav Ahuja, IBM: 57 courses, 3,968,052 learners

65,681 already enrolled

Included with Coursera Plus

7 modules
Gain insight into a topic and learn the fundamentals.

4.4 (447 reviews)

Intermediate level
Recommended experience: Data literacy, Python, and SQL knowledge will be beneficial.

Flexible schedule
Approx. 19 hours
Learn at your own pace

92%: Most learners liked this course

What you'll learn

  • Explain the impact of big data, including use cases, tools, and processing methods.

  • Describe Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce.

  • Apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.

  • Use Spark’s RDDs and data sets, optimize Spark SQL using Catalyst and Tungsten, and use Spark’s development and runtime environment options.

Skills you'll gain

  • Data Processing
  • Apache Hadoop
  • IBM Cloud
  • Debugging
  • PySpark
  • Apache Hive
  • Distributed Computing
  • Data Transformation
  • Scalability
  • Performance Tuning
  • Docker (Software)
  • Big Data
  • Apache Spark
  • Kubernetes

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

14 assignments

Taught in English

Build your subject-matter expertise

This course is available as part of multiple programs.
When you enroll in this course, you'll also be asked to select a specific program.
  • Learn new concepts from industry experts
  • Gain a foundational understanding of a subject or tool
  • Develop job-relevant skills with hands-on projects
  • Earn a shareable career certificate

There are 7 modules in this course

This self-paced IBM course will teach you all about big data! You will become familiar with the characteristics of big data and its application in big data analytics. You will also gain hands-on experience with big data processing tools like Apache Hadoop and Apache Spark.

Bernard Marr defines big data as the digital trace that we are generating in this digital era. You will start the course by understanding what big data is and exploring how insights from big data can be harnessed for a variety of use cases. You’ll also explore how big data uses technologies like parallel processing, scaling, and data parallelism.

Next, you will learn about Hadoop, an open-source framework for the distributed processing of large data sets, and its ecosystem. You will discover important applications that go hand in hand with Hadoop, like the Hadoop Distributed File System (HDFS), MapReduce, and HBase. You will become familiar with Hive, data warehouse software that provides an SQL-like interface to efficiently query and manipulate large data sets.

You’ll then gain insights into Apache Spark, an open-source processing engine that provides users with new ways to store and use big data. In this course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the components that make up Apache Spark. You’ll learn about DataFrames, perform basic DataFrame operations, and work with Spark SQL. You’ll also explore how Spark processes and monitors the requests your application submits and how you can track work using the Spark Application UI.

This course has several hands-on labs to help you apply and practice the concepts you learn. You will complete Hadoop and Spark labs using various tools and technologies, including Docker, Kubernetes, Python, and Jupyter Notebooks.

In this module, you’ll begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data on everyday personal tasks and business transactions with Big Data Use Cases. You’ll also learn how Big Data uses parallel processing, scaling, and data parallelism. Going further, you’ll explore commonly used Big Data tools and explain the role of open-source in Big Data. Finally, you’ll go beyond the hype and explore additional Big Data viewpoints.

What's included

8 videos • 1 reading • 2 assignments • 2 plugins

8 videos•Total 47 minutes
  • Course Introduction•5 minutes
  • What is Big Data?•7 minutes
  • Impact of Big Data•5 minutes
  • Parallel Processing, Scaling, and Data Parallelism•7 minutes
  • Big Data Tools and Ecosystem•4 minutes
  • Open Source and Big Data•6 minutes
  • Beyond the Hype•4 minutes
  • Big Data Use Cases•5 minutes
1 reading•Total 2 minutes
  • Summary and Highlights: Introduction to Big Data•2 minutes
2 assignments•Total 41 minutes
  • Practice Quiz: Introduction to Big Data•14 minutes
  • Graded Quiz: What Is Big Data?•27 minutes
2 plugins•Total 27 minutes
  • Introduction to Emerging Big Data Technologies•15 minutes
  • Module 1 Glossary: What Is Big Data?•12 minutes
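
The data parallelism idea named in this module (split the data, apply the same operation to each piece, then combine the partial results) can be sketched in plain Python. This is an illustrative sketch only and not course material: the function names are hypothetical, and a thread pool stands in for the separate machines a real big data cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into num_chunks contiguous slices of near-equal size."""
    size, remainder = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The same operation is applied independently to every chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, num_workers=4):
    chunks = split_into_chunks(data, num_workers)
    # Threads are a stand-in here; a cluster would send chunks to worker nodes.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    # Combine the per-chunk results into one answer.
    return sum(partial_results)


numbers = list(range(1, 101))
total = parallel_sum_of_squares(numbers)
print(total)
```

Because each chunk is processed without reference to the others, adding workers (scaling out) does not change the result, only how the work is divided.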

In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications, including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. You’ll also gain practical skills in hands-on labs, where you’ll query data using Hive, launch a single-node Hadoop cluster using Docker, and run MapReduce jobs.

What's included

6 videos • 1 reading • 2 assignments • 3 app items • 2 plugins

6 videos•Total 37 minutes
  • Introduction to Hadoop•7 minutes
  • Intro to MapReduce•5 minutes
  • Hadoop Ecosystem•4 minutes
  • HDFS•8 minutes
  • HIVE•5 minutes
  • HBASE•5 minutes
1 reading•Total 2 minutes
  • Summary and Highlights: Introduction to Hadoop•2 minutes
2 assignments•Total 36 minutes
  • Practice Quiz: Introduction to Hadoop•12 minutes
  • Graded Quiz: Introduction to Hadoop Ecosystem•24 minutes
3 app items•Total 60 minutes
  • Hands-on Lab: Getting Started with Hive•20 minutes
  • Hands-on Lab: Hadoop MapReduce•20 minutes
  • Hands-on Lab: Hadoop Cluster (Optional)•20 minutes
2 plugins•Total 30 minutes
  • Cheat Sheet: Introduction to the Hadoop Ecosystem•15 minutes
  • Module 2 Glossary: Introduction to the Hadoop Ecosystem•15 minutes
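
As a rough illustration of the MapReduce model this module introduces, the classic word count can be written in plain Python as three phases: map, shuffle, and reduce. This is a single-process sketch with hypothetical function names; it does not use Hadoop's API, and a real job would run the map and reduce phases on many nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data with spark", "big data with hadoop", "spark and hadoop"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a framework distribute them across a cluster.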

In this module, you’ll turn your attention to the popular Apache Spark platform, where you will explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and lambda functions. You’ll also explore Resilient Distributed Datasets (RDDs), parallel programming, and resilience in Apache Spark, and relate RDDs and parallel programming to Apache Spark. Then, you’ll dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data calls for working with queries, including structured queries using SQL. You’ll also learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with Spark SQL.

What's included

5 videos • 1 reading • 2 assignments • 2 app items • 2 plugins

5 videos•Total 24 minutes
  • Why use Apache Spark?•5 minutes
  • Functional Programming Basics•5 minutes
  • Parallel Programming using Resilient Distributed Datasets•5 minutes
  • Scale out / Data Parallelism in Apache Spark•3 minutes
  • Dataframes and SparkSQL•4 minutes
1 reading•Total 2 minutes
  • Summary and Highlights: Introduction to Apache Spark•2 minutes
2 assignments•Total 31 minutes
  • Practice Quiz: Introduction to Apache Spark•10 minutes
  • Graded Quiz: Apache Spark•21 minutes
2 app items•Total 75 minutes
  • Practice Lab: Getting Started with PySpark and Pandas•60 minutes
  • Hands-on Lab: Getting Started with Spark using Python•15 minutes
2 plugins•Total 30 minutes
  • Cheat Sheet: Apache Spark•15 minutes
  • Module 3 Glossary: Apache Spark•15 minutes
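
The functional programming and lambda function ideas mentioned for this module can be previewed with Python's built-in `map` and `filter` and with `functools.reduce`. The sketch below uses only the standard library, not Spark, and the sample values are made up for illustration; the pattern of passing small functions to map, filter, and reduce is the same style that Spark's RDD operations of the same names follow.

```python
from functools import reduce

readings = [12, 7, 25, 3, 18]

# Transformations: each returns a new sequence and leaves the input untouched.
doubled = list(map(lambda x: x * 2, readings))
large = list(filter(lambda x: x > 20, doubled))

# Aggregation: fold the remaining values into a single number.
total = reduce(lambda a, b: a + b, large, 0)

print(doubled)
print(large)
print(total)
```

Because none of these steps modifies `readings`, the same input can be reused or recomputed safely, which is one reason this style suits distributed processing.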

In this module, you’ll learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of Datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. You’ll explore Apache Spark SQL optimization and learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Finally, you’ll fortify your skills with guided hands-on labs to create a table view and apply data aggregation techniques.

What's included

5 videos • 1 reading • 2 assignments • 2 app items • 4 plugins

5 videos•Total 25 minutes
  • RDDs in Parallel Programming and Spark•5 minutes
  • DataFrames and Datasets•4 minutes
  • Catalyst and Tungsten•5 minutes
  • ETL with DataFrames•6 minutes
  • Real-world usage of SparkSQL•4 minutes
1 reading•Total 2 minutes
  • Summary and Highlights: Introduction to DataFrames and Spark SQL•2 minutes
2 assignments•Total 31 minutes
  • Practice Quiz: Introduction to DataFrames & Spark SQL•10 minutes
  • Graded Quiz: DataFrames and Spark SQL•21 minutes
2 app items•Total 30 minutes
  • Hands-on Lab: Introduction to DataFrames•15 minutes
  • Hands-On Lab: Introduction to SparkSQL•15 minutes
4 plugins•Total 60 minutes
  • Reading: User-Defined Schema (UDS) for DSL and SQL•10 minutes
  • Reading: Common Transformations and Optimization Techniques in Spark•20 minutes
  • Cheat Sheet: DataFrames and Spark SQL•15 minutes
  • Module 4 Glossary: DataFrames and Spark SQL•15 minutes
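
The distinction between RDD transformations and actions covered in this module can be sketched with a toy class: transformations only record a step, and nothing runs until an action asks for a result. `LazyDataset` and `traced_square` are hypothetical names invented for this illustration; this is not Spark's implementation or API, only a minimal model of lazy evaluation.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a plan and runs it only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: return a new dataset with one more planned step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: execute the whole plan and return the results.
        return list(self._run())

    def count(self):
        # Action: execute the plan and count the surviving records.
        return sum(1 for _ in self._run())


calls = []


def traced_square(x):
    calls.append(x)  # record each invocation to show when work happens
    return x * x


dataset = LazyDataset(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # still 0: only a plan exists so far
result = dataset.collect()
print(calls_before_action, result)
```

Deferring the work until an action is what gives an engine the chance to inspect and optimize the whole plan before running it.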

In this module, you’ll explore how Spark processes the requests that your application submits and learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Spark cluster managers, their components, and benefits. You’ll also learn how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark instance. Next, you’ll learn about Apache Spark application submission, including the use of Spark’s unified interface, “spark-submit,” and learn about options and dependencies. You’ll also describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. You’ll also look at recommended practices for Spark's static and dynamic configuration options and perform hands-on labs to use Apache Spark on IBM Cloud and run Spark on Kubernetes.

What's included

6 videos • 2 readings • 3 assignments • 2 app items • 4 plugins

6 videos•Total 32 minutes
  • Apache Spark Architecture•5 minutes
  • Overview of Apache Spark Cluster Modes•6 minutes
  • How to Run an Apache Spark Application•6 minutes
  • Using Apache Spark on IBM Cloud•4 minutes
  • Setting Apache Spark Configuration•5 minutes
  • Running Spark on Kubernetes•4 minutes
2 readings•Total 4 minutes
  • Summary and Highlights: Spark Architecture•2 minutes
  • Summary and Highlights: Spark Runtime Environments•2 minutes
3 assignments•Total 33 minutes
  • Practice Quiz: Spark Architecture•6 minutes
  • Practice Quiz: Spark Runtime Environments•6 minutes
  • Graded Quiz: Development and Runtime Environment Options•21 minutes
2 app items•Total 80 minutes
  • Hands-on Lab: Submit Apache Spark Applications•60 minutes
  • Hands-on Lab: Apache Spark on Kubernetes•20 minutes
4 plugins•Total 40 minutes
  • Spark Environments - Overview and Options•5 minutes
  • How to Set Up Your Own Spark Environments (Optional)•5 minutes
  • Cheat Sheet: Development and Runtime Environment Options•15 minutes
  • Module 5 Glossary: Development and Runtime Environment Options•15 minutes
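
To make the “spark-submit” options mentioned in this module more concrete, the sketch below assembles a command line in Python without running it. The helper `build_spark_submit`, the file names `app.py` and `input.csv`, and the master host are hypothetical; the flags `--master`, `--deploy-mode`, and `--conf` and the properties `spark.executor.memory` and `spark.executor.cores` are standard Spark options, but the values shown are examples rather than recommendations.

```python
import shlex


def build_spark_submit(app_path, master, deploy_mode="client", conf=None, app_args=()):
    """Return the argument list for a spark-submit invocation (not executed here)."""
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Each configuration property becomes its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app_path)
    command += list(app_args)
    return command


# Local mode: run Spark in a single process using all available cores.
local_cmd = build_spark_submit("app.py", "local[*]")

# Standalone cluster manager: the host name below is a made-up example.
cluster_cmd = build_spark_submit(
    "app.py",
    "spark://spark-master.example.com:7077",
    conf={"spark.executor.memory": "2g", "spark.executor.cores": "2"},
    app_args=["input.csv"],
)

print(shlex.join(local_cmd))
print(shlex.join(cluster_cmd))
```

Only the `--master` value changes between the local and cluster invocations, which reflects the single, unified submission interface the module describes.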

Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the Apache Spark user interface web server and using the same UI web server to manage application processes. You’ll also identify common Apache Spark application issues and learn about debugging issues using the application UI and locating related log files. Further, you’ll discover and gain real-world knowledge about how Spark manages memory and processor resources in the hands-on lab.

What's included

5 videos • 1 reading • 2 assignments • 1 app item • 3 plugins

5 videos•Total 30 minutes
  • The Apache Spark User Interface•5 minutes
  • Monitoring Application Progress•7 minutes
  • Debugging Apache Spark Application Issues•5 minutes
  • Understanding Memory Resources•5 minutes
  • Understanding Processor Resources•5 minutes
1 reading•Total 2 minutes
  • Summary and Highlights: Introduction to Monitoring and Tuning•2 minutes
2 assignments•Total 31 minutes
  • Practice Quiz: Introduction to Monitoring and Tuning•10 minutes
  • Graded Quiz: Monitoring and Tuning•21 minutes
1 app item•Total 30 minutes
  • Hands-on Lab: Monitoring and Performance Tuning•30 minutes
3 plugins•Total 35 minutes
  • [Optional] Batch Data Ingestion Methods•5 minutes
  • Cheat Sheet: Monitoring and Tuning•15 minutes
  • Module 6 Glossary: Monitoring and Tuning•15 minutes
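
One way to reason about the processor resources this module discusses is simple arithmetic: the number of tasks that can run at once is bounded by the total executor cores, and a stage with more partitions than that runs in several rounds. The sketch below is a simplified model with hypothetical function names and example numbers; it assumes one CPU per task unless stated otherwise and ignores scheduling delays, data skew, and memory limits.

```python
import math


def task_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Tasks that can run at the same time across all executors."""
    return num_executors * (cores_per_executor // cpus_per_task)


def task_waves(num_partitions, slots):
    """Rounds of scheduling needed to run one task per partition."""
    return math.ceil(num_partitions / slots)


# Example numbers only: 3 executors with 4 cores each, 50 partitions.
slots = task_slots(num_executors=3, cores_per_executor=4)
waves = task_waves(num_partitions=50, slots=slots)
print(slots, waves)
```

With these example numbers the last round uses only 2 of the 12 slots, the kind of imbalance that monitoring the application UI can help reveal.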

In this module, you’ll perform a practice lab where you’ll explore two critical aspects of data processing using Spark: working with Resilient Distributed Datasets (RDDs) and constructing DataFrames from JSON data. You will also apply various transformations and actions on both RDDs and DataFrames to gain insights and manipulate the data effectively. Further, you’ll apply your knowledge in a final project where you will create a DataFrame by loading data from a CSV file and applying transformations and actions using Spark SQL. Finally, you’ll be assessed based on your learning from the course.

What's included

3 readings • 1 assignment • 2 app items • 2 plugins

3 readings•Total 5 minutes
  • Instructions for the Final Assessment•1 minute
  • Congratulations and Next Steps•2 minutes
  • Thanks from the Course Team•2 minutes
1 assignment•Total 100 minutes
  • Final Assessment•100 minutes
2 app items•Total 120 minutes
  • Practice Project: Data Processing Using Spark•60 minutes
  • Final Project: Data Analysis using Spark•60 minutes
2 plugins•Total 35 minutes
  • Final Project Overview•15 minutes
  • Glossary: Introduction to Big Data with Spark and Hadoop•20 minutes

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Offered by

IBM

At IBM, we know how rapidly tech evolves and recognize the crucial need for businesses and professionals to build job-ready, hands-on skills quickly. As a market-leading tech innovator, we’re committed to helping you thrive in this dynamic landscape. Through IBM Skills Network, our expertly designed training programs in AI, software development, cybersecurity, data science, business management, and more, provide the essential skills you need to secure your first job, advance your career, or drive business success. Whether you’re upskilling yourself or your team, our courses, Specializations, and Professional Certificates build the technical expertise that ensures you, and your organization, excel in a competitive world.

Learner reviews

4.4 (447 reviews)

  • 5 stars: 65.99%
  • 4 stars: 19.46%
  • 3 stars: 7.82%
  • 2 stars: 3.13%
  • 1 star: 3.57%

Showing 3 of 447

AA, 5 stars, reviewed on Jan 15, 2024
"Great program to explore more about AI and Big Data"

JS, 4 stars, reviewed on May 1, 2022
"hands on lab and quizzes at the end of each session was very helpful"

MG, 5 stars, reviewed on Jul 15, 2023
"Course was full of information and details for a beginner In big data technology"


Frequently asked questions

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

  • The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.

  • The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Certificate, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy.
