Udemy - Apache Spark 2.0 with Scala
Description
- Learn the concepts of Spark’s Resilient Distributed Datasets (RDDs)
- Get a crash course in the Scala programming language
- Develop and run Spark jobs quickly using Scala
- Translate complex analysis problems into iterative or multi-stage Spark scripts
- Scale up to larger data sets using Amazon’s Elastic MapReduce service
- Understand how Hadoop YARN distributes Spark across computing clusters
- Practice using other Spark technologies, like Spark SQL, DataFrames, DataSets, Spark Streaming, and GraphX
What I got from this course
- Frame big data analysis problems as Apache Spark scripts
- Develop distributed code using the Scala programming language
- Optimize Spark jobs through partitioning, caching, and other techniques
- Build, deploy, and run Spark scripts on Hadoop clusters
- Process continual streams of data with Spark Streaming
- Transform structured data using SparkSQL and DataFrames
- Traverse and analyze graph structures using GraphX
Target Audience
- Data engineers who want to expand their skills into the world of big data processing on a cluster
What is Spark?
- A fast and general engine for large-scale data processing
- Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Built around one main concept: the Resilient Distributed Dataset (RDD) – basically an abstraction over a giant set of data
- Spark context - created by your driver program
- Creates RDDs and is responsible for making them resilient and distributed!
- Spark shell creates a “sc” object for you
- Nothing actually happens in an RDD until you call an action on it (lazy evaluation)
- Spark has special operations for key/value data (reduceByKey(), groupByKey(), sortByKey(), keys(), values())
- You can do SQL-style joins on two key/value RDDs (see the sketch after this list)
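A minimal sketch tying these points together, with made-up data and app name: transformations stay lazy until an action such as collect() runs, reduceByKey() aggregates values by key, and join() matches two key/value RDDs on their keys.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // The Spark shell gives you "sc" for free; a standalone script creates it.
    val conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations are lazy: nothing executes yet.
    val ratings = sc.parallelize(Seq((1, 5.0), (2, 3.0), (1, 4.0))) // (userId, rating)
    val totals = ratings.reduceByKey(_ + _)                         // still lazy

    // collect() is an action: this is where the job actually runs.
    totals.collect().foreach(println)

    // SQL-style join on two key/value RDDs, matched on the key (userId).
    val names = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))
    names.join(totals).collect().foreach(println)                   // (userId, (name, total))

    sc.stop()
  }
}
```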
Why Scala for Spark?
- Spark itself is written in Scala
- Scala’s functional programming model is a good fit for distributed processing
- Gives you fast performance (Scala compiles to Java bytecode)
- Less code & boilerplate stuff than Java
- Python is slower in comparison (PySpark’s RDD API pays extra serialization overhead moving data between the JVM and Python)
Key Concepts
- Similarity metric algorithms (Pearson and Jaccard are sketched after this list):
- Item-based collaborative filtering - used for recommending similar items by finding relationships based on user behavior
- Pearson correlation coefficient
- Jaccard coefficient
- Conditional probability
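A plain-Scala sketch of two of those metrics, with hypothetical inputs; a real script would also guard the Pearson case where a vector is constant (zero standard deviation).

```scala
// Pearson correlation of two equal-length rating vectors.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  val meanX = x.sum / x.length
  val meanY = y.sum / y.length
  val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
  val sdX = math.sqrt(x.map(a => (a - meanX) * (a - meanX)).sum)
  val sdY = math.sqrt(y.map(b => (b - meanY) * (b - meanY)).sum)
  cov / (sdX * sdY)
}

// Jaccard coefficient: |intersection| / |union| of the sets of users who rated each item.
def jaccard(usersA: Set[Int], usersB: Set[Int]): Double =
  usersA.intersect(usersB).size.toDouble / usersA.union(usersB).size

println(pearson(Seq(5.0, 3.0, 4.0), Seq(4.0, 2.0, 5.0)))
println(jaccard(Set(1, 2, 3), Set(2, 3, 4))) // 2 / 4 = 0.5
```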
How the Directed Acyclic Graph (DAG) works under the hood on Resilient Distributed Datasets (RDDs)
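You can peek at that DAG directly: toDebugString prints an RDD’s lineage, the chain of transformations Spark replays to rebuild lost partitions, with stages split at shuffle boundaries. A small sketch, assuming an existing sc (e.g. in the Spark shell) and a hypothetical input file:

```scala
val words = sc.textFile("book.txt").flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.toDebugString) // the shuffle for reduceByKey marks a stage boundary
```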
DataFrames (for working with structured data; see the sketch after this list):
- Contain Row objects
- Can run SQL queries
- Have a schema (leading to more efficient storage)
- Read and write to JSON, Hive, Parquet
- Communicate with JDBC/ODBC, Tableau
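A minimal sketch of that workflow; the people.json file and its name/age columns are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrameDemo").master("local[*]").getOrCreate()

val people = spark.read.json("people.json") // schema is inferred from the JSON
people.printSchema()

people.createOrReplaceTempView("people")    // expose the DataFrame as a SQL table
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 21")
adults.show()

adults.write.parquet("adults.parquet")      // write the result back out as Parquet
```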
The trend in Spark is to use RDDs less, and DataSets more.
DataSets are more efficient
- They can be serialized very efficiently - even better than Kryo
- Optimal execution plans can be determined at compile time
DataSets allow for better interoperability
- MLlib and Spark Streaming are moving towards using DataSets instead of RDDs for their primary APIs
DataSets simplify development
- You can perform most SQL operations on a Dataset with one line (see the sketch below)
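For example, a typed Dataset sketch (the Person case class and people.json input are assumptions): fields and types are checked at compile time, and the filter is an ordinary one-liner.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("DatasetDemo").master("local[*]").getOrCreate()
import spark.implicits._ // encoders for case classes

val people = spark.read.json("people.json").as[Person] // DataFrame -> typed Dataset[Person]
val adults = people.filter(p => p.age >= 21)           // compile-time checked one-liner
adults.show()
```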
Structured Streaming uses DataSets as its primary API. Imagine a DataSet that keeps getting appended to forever; you can query it whenever you want (sketched below).
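A minimal Structured Streaming sketch along those lines, assuming text is arriving on a local socket (e.g. from `nc -lk 9999`): the stream is an unbounded Dataset that you query like any other, and each trigger updates the running result.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StreamingWordCount").master("local[*]").getOrCreate()
import spark.implicits._

// Each line arriving on the socket is appended to this unbounded DataFrame.
val lines = spark.readStream.format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Query it like any other Dataset: a running word count.
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

counts.writeStream
  .outputMode("complete") // re-emit the full counts table on every trigger
  .format("console")
  .start()
  .awaitTermination()
```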
Hands-On Exercises
Creating Similar Movies from One Million Ratings on EMR
Twitter Streaming with Spark and Scala
Course Progress
100% - completed 3/26/18.
About Me
I'm a data leader working to advance data-driven cultures by wrangling disparate data sources and empowering end users to uncover key insights that tell a bigger story.