· cloud computing gcp unstructured data

Coursera: Leveraging Unstructured Data with Cloud Dataproc on GCP


GCP Professional Data Engineer Certification >> Leveraging Unstructured Data with Cloud Dataproc


Modules & Lab Exercises

Note: These exercises were spun up in temporary cloud instances and thus are no longer available for viewing.

Module 1: Introduction to Cloud Dataproc

What Qualifies as Unstructured Data?

Dataproc Eases Hadoop Management

Cloud Dataproc Architecture

Lab 1: Create a Dataproc Cluster

Module 2: Running Dataproc jobs

Open-Source Tools on Dataproc

Data Center Power through GCP

Serverless Platform for Analytics Data Lifecycle Stages

Separation of Storage and Compute Enables Serverless

Separation of Storage and Compute for Spark Programs

Lab 2: Work with structured and semi-structured data

Lab 3: Submit Dataproc jobs for unstructured data

File information - road-not-taken.txt

This shows that the file fits in a single HDFS block. The Block Information pulldown lists only Block 0, and that block is replicated on both worker node 0 and worker node 1.
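As a rough illustration of why a small text file occupies only Block 0, the sketch below computes how many HDFS blocks a file needs and how many block replicas the cluster stores. The 128 MiB block size and replication factor of 2 are assumptions (common defaults, with the factor chosen to match the two worker nodes seen in the lab), and the 1,000-byte file size is a hypothetical stand-in, since the actual size of road-not-taken.txt is not recorded here.

```python
def hdfs_block_layout(file_size_bytes, block_size_bytes=128 * 1024 * 1024, replication=2):
    """Return (number of blocks, total block replicas stored on the cluster).

    block_size_bytes and replication are assumed defaults, not values
    read from the lab cluster.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    # Integer ceiling division: a partially filled block still counts as one block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    return num_blocks, num_blocks * replication


# Hypothetical small text file: one block, stored once on each of two workers.
blocks, replicas = hdfs_block_layout(1_000)
print(f"blocks={blocks}, replicas={replicas}")
```

Any file smaller than the block size yields a single block, which is why the pulldown only offers Block 0.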

Module 3: Leveraging GCP

Hadoop/Spark Jobs Read From BigQuery Through Temp GCS Storage

BigQuery Integration Using Python Pandas

Lab 4: Leverage GCP

Why would you want to use Cloud Storage instead of HDFS?

You can shut down the cluster when you are not running jobs. The storage persists even when the cluster is shut down, so you don’t have to pay for the cluster just to maintain data in HDFS. In some cases Cloud Storage provides better performance than HDFS. Cloud Storage does not require the administration overhead of a local file system.

Keeping all persistent data off-cluster makes the cluster stateless. This means (a) the cluster can be shut down when not in use, solving the Hadoop utilization problem, and (b) a cluster can be created and dedicated to a single job, solving the Hadoop configuration and tuning problem.
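The create, run, delete lifecycle of a stateless, single-job cluster can be sketched as below. This only builds the gcloud command lines and prints them; it does not run anything. The cluster name, region, bucket, and script path are hypothetical placeholders, and the helper function itself is illustrative rather than anything from the course labs.

```python
import shlex


def ephemeral_cluster_commands(cluster, region, job_uri, num_workers=2):
    """Build gcloud commands for a cluster dedicated to a single PySpark job.

    The job script (and its input and output) is assumed to live in
    Cloud Storage, so nothing is lost when the cluster is deleted.
    """
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", f"--num-workers={num_workers}"]
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
              f"--cluster={cluster}", f"--region={region}"]
    # --quiet skips the interactive confirmation prompt on delete.
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    return [create, submit, delete]


# Hypothetical names, for illustration only.
for cmd in ephemeral_cluster_commands(
        "demo-cluster", "us-central1", "gs://example-bucket/jobs/wordcount.py"):
    print(shlex.join(cmd))
```

Because the data stays in a gs:// bucket, the delete step removes only compute, and the cluster size can be chosen per job instead of tuned for a shared workload.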

Dataproc OSS on GCP

Lab 5: Cluster automation using CLI commands

Module 4: Analyzing Unstructured Data

Pretrained Models

Lab 6: Add Machine Learning

View my code on GitHub

Resources


About Me

I'm a data engineer working to advance data-driven cultures by integrating disparate data sources and empowering end users to uncover key insights that tell a bigger story.


