Mar 2018

Creating Similar Movies from One Million Ratings on EMR

Spark_Scala
Udemy- Apache Spark 2.0 With Scala >> Creating Similar Movies from One Million Ratings on EMR

For this exercise, I compiled my Scala code into a .jar file with IntelliJ IDEA, uploaded it to an S3 bucket I created, and ran it as a Spark job on an AWS EMR cluster that I spun up. The steps I took are documented below with screenshots, since I terminated the EMR cluster at the end to avoid incurring unnecessary costs.

Create S3 Bucket

I set up an S3 bucket to store the log files from my EMR cluster, the .jar executable, and the .dat files from the MovieLens 1M dataset.

S3 Bucket
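The same staging step could also be scripted with the AWS CLI instead of the console. This is only a sketch: the bucket name, the jar name, and the local paths are hypothetical placeholders (the post does not name them), and the script just prints each command unless DRY_RUN=0 is set.

```shell
# Sketch only: the bucket name, jar name, and local paths are hypothetical placeholders.
BUCKET="my-emr-demo-bucket"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print each command, 0 = actually run it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

stage_files() {
  # Create the bucket, then upload the jar and the MovieLens 1M .dat files.
  run aws s3 mb "s3://${BUCKET}"
  run aws s3 cp MovieSimilarities1M.jar "s3://${BUCKET}/MovieSimilarities1M.jar"
  run aws s3 cp ml-1m/movies.dat "s3://${BUCKET}/ml-1m/movies.dat"
  run aws s3 cp ml-1m/ratings.dat "s3://${BUCKET}/ml-1m/ratings.dat"
}

stage_files
```

Keeping the jar and the data under one bucket means the cluster only needs read access to a single location.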

Create EMR Cluster

I initially created a new EMR cluster with the following configuration, but I ended up having to change the instance type from m3.xlarge to m1.medium for the particular availability zone I used. The cluster still had 1 master node and 2 slave nodes (3 instances total).

Creating EMR Cluster

EMR Cluster Summary
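A roughly equivalent cluster could be requested from the AWS CLI. Treat this as a sketch under assumptions: the cluster name, release label, key pair name, and log bucket are not stated in the post and are placeholders here, and whether a given release accepts m1.medium should be checked before running it for real. By default the script only prints the command.

```shell
# Sketch only: cluster name, release label, key pair name, and bucket are assumptions.
CLUSTER_NAME="similar-movies-1m"
RELEASE_LABEL="emr-5.12.0"      # assumed; the post does not state the release used
KEY_NAME="ag-spark"             # assumed EC2 key pair name
BUCKET="my-emr-demo-bucket"     # hypothetical bucket that receives the cluster logs
DRY_RUN="${DRY_RUN:-1}"         # 1 = print the command, 0 = actually run it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

create_cluster() {
  # With --instance-count 3, one instance becomes the master node
  # and the remaining two become core nodes.
  run aws emr create-cluster \
    --name "$CLUSTER_NAME" \
    --release-label "$RELEASE_LABEL" \
    --applications Name=Spark \
    --instance-type m1.medium \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes "KeyName=${KEY_NAME}" \
    --log-uri "s3://${BUCKET}/logs/"
}

create_cluster
```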

Running the Spark Job on the EMR Cluster

Once my new EMR cluster was available for use, I tried to connect to it with SSH from my MacBook. On the first attempt, the operation timed out:

mymachine:~ andgoss$ ssh -i ~/.credentials/ag-spark.pem [email protected]
ssh: connect to host ec2-34-236-254-148.compute-1.amazonaws.com port 22: Operation timed out

To address this, I navigated to the security group for the master node of my EMR cluster (linked from the EMR summary tab) and edited its inbound rules, adding a rule that opens the SSH port (22) to my computer’s IP address:

Open SSH Port
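The inbound rule can also be added from the AWS CLI rather than the console. In this sketch the security group ID and the IP address are placeholders (203.0.113.10 is a documentation-range address, not a real one), and the command is only printed unless DRY_RUN=0 is set.

```shell
# Sketch only: the group ID and IP address below are placeholders.
GROUP_ID="sg-0123456789abcdef0"   # hypothetical ID of the master node's security group
MY_IP="203.0.113.10"              # replace with your own public IP address
DRY_RUN="${DRY_RUN:-1}"           # 1 = print the command, 0 = actually run it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

open_ssh() {
  # Allow inbound TCP 22 from a single address (/32) rather than from anywhere.
  run aws ec2 authorize-security-group-ingress \
    --group-id "$GROUP_ID" \
    --protocol tcp \
    --port 22 \
    --cidr "${MY_IP}/32"
}

open_ssh
```

Restricting the rule to a /32 keeps the master node closed to every address except the one that needs SSH access.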

Here is the terminal output from successfully running the Spark job across my (temporary) EMR cluster, with the repetitive warnings removed for readability. Once the job finished running, I terminated the cluster through the AWS console.
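For reference, the run-and-clean-up sequence described above could look roughly like the following. This is not the original session: the bucket, jar name, main class, and cluster ID are hypothetical placeholders, and the termination is shown with the CLI even though the console works just as well. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
# Sketch only: bucket, jar name, main class, and cluster ID are hypothetical placeholders.
BUCKET="my-emr-demo-bucket"
JAR="MovieSimilarities1M.jar"
MAIN_CLASS="com.example.MovieSimilarities1M"
CLUSTER_ID="j-XXXXXXXXXXXXX"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print each command, 0 = actually run it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

submit_job() {
  # Run on the master node (over SSH): fetch the jar, then hand it to Spark.
  run aws s3 cp "s3://${BUCKET}/${JAR}" "./${JAR}"
  run spark-submit --class "$MAIN_CLASS" "./${JAR}"
}

terminate_cluster() {
  # Run from the local machine once the job has finished, to stop further charges.
  run aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
}

submit_job
terminate_cluster
```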


Andrew Goss
