Creating Similar Movies from One Million Ratings on EMR
I completed an exercise that involved compiling a .jar file from my Scala code in IntelliJ IDEA, uploading it to an S3 bucket I created, and then running a Spark job on an AWS EMR cluster that I spun up. Here are the steps I took, with screenshots, since I terminated my EMR cluster at the end to avoid incurring unnecessary costs.
Create S3 Bucket
I set up an S3 bucket to store the log files from my EMR cluster, the .jar executable, and the .dat files from the MovieLens 1M dataset.
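For anyone who prefers the command line to the console, the same setup can be sketched with the AWS CLI. This is only a sketch under assumptions: the bucket name, the jar name, and the local `ml-1m` folder are hypothetical placeholders, not the names I actually used. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: BUCKET, JAR and DATA_DIR are hypothetical placeholders.
BUCKET="${BUCKET:-my-emr-movie-bucket}"
JAR="${JAR:-MovieSimilarities1M.jar}"
DATA_DIR="${DATA_DIR:-ml-1m}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

setup_bucket() {
  # Create the bucket, then upload the compiled jar.
  run aws s3 mb "s3://$BUCKET"
  run aws s3 cp "$JAR" "s3://$BUCKET/$JAR"
  # Upload the three .dat files that ship with MovieLens 1M.
  for f in movies.dat ratings.dat users.dat; do
    run aws s3 cp "$DATA_DIR/$f" "s3://$BUCKET/$DATA_DIR/$f"
  done
}

setup_bucket
```

Running the script with `DRY_RUN=0` executes the commands instead of printing them.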
Create EMR Cluster
I initially created a new EMR cluster with the following configuration, but I had to change the instance type from m3.xlarge to m1.medium for the particular availability zone I used. I still created 1 master and 2 slave nodes in the cluster (3 instances total).
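A roughly equivalent cluster could be requested from the AWS CLI instead of the console. Treat this as a sketch under assumptions: the cluster name, the release label, the key pair name, and the bucket are hypothetical placeholders, and a given release label may not accept every instance type. Only the instance type (m1.medium) and the instance count (3, which EMR splits into 1 master and 2 core nodes) come from the setup described above.

```shell
#!/usr/bin/env bash
# Sketch only: the names and the release label below are assumptions.
CLUSTER_NAME="${CLUSTER_NAME:-movie-sims-1m}"
RELEASE_LABEL="${RELEASE_LABEL:-emr-5.8.0}"
KEY_NAME="${KEY_NAME:-ag-spark}"
BUCKET="${BUCKET:-my-emr-movie-bucket}"
INSTANCE_TYPE="${INSTANCE_TYPE:-m1.medium}"
INSTANCE_COUNT="${INSTANCE_COUNT:-3}"   # 1 master + 2 core nodes
DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print, 0 = actually run

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

create_cluster() {
  # Request a Spark cluster that writes its logs to the S3 bucket.
  run aws emr create-cluster \
    --name "$CLUSTER_NAME" \
    --release-label "$RELEASE_LABEL" \
    --applications Name=Spark \
    --instance-type "$INSTANCE_TYPE" \
    --instance-count "$INSTANCE_COUNT" \
    --use-default-roles \
    --ec2-attributes "KeyName=$KEY_NAME" \
    --log-uri "s3://$BUCKET/logs/"
}

create_cluster
```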
Running the Spark Job on the EMR Cluster
Once my new EMR cluster was available for use, I tried to connect to it with SSH from my MacBook. On the first attempt, the operation timed out:
mymachine:~ andgoss$ ssh -i ~/.credentials/ag-spark.pem [email protected]
ssh: connect to host ec2-34-236-254-148.compute-1.amazonaws.com port 22: Operation timed out
To address this, I navigated to the security group for the master node of my EMR cluster (linked from the EMR summary tab). There, I edited the inbound rules and added a new rule that opens port 22 (SSH) to my computer’s IP address.
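The same inbound rule can also be added from the AWS CLI. A minimal sketch, assuming a hypothetical security group ID and a placeholder IP address from the documentation range; substitute the master node's actual security group and your own public IP.

```shell
#!/usr/bin/env bash
# Sketch only: SG_ID and MY_IP are hypothetical placeholders.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
MY_IP="${MY_IP:-203.0.113.10}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

open_ssh() {
  # Allow inbound TCP on port 22 from a single address (/32).
  run aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" \
    --protocol tcp \
    --port 22 \
    --cidr "$MY_IP/32"
}

open_ssh
```

Restricting the rule to a /32 keeps SSH closed to every address except the one machine that needs it.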
Here is the terminal output from successfully running this Spark job across my (temporary) EMR cluster. I removed the repetitive warnings to make it more readable. Once the job finished running, I terminated the cluster through the AWS console.
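For reference, the run-and-clean-up steps might look roughly like this as commands. This is a sketch under assumptions: the bucket, jar name, main class, and cluster ID are hypothetical placeholders. The first function is meant to be run on the master node over SSH. The second is a CLI alternative to terminating the cluster in the console, run from a machine with the AWS CLI configured.

```shell
#!/usr/bin/env bash
# Sketch only: every name below is a hypothetical placeholder.
BUCKET="${BUCKET:-my-emr-movie-bucket}"
JAR="${JAR:-MovieSimilarities1M.jar}"
MAIN_CLASS="${MAIN_CLASS:-com.example.MovieSimilarities1M}"
CLUSTER_ID="${CLUSTER_ID:-j-XXXXXXXXXXXXX}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# On the master node: fetch the jar from S3 and submit it to Spark.
run_job() {
  run aws s3 cp "s3://$BUCKET/$JAR" "./$JAR"
  run spark-submit --class "$MAIN_CLASS" "$JAR"
}

# Afterwards: shut the cluster down so it stops accruing charges.
terminate_cluster() {
  run aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
}

run_job
terminate_cluster
```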
I'm a senior data engineer working to advance data-driven cultures by wrangling disparate data sources and empowering end users to uncover key insights that tell a bigger story.