
Twitter Streaming with Spark and Scala

Udemy- Apache Spark 2.0 With Scala >> Twitter Streaming



This is a Spark streaming script that monitors live tweets from Twitter and keeps track of the 10 most popular hashtags as tweets are received.

Each hashtag is mapped to a key/value pair of (hashtag, 1) so that the counts can be summed over a 5-minute window that slides every second, using this line of code:

val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow((x,y) => x + y, (x,y) => x - y, Seconds(300), Seconds(1))
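
To illustrate what the two functions in that call do, here is a minimal sketch in plain Scala with no Spark dependency. When an inverse function is supplied, Spark can update the previous window's totals incrementally: values from the batch entering the window are combined with (x,y) => x + y, and values from the batch leaving it are removed with (x,y) => x - y. Spark requires checkpointing to be enabled for this form of the call. The object and helper names below (WindowedHashtagCounts, hashtagPairs, countBatch, slide, top) are hypothetical and exist only for this illustration. Dropping hashtags whose count falls to zero is also a choice made for this sketch.

```scala
object WindowedHashtagCounts {
  type Counts = Map[String, Int]

  // Split a tweet on whitespace, keep the hashtags, and pair each with 1.
  def hashtagPairs(tweet: String): Seq[(String, Int)] =
    tweet.split("\\s+").toSeq.filter(_.startsWith("#")).map(tag => (tag, 1))

  // Sum the (hashtag, 1) pairs of one batch into per-hashtag counts.
  def countBatch(tweets: Seq[String]): Counts =
    tweets.flatMap(hashtagPairs).groupBy(_._1).map { case (tag, pairs) =>
      tag -> pairs.map(_._2).sum
    }

  // Slide the window by one batch without recounting everything inside it:
  // add the entering batch (x + y), then subtract the leaving batch (x - y).
  def slide(window: Counts, entering: Counts, leaving: Counts): Counts = {
    val added = entering.foldLeft(window) { case (acc, (tag, n)) =>
      acc.updated(tag, acc.getOrElse(tag, 0) + n)
    }
    val removed = leaving.foldLeft(added) { case (acc, (tag, n)) =>
      acc.updated(tag, acc.getOrElse(tag, 0) - n)
    }
    // Sketch-only choice: forget hashtags that are no longer in the window.
    removed.filter { case (_, n) => n > 0 }
  }

  // The n most frequent hashtags, ties broken alphabetically.
  def top(counts: Counts, n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (tag, c) => (-c, tag) }.take(n)
}
```

Calling slide once per batch keeps the cost of each update proportional to the size of the entering and leaving batches rather than to the full 300 seconds of data.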

Sample Output



I experimented a bit further by creating a new script that pulls live tweets containing the word ‘data’. I initially wanted to pull tweets about data engineering, but tweets containing both of these words were much sparser and came almost entirely from recruiters.

This time, instead of measuring the popularity of hashtags, I wanted to see the most popular words in tweets on the topic of data. To include only meaningful words in the results, I introduced a file of stop words and symbols to filter out of the tweet text. I also excluded the word ‘data’ itself, since every one of these tweets has that word in common. These were the most common words pertaining to data:
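
As a rough illustration of the filtering step described above (not the script's actual code), the sketch below lowercases each tweet, strips symbols, and drops stop words before counting. The stop list here is a small hypothetical stand-in for the file of stop words and symbols, and the object name, helper names, and the punctuation-stripping regular expression are all assumptions made for this example.

```scala
object DataWordCounts {
  // Hypothetical stop list; the real script reads its list from a file.
  // The word "data" is included because every tweet contains it.
  val stopWords: Set[String] =
    Set("a", "the", "is", "of", "and", "to", "in", "for", "rt", "data")

  // Lowercase, split on whitespace, strip symbols, drop stop words.
  def meaningfulWords(tweet: String, stop: Set[String]): Seq[String] =
    tweet.toLowerCase
      .split("\\s+").toSeq
      .map(_.replaceAll("[^a-z0-9#@']", "")) // assumed definition of "symbols"
      .filter(w => w.nonEmpty && !stop.contains(w))

  // Count how often each remaining word appears across a batch of tweets.
  def wordCounts(tweets: Seq[String], stop: Set[String]): Map[String, Int] =
    tweets
      .flatMap(meaningfulWords(_, stop))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

In a streaming job, the same per-tweet function could be applied before mapping each word to a (word, 1) pair for the windowed count.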



Another self-challenge I completed was a new script that finds the most common lengths of live tweets. I again used a 5-minute window that slides every second. These were the most common character counts per tweet at runtime:
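
The length count follows the same key/value pattern as the hashtag count, with the tweet's length as the key. The sketch below is a plain-Scala illustration under that assumption. The names are hypothetical, and it measures length with String.length, which counts UTF-16 code units and may differ from how the script or Twitter counts characters.

```scala
object TweetLengthCounts {
  // Map each tweet to a (length, 1) pair, mirroring the (hashtag, 1) pairs.
  def lengthPairs(tweets: Seq[String]): Seq[(Int, Int)] =
    tweets.map(t => (t.length, 1))

  // Sum the pairs so each length maps to the number of tweets of that length.
  def lengthCounts(tweets: Seq[String]): Map[Int, Int] =
    lengthPairs(tweets).groupBy(_._1).map { case (len, pairs) =>
      len -> pairs.map(_._2).sum
    }

  // The n most common lengths, ties broken by the shorter length first.
  def mostCommon(counts: Map[Int, Int], n: Int): Seq[(Int, Int)] =
    counts.toSeq.sortBy { case (len, c) => (-c, len) }.take(n)
}
```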


View my code on GitHub

About Me

I'm a data engineering manager working to advance data-driven cultures by wrangling disparate data sources and empowering end users to uncover key insights that tell a bigger story.
