· spark scala data streaming

Twitter Streaming with Spark and Scala

Spark_Scala
Udemy- Apache Spark 2.0 With Scala >> Twitter Streaming


Go to: DataChatter | TweetLength

PopularHashtags.scala

This is a Spark streaming script that monitors live tweets from Twitter and keeps track of the 10 most popular hashtags as tweets are received.

Each hashtag is mapped to a key/value pair of (hashtag, 1) so they can be counted up over a 5-minute sliding window with this line of code:

val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow((x,y) => x + y, (x,y) => x - y, Seconds(300), Seconds(1))

Sample Output

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/03/21 11:05:05 INFO SparkContext: Running Spark version 2.2.0
18/03/21 11:05:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/21 11:05:06 WARN Utils: Your hostname, 'mymachine' resolves to a loopback address: 127.0.0.1; using 192.168.1.105 instead (on interface en0)
18/03/21 11:05:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/03/21 11:05:06 INFO SparkContext: Submitted application: PopularHashtags
18/03/21 11:05:06 INFO SecurityManager: Changing view acls to: andgoss
18/03/21 11:05:06 INFO SecurityManager: Changing modify acls to: andgoss
18/03/21 11:05:06 INFO SecurityManager: Changing view acls groups to:
18/03/21 11:05:06 INFO SecurityManager: Changing modify acls groups to:
18/03/21 11:05:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(andgoss); groups with view permissions: Set(); users with modify permissions: Set(andgoss); groups with modify permissions: Set()
18/03/21 11:05:07 INFO Utils: Successfully started service 'sparkDriver' on port 55527.
18/03/21 11:05:07 INFO SparkEnv: Registering MapOutputTracker
18/03/21 11:05:07 INFO SparkEnv: Registering BlockManagerMaster
18/03/21 11:05:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/03/21 11:05:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/03/21 11:05:07 INFO DiskBlockManager: Created local directory at /private/var/folders/gp/bcfcnndd1r560tcdwqpqr4t5ysdjsk/T/blockmgr-7bc076c8-378f-4be4-83b7-9ee59feac8a7
18/03/21 11:05:07 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
18/03/21 11:05:07 INFO SparkEnv: Registering OutputCommitCoordinator
18/03/21 11:05:07 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/03/21 11:05:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.105:4040
18/03/21 11:05:08 INFO Executor: Starting executor ID driver on host localhost
18/03/21 11:05:08 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55528.
18/03/21 11:05:08 INFO NettyBlockTransferService: Server created on 192.168.1.105:55528
18/03/21 11:05:08 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/03/21 11:05:08 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.105, 55528, None)
18/03/21 11:05:08 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.105:55528 with 2004.6 MB RAM, BlockManagerId(driver, 192.168.1.105, 55528, None)
18/03/21 11:05:08 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.105, 55528, None)
18/03/21 11:05:08 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.105, 55528, None)
-------------------------------------------
Time: 1521644740000 ms
-------------------------------------------
(#บุพเพสันนิวาส,21)
(#GOT7,6)
(#워너원,5)
(#동방신기,5)
(#FavMusicalGroupTwentyOnePilots,3)
(#からだすこやか茶W,3)
(#JTMA2018,3)
(#KCA,3)
(#GalaxyGift,3)
(#라이관린,3)
...
-------------------------------------------
Time: 1521644741000 ms
-------------------------------------------
(#บุพเพสันนิวาส,22)
(#GOT7,6)
(#워너원,5)
(#GalaxyGift,5)
(#동방신기,5)
(#FavMusicalGroupTwentyOnePilots,3)
(#からだすこやか茶W,3)
(#JTMA2018,3)
(#KCA,3)
(#라이관린,3)
...
-------------------------------------------
Time: 1521644742000 ms
-------------------------------------------
(#บุพเพสันนิวาส,25)
(#GOT7,6)
(#워너원,5)
(#GalaxyGift,5)
(#동방신기,5)
(#KCA,4)
(#FavMusicalGroupTwentyOnePilots,3)
(#からだすこやか茶W,3)
(#JTMA2018,3)
(#라이관린,3)
...
-------------------------------------------
Time: 1521644743000 ms
-------------------------------------------
(#บุพเพสันนิวาส,25)
(#GOT7,6)
(#워너원,5)
(#GalaxyGift,5)
(#동방신기,5)
(#KCA,4)
(#FavMusicalGroupTwentyOnePilots,3)
(#からだすこやか茶W,3)
(#JTMA2018,3)
(#라이관린,3)
...

Takeaways

DataChatter.scala

I experimented a bit further by creating a new script that pulls live tweets from Twitter related to the word ‘data’. I initially wanted to pull tweets related to data engineering, but the volume of tweets with both of these words was much more sparse and pretty much entirely from recruiters.

This time instead of measuring the popularity of hashtags, I wanted to see the most popular words in tweets on the topic of data. I wanted to only include meaningful words in the results so I introduced a file of stop words and symbols to filter out of the tweet words. I also excluded the word ‘data’ itself since these tweets all have that word in common. These were the most common words pertaining to data:

-------------------------------------------
Time: 1521748832000 ms
-------------------------------------------
(facebook,62)
(cambridge,22)
(zuckerberg,20)
(analytica,18)
(weather,17)
(mark,15)
(people,15)
(pm,14)
(obama,13)
(over,12)
...
-------------------------------------------
Time: 1521748834000 ms
-------------------------------------------
(facebook,65)
(cambridge,24)
(zuckerberg,21)
(analytica,19)
(weather,18)
(mark,16)
(people,16)
(pm,14)
(over,13)
(obama,13)
...
view raw DataChatter.jar hosted with ❤ by GitHub

Takeaways

TweetLength.scala

Another self-challenge I completed was creating a new script to see what the most common length of live tweets is. I again used a 5-minute window that slides every one second. These were the most common number of characters per tweet at runtime:

-------------------------------------------
Time: 1522165838000 ms
-------------------------------------------
(140,94)
(139,16)
(138,4)
(120,3)
(109,3)
(131,3)
(84,2)
(112,2)
(80,2)
(124,2)
...
-------------------------------------------
Time: 1522165839000 ms
-------------------------------------------
(140,96)
(139,16)
(138,4)
(120,3)
(109,3)
(131,3)
(84,2)
(112,2)
(80,2)
(124,2)
...
view raw TweetLength.jar hosted with ❤ by GitHub

Takeaways

View my code on GitHub


About Me

I'm a data leader working to advance data-driven cultures by wrangling disparate data sources and empowering end users to uncover key insights that tell a bigger story. LEARN MORE >>