Algorithms and Software System for Analysis of Twitter Data using Apache Spark
The goal of this project is to develop a software system to collect, store, organize, and query Twitter messages, and to develop algorithms that process the Twitter data to extract value-added information, in particular the geolocation of Tweets. First, we will design and implement a processing and analytics system for Twitter data in the Apache Spark environment. Second, we will research and extend advanced algorithms that infer the geolocation of Tweets from their content. This work will benefit Spotzi, which develops and sells new media analysis products with a focus on geography-related applications. Novel use of Spark as an efficient, scalable distributed computing environment will allow Spotzi to expand its services to larger data sizes and greater processing capacity. Additionally, since many of Spotzi’s products rely on geographic information, enhancing the geolocation information in Spotzi’s data is important for the accuracy of its analytics products.