Twitter Analytics on Cloud

Cloud Computing (15-619) Team Project, Spring 2019

A highly performant, scalable, and reliable analytical web service on a terabyte of raw tweets.

Overview

Web Tier:

We built a web service that receives HTTP GET requests and returns the responses defined in the query sections below. The web service ran smoothly for several hours, handling thousands of requests per second without refusing queries, and was designed to tolerate heavy load.

Storage Tier:

We used SQL (MySQL) and NoSQL (HBase) databases in the first two phases of this project, compared their performance across query types and dataset sizes, and then decided on an appropriate storage tier for the final system.

Restrictions:

We were only allowed to use M-family AWS instances and General Purpose SSD (gp2) volumes for instance storage. We used spot instances for batch jobs and development.

Dataset & ETL:

The Twitter dataset was larger than 1 TB after decompression. Duplicate and malformed records were accounted for during the ETL, which we performed on GCP using Spark and Scala.

Technologies used:

  • Java
  • Vert.x
  • Undertow
  • Multithreading
  • Scala
  • Spark
  • HBase
  • MySQL
  • DynamoDB
  • Maven
  • Terraform
  • Google Cloud Platform
  • Amazon Web Services
  • AWS Lightsail
  • AWS Fargate
  • AWS Lambda
  • AWS ECR

Query 1 - Blockchain

Query 1 is a ramp-up task which made sure the team had a working web service.

Our task was to implement part of a blockchain network: a completely decentralized, distributed system in which, instead of relying on or being subject to the authority of a single entity, every participant can verify the validity of transactions. The motivation for such a currency system is to avoid worst-case scenarios in which banks make bad investments and burn through their clients' money.

Cryptography: We used asymmetric cryptography, also known as public-key cryptography, which uses a public key and a private key to encrypt and decrypt data. The keys are simply large numbers that are paired together but are not identical (hence asymmetric).
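
As a rough illustration of how a key pair is used for signing, here is a minimal sketch built on the JDK's standard `java.security` classes. The class name `RsaSignatureSketch`, the 2048-bit key size, and the `SHA256withRSA` scheme are assumptions made for this example; the key sizes and signature scheme the project actually used are not described here.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical helper for illustration; not the project's implementation.
final class RsaSignatureSketch {

    // Generates a fresh RSA key pair (2048 bits is an assumed size).
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Signs the message with the private key.
    static byte[] sign(PrivateKey key, String message) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(key);
            signer.update(message.getBytes(StandardCharsets.UTF_8));
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Checks the signature with the matching public key.
    static boolean verify(PublicKey key, String message, byte[] signature) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key);
            verifier.update(message.getBytes(StandardCharsets.UTF_8));
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Anyone holding the public key can check that a message was signed by the holder of the private key, which is what lets each participant validate transactions independently.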

Each request arrived as a JSON document that had been zlib-compressed and then encoded with URL-safe Base64. We sent back a response for each request along with the proof-of-work and the RSA signature.
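
The request encoding can be sketched with the JDK's built-in `Base64`, `Deflater`, and `Inflater` classes. The class name `RequestCodec` and its methods are hypothetical, and the sketch assumes the payload is UTF-8 JSON text compressed with a standard zlib header; it is not the project's actual handler.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; names are not from the project.
final class RequestCodec {

    // JSON text -> zlib-compressed bytes -> URL-safe Base64 string.
    static String encode(String json) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(json.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return Base64.getUrlEncoder().encodeToString(out.toByteArray());
        } finally {
            deflater.end();
        }
    }

    // URL-safe Base64 string -> zlib-compressed bytes -> JSON text.
    static String decode(String payload) {
        byte[] compressed = Base64.getUrlDecoder().decode(payload);
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and no more input means the stream is truncated.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported zlib stream");
                }
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("payload is not valid zlib data", e);
        } finally {
            inflater.end();
        }
    }
}
```

The URL-safe Base64 alphabet replaces `+` and `/` with `-` and `_`, so the encoded payload can travel in a GET query string without further escaping of those characters.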

Achieved RPS: 45,000.

Query 2 - User Recommendation

The web service recommended close friends of the user named in the request, based on the Twitter dataset. The request contained a user ID from the dataset, and the response was based on a score calculated for every other user with respect to that user ID.

The ETL was the most crucial part of this query, and also the most tedious one. Given the dataset size (more than 1 TB), we chose Spark to load the data into MySQL and HBase.

Achieved RPS: 12,000.

Query 3 - Topic Words Extraction

The third query dug further into MySQL, HBase, and our web-tier framework: find the most effective topic words in a given time range.

  • Find all the tweets posted by users within the given uid range and within the given time range.
  • Calculate the topic score, a modified version of TF-IDF, to extract the topic words.
  • Sort the topic words and the tweets; each tweet must contain at least one of those topic words.
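
For reference, plain TF-IDF can be sketched as below. This is the textbook formula, not the modified topic score used in the project, whose details are not given here. The class name `TfIdfSketch`, the pre-tokenized input, and the choice to sum each word's score across tweets are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; textbook TF-IDF, not the project's variant.
final class TfIdfSketch {

    // For each word, returns the sum over all tweets of tf(word, tweet) * idf(word),
    // where tf is the word's share of the tweet and idf = ln(N / tweets containing it).
    static Map<String, Double> scores(List<List<String>> tweets) {
        int total = tweets.size();

        // Document frequency: in how many tweets does each word appear?
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> tweet : tweets) {
            for (String word : new HashSet<>(tweet)) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }

        Map<String, Double> result = new HashMap<>();
        for (List<String> tweet : tweets) {
            if (tweet.isEmpty()) {
                continue; // an empty tweet contributes nothing
            }
            Map<String, Integer> counts = new HashMap<>();
            for (String word : tweet) {
                counts.merge(word, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double tf = (double) entry.getValue() / tweet.size();
                double idf = Math.log((double) total / docFreq.get(entry.getKey()));
                result.merge(entry.getKey(), tf * idf, Double::sum);
            }
        }
        return result;
    }
}
```

A word that appears in every tweet gets an IDF of zero and therefore a score of zero, which is why very common words drop out of the topic-word ranking.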

For this query, we had to redo the ETL based on the new requirements. We also used managed AWS services for this query.

Achieved RPS: 6,000.

Outcome

Achieved Maximum RPS
Mixed Live Test for Query 1

Learnings

Built a performant and reliable web service on the cloud within a specified budget.

Designed, developed, deployed, and optimized functional web-servers that can handle a high load (~ tens of thousands of requests per second).

Implemented extract, transform, and load (ETL) on a large dataset (~1 TB) and loaded the data into MySQL and HBase.

Designed schemas, and configured and optimized MySQL and HBase databases to handle scale and improve throughput.

Explored methods to identify potential bottlenecks in a cloud-based web service and methods to improve system performance.

Developed fault-tolerant, scalable web-servers to respond to a live load.

Explored various methods, tools, configurations and optimizations to improve the performance of a web service deployed on cloud managed services.

Configured and deployed a managed storage-tier to handle range read queries.

More Details

Academic integrity rules prevent me from going into detail about the implementation and our unique approach to the project. If you would like to learn more, please contact me.
I'm always happy to chat.