A Fictional Cab Service

Weekly Cloud Computing Assignment, Spring 2019

Processing a large stream of data using Kafka & Samza.
Predicting cab fares using Machine Learning.

Overview

Stream Processing:

Systems with strict latency and high throughput requirements cannot be served efficiently by batch-oriented frameworks such as MapReduce or Apache Spark. Apache Kafka and Apache Samza are two systems designed for this setting: they enable processing a stream of data in real time, with low latency.

Stream Processing System Overview:

Machine Learning:

Explored and visualized the dataset to construct effective features through feature engineering. Trained an XGBoost (eXtreme Gradient Boosting) model using a cloud-based training and tuning service. Built an end-to-end application that accepted spoken queries about the fare of car rides and responded with a spoken answer, stitching together multiple cloud ML APIs, including the model trained earlier, into a single pipeline and extending its functionality.
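To make that flow concrete, here is a minimal Python sketch of how such a request pipeline could be wired together in Flask. Every helper below is a hypothetical stub standing in for a real GCP API call (Speech-to-Text, Natural Language, ML Engine prediction, Text-to-Speech), so the sketch runs on its own but is not the actual implementation.

    # Minimal sketch of the end-to-end request flow. The route and all
    # helpers are hypothetical stubs standing in for real GCP API calls.
    from flask import Flask, request

    app = Flask(__name__)

    def transcribe(audio: bytes) -> str:
        # Stub for a Cloud Speech-to-Text call.
        return "how much is a cab from JFK to Times Square"

    def extract_places(text: str) -> tuple:
        # Stub for entity extraction via the Natural Language API.
        return "JFK", "Times Square"

    def predict_fare(origin: str, dest: str) -> float:
        # Stub for a call to the custom fare model served on ML Engine.
        return 52.0

    def synthesize(text: str) -> bytes:
        # Stub for a Cloud Text-to-Speech call.
        return text.encode("utf-8")

    @app.route("/fare-query", methods=["POST"])
    def fare_query():
        text = transcribe(request.data)
        origin, dest = extract_places(text)
        fare = predict_fare(origin, dest)
        answer = synthesize(f"The estimated fare is {fare:.2f} dollars.")
        return answer, 200, {"Content-Type": "application/octet-stream"}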

ML System Overview:

Technologies used:

  • Java
  • Apache Kafka
  • Apache Samza
  • Node.js
  • deck.gl
  • react-map-gl
  • Python
  • Flask
  • Jupyter Notebook
  • GCP ML APIs
  • TensorFlow
  • Maven
  • Terraform
  • AWS

Kafka & Samza

Kafka provides four core APIs: the Producer API, the Consumer API, the Streams API, and the Connect API. The Producer API lets applications publish streams of records to topics in the Kafka cluster. The Consumer API lets applications read streams of records from those topics. The Streams API transforms streams of data from input topics into output topics. And the Connect API supports connectors that continually pull data from a source system or application into Kafka, or push data from Kafka into a sink system or application.
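As a rough illustration, here is how the first two of those APIs look from Python with the kafka-python client (the project itself used Java; the broker address and topic name below are assumptions):

    # Sketch of the Producer and Consumer APIs via the kafka-python client.
    # Broker address and topic name are illustrative assumptions.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("driver-locations", {"driverId": 42, "lat": 40.75, "lon": -73.99})
    producer.flush()

    consumer = KafkaConsumer(
        "driver-locations",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)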

The system received a live stream of events in which drivers updated their locations at regular intervals and clients requested rides. A producer parsed these events and wrote the messages to Kafka topics, which a Samza job then consumed to process the streams and match clients with drivers.
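The matching itself ran inside the Samza job (written in Java); the following Python sketch conveys the idea, with the event fields, the in-memory dictionary, and the nearest-driver heuristic standing in for the real message schema and Samza's local state:

    # Simplified sketch of client-driver matching. Field names and the
    # nearest-driver heuristic are assumptions for illustration.
    import math

    driver_locations = {}  # driverId -> (lat, lon), latest known position

    def nearest_driver(lat, lon):
        best, best_dist = None, math.inf
        for driver_id, (dlat, dlon) in driver_locations.items():
            dist = math.hypot(lat - dlat, lon - dlon)
            if dist < best_dist:
                best, best_dist = driver_id, dist
        return best

    def on_event(event):
        if event["type"] == "DRIVER_LOCATION":
            driver_locations[event["driverId"]] = (event["lat"], event["lon"])
        elif event["type"] == "RIDE_REQUEST":
            match = nearest_driver(event["lat"], event["lon"])
            if match is not None:
                print(f"Matched client {event['clientId']} with driver {match}")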

Machine Learning

First, I explored the dataset and cleaned it by removing outliers and handling missing values. I then performed feature engineering to extract and construct meaningful features. Together, these two steps significantly improved accuracy over the baseline model.
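As a flavor of what that cleaning and feature construction can look like in pandas (the column names, bounds, and file path below are illustrative assumptions, not the actual dataset schema):

    # Hypothetical cleaning and feature-engineering steps for a cab-fare
    # dataset; column names and bounds are illustrative assumptions.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("train.csv", parse_dates=["pickup_datetime"])

    # Drop missing values and implausible fares or coordinates.
    df = df.dropna()
    df = df[(df["fare_amount"] > 0) & (df["fare_amount"] < 250)]
    df = df[df["pickup_latitude"].between(40.5, 41.0)]

    # Construct features: straight-line trip distance and time-of-day signals.
    df["distance"] = np.hypot(
        df["dropoff_latitude"] - df["pickup_latitude"],
        df["dropoff_longitude"] - df["pickup_longitude"],
    )
    df["hour"] = df["pickup_datetime"].dt.hour
    df["weekday"] = df["pickup_datetime"].dt.weekday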

I took advantage of the elastic scaling offered by Google Cloud Machine Learning Engine to train the model, and used its built-in automatic hyperparameter tuning to improve the accuracy of the predictions.
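ML Engine's tuner re-runs the training job with different hyperparameter values passed as command-line flags, so the training script exposes each tunable parameter as an argument. A hedged sketch of such an entry point (parameter names, defaults, file paths, and the objective are assumptions):

    # Sketch of a flag-driven XGBoost training entry point, in the style
    # Cloud ML Engine's hyperparameter tuner expects. Defaults, paths, and
    # the objective are illustrative assumptions.
    import argparse
    import xgboost as xgb

    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--num_boost_round", type=int, default=200)
    args = parser.parse_args()

    dtrain = xgb.DMatrix("train.buffer")  # assumed preprocessed training data
    params = {
        "objective": "reg:squarederror",
        "max_depth": args.max_depth,
        "eta": args.learning_rate,
    }
    model = xgb.train(params, dtrain, num_boost_round=args.num_boost_round)
    model.save_model("model.bst")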

I used the Cloud Vision and AutoML APIs to identify common NYC landmarks and their map coordinates.
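Landmark detection is one of the Vision API's built-in annotators; a minimal sketch with the Python client library (the image file and its contents are assumptions):

    # Sketch of landmark detection with the google-cloud-vision client.
    # The image file is an illustrative assumption.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open("landmark.jpg", "rb") as f:
        image = vision.Image(content=f.read())

    response = client.landmark_detection(image=image)
    for landmark in response.landmark_annotations:
        latlng = landmark.locations[0].lat_lng
        print(landmark.description, latlng.latitude, latlng.longitude)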

GCP ML APIs used:

  • App Engine Flexible Environment
  • ML Engine
  • Text-to-Speech
  • Speech-to-Text
  • Natural Language
  • Directions
  • Cloud Vision
  • AutoML

Outcome

Visualization using deck.gl and react-map-gl:

Learnings

Explored and understood the characteristics of IoT stream data, as well as the data-flow and fault-tolerance models of Kafka and Samza.

Practiced test-driven development (TDD) with JUnit while developing Kafka-based solutions for handling big data.
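In that same test-first spirit (the course work itself used JUnit against Java code), a Python flavor of such a unit test for the hypothetical matching helper sketched earlier might look like:

    # Test-first flavor in Python; the matching helper and coordinates are
    # hypothetical (the actual tests were written in JUnit).
    import math
    import unittest

    def nearest_driver(drivers, lat, lon):
        best, best_dist = None, math.inf
        for driver_id, (dlat, dlon) in drivers.items():
            dist = math.hypot(lat - dlat, lon - dlon)
            if dist < best_dist:
                best, best_dist = driver_id, dist
        return best

    class NearestDriverTest(unittest.TestCase):
        def test_picks_closest_driver(self):
            drivers = {1: (40.70, -74.00), 2: (40.75, -73.99)}
            self.assertEqual(nearest_driver(drivers, 40.76, -73.98), 2)

        def test_returns_none_when_no_drivers(self):
            self.assertIsNone(nearest_driver({}, 40.7, -74.0))

    if __name__ == "__main__":
        unittest.main()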

Designed and implemented a solution that joins and processes multiple streams of GPS, IoT, and static data with the Samza API to set up an advertising system.

Experimented with deploying a Kafka and Samza system on a YARN cluster provisioned in the cloud, and managed and monitored Samza jobs on YARN.

Constructed new features using feature-engineering methods to improve the accuracy of an ML-based predictor, and evaluated the impact of different features on that accuracy.

Performed hyperparameter tuning on Google ML Engine to improve the accuracy of the predictor.

Deployed and evaluated an end-to-end solution on Google App Engine (GAE) that required a pipeline of cloud ML APIs and custom-trained ML models.

More Details

Academic integrity policies prevent me from going into detail about the implementation and my particular approach to the project. If you would like to learn more, please contact me. I'm always happy to chat.