Time to Spark

After investigating a sample dataset using Python for a month on a local computer, I discovered some interesting patterns and optimized my algorithms to achieve a high measure of goodness. This week, I finally got the opportunity to run my algorithms on a much larger dataset – too large to be stored or processed on a single computer. Finally, it is time to Spark!

As the world becomes increasingly digitized, mobile, and smart, more and more devices get connected to internet: from the most common smartphones and personal computers, to Tesla, iRobot vacuum and Amazon Echo. These devices not only facilitate our life, but also reflect a virtual image of ourselves. Due to their connective nature, devices can potentially be unified to create a holistic figure for product personalization, recommendation system, and optimization.

It is one thing to conceptualize a good business idea, but a totally different story to make it happen, for real.

The challenge comes from data themselves. You might have read about the astonishing amount of data produced in modern world (2.5 Exabyte per day in 2016, which is equivalent to 90 years of HD videos). The reality is more complicated. Not all data are HD videos. They come in all different formats, types, and volumes. Some of them are well structured such as balance sheet and log records, while most data are unstructured or less structured such as texts, images, and sound. Different data structures require different storage formats and infrastructure in order to be used more effectively later.

Then, when data are too large to be stored on a single computer at low cost, as is the case for Tapad, a scalable file system is a must. Apache Hadoop HDFS was invented by Doug Cutting and Mike Cafarella in 2005, to solve this very issue. HDFS provides scalable and reliable data storage, by distributing data across large clusters of commodity servers. It becomes widely used in tech industry to store large amount of data since its release.

After building the crucial infrastructure to store data, now we need to “turn data into knowledge” (by a philosopher colleague). It is a big challenge to process large data stored on HDFS. While Hadoop provides a solution, its two-stage MapReduce paradigm tends to be time consuming. Apache Spark (released in 2014) provides an alternative cluster-computing framework with distributed data parallelism. Spark is faster than Hadoop because Spark runs in-memory on the cluster, and is not tied to Hadoop’s MadReduce paradigm. Spark has APIs available to a variety of programming languages, including Scala, Python, and R, making it a great tool for data scientists to analyze big data and turn it into knowledge. At Tapad, engineers have used Spark and successfully built DeviceGraph using real time big data, providing a unified view of users.

After learning Spark theories from the Coursera course Big Data Analysis with Scala and Spark, with the guidance of my mentor, I ran my first Spark script on a single HDFS file on Tapad’s server. Not surprisingly, my first run failed. We spent hours troubleshooting by looking at the error logs and changing my code, and eventually the word “SUCCEEDED” appeared on the screen!

My Spark script is running on 100 HDFS files over the weekend. Hope I have some good news on Monday.

Ju Yang

Ph.D. / Machine Learning Practitioner in New York

One comment:

Leave a Reply Cancel reply