5 things you need to know about Machine Learning Systems

The more I work on building end-to-end machine learning (ML) pipelines, the more I realize the importance of system design and infrastructure. ML shares many concerns with traditional software development, but it also poses new challenges for system design.

To learn more about Machine Learning Systems (MLSys), I started taking the Machine Learning Systems course taught by Professor Joseph E. Gonzalez at UC Berkeley. The course’s website has a comprehensive reading list and slides. The Spring semester syllabus focuses on the architecture, hardware, software, and performance of AI systems, including classic papers on neural nets, AutoML, etc. The Fall semester syllabus (the latest) takes a different approach and focuses on research areas in MLSys rather than individual applications. While there is a fair amount of overlap in content between the two semesters, I find that the Fall semester provides a high-level, structured outline of MLSys, while the Spring semester pays close attention to the technical details of MLSys in development. I recommend that ML practitioners start with the Fall semester to understand the big picture of MLSys, then refer to the Spring semester reading list for more in-depth technical insights.

I am currently on week 2 of the Fall semester syllabus. In this post, I summarize 5 things you need to know about ML systems. 

1. MLSys is among the driving forces of the recent AI revolution.

The last two decades have witnessed a fast wave of AI breakthroughs in image recognition, speech recognition, natural language processing, reinforcement learning, recommender systems, etc. Here is my opinionated list of the MLSys milestones of recent years.

Most of these new developments are driven by the following forces:

1) Large datasets are collected, such as ImageNet, Google search logs, Netflix watch history, Facebook ad feedback, and self-driving car data. Large datasets enable the training of complex models.  

2) Hardware systems are developed for large-scale data storage and processing. I got my first GPU, an NVIDIA GeForce GTX 970, in 2016 to satisfy the graphics requirements of a video game called “Overwatch”. New generations of GPUs, and later TPUs, were developed for more efficient data processing and matrix computation. In addition, robust and cost-efficient data storage and management on commercial cloud platforms alleviates the heavy overhead of database management. 

3) Software frameworks are developed to handle large-scale data processing and model training more efficiently. Scikit-learn is a good start for ML practice, until you realize it does not support distributed data processing and training. Apache Spark, with its uniquely designed RDD data structure, allows fast data manipulation at scale, and the Spark MLlib package is widely used in industry for machine learning models. In recent years, with the development of deep learning, various open-source frameworks such as Theano, TensorFlow, PyTorch, and MXNet have been developed to handle hardware-software integration, abstraction, and automatic differentiation. 

4) New algorithms and models are invented, such as DQN, A3C, and self-play in reinforcement learning; Word2Vec, BERT, and ELMo in natural language processing; and VGG, ResNet, and Inception in image processing. 
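The automatic differentiation mentioned in point 3 is the core service frameworks like TensorFlow and PyTorch provide. Here is a minimal, purely illustrative sketch of reverse-mode autodiff in plain Python; the names (`Var`, `backward`) are my own, not any framework’s real API:

```python
class Var:
    """A scalar value that records how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value      # forward-pass value
        self.grad = 0.0         # accumulated gradient
        self.parents = parents  # (parent Var, local derivative) pairs

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, grad=1.0):
        # Apply the chain rule path-by-path. This is correct because
        # differentiation is linear, though real frameworks traverse the
        # graph in topological order for efficiency.
        self.grad += grad
        for parent, local_deriv in self.parents:
            parent.backward(grad * local_deriv)

x = Var(3.0)
y = Var(4.0)
z = x * y + x       # z = x*y + x, so dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

Note that `x` appears twice in the expression, and its two gradient contributions accumulate correctly, which is exactly the bookkeeping these frameworks automate at scale.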

2. MLSys is an emerging interdisciplinary field.

ML has always been an interdisciplinary field, drawing on statistics, mathematics, computer science, operations research, etc. Until recently, ML education and courses focused mostly on developing models and algorithms for inference and prediction: if you have taken Andrew Ng’s Coursera courses, you can easily name 3 supervised learning algorithms or 3 neural networks, and with a few days of hands-on exercise, use the Scikit-learn or TensorFlow package to build a prototype from data preprocessing to model evaluation. With a bit more hacking, you can replicate Netflix’s matrix factorization solution for a basic recommender system. Many ML projects on Kaggle and in school provide great insights into data processing and model design (such as the magic of ensemble methods).
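The matrix-factorization recipe behind that Netflix-style recommender fits in a few lines: learn a latent vector per user and per item so that their dot product approximates the observed ratings. The toy data, learning rate, and regularization below are made up for illustration:

```python
import random

random.seed(0)

# Observed (user, item) -> rating entries of a sparse ratings matrix.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 2): 1.0, (2, 1): 2.0}
n_users, n_items, k = 3, 3, 2   # k latent factors

# Small random initial factor matrices P (users) and Q (items).
P = [[random.random() * 0.1 for _ in range(k)] for _ in range(n_users)]
Q = [[random.random() * 0.1 for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.05, 0.02
for _ in range(2000):
    for (u, i), r in ratings.items():
        pred = sum(P[u][f] * Q[i][f] for f in range(k))
        err = r - pred
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)  # SGD step on user factor
            Q[i][f] += lr * (err * pu - reg * qi)  # SGD step on item factor

pred = sum(P[0][f] * Q[0][f] for f in range(k))
print(round(pred, 1))  # should land near the observed rating of 5.0
```

The same dot product over unobserved (user, item) pairs yields the predicted ratings used for recommendation; production systems add biases, implicit feedback, and distributed training on top of this core idea.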

However, once we start working in the real world to develop ML products, even in a “simple” case similar to Predict Survival Rate on Titanic, we immediately realize that our knowledge of ML does not meet the practical requirements to ship the ML solution we have proudly developed in Jupyter notebooks. 

To name a few non-ML tasks: how to collect, partition, and store data; how to use a database; how to build robust, production-level data pipelines; how to distribute a training job so it scales with large amounts of data; how to serve, in real time, a model that was trained in batch; and how to monitor the performance of a model and identify drift. Take a step back: how to write good code with proper naming, modules, abstraction, tests, and version control. Take a step forward: how to design an ML system that trades off cost, latency, and performance, and how to ensure ML systems are fair, secure, and explainable. 
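To make one of those non-ML tasks concrete, here is a minimal sketch of drift detection: comparing a feature’s training-time distribution against what the model sees in production. It uses the Population Stability Index (PSI), a common heuristic; the 0.2 threshold is a rule of thumb, not a law, and the data below is synthetic:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Floor at a tiny epsilon so empty bins don't produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]             # roughly uniform on [0, 1)
live_ok = [i / 100 for i in range(100)]           # same distribution
live_drift = [0.8 + i / 500 for i in range(100)]  # mass shifted to [0.8, 1.0)

print(psi(train, live_ok) < 0.2)     # True: no drift
print(psi(train, live_drift) > 0.2)  # True: drift detected
```

In production, a check like this would run on a schedule per feature and per prediction score, with alerts wired to whatever monitoring stack the team uses.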

In the face of these challenges and new research frontiers, the MLSys conference was established. The 2020 MLSys conference will be held on March 2-4. 

3. MLSys is complex.

While this seems like a self-explanatory statement, I want to emphasize the origins of complexity in MLSys. As described in Principles of Computer System Design: An Introduction, a complex system shows the following signs:

1) A large number of components: most modern ML models involve large datasets and many parameters, and an ML pipeline consists of multiple stages from preprocessing to training to evaluation. The sheer size of an ML system (or its GitHub repository) not only makes it difficult for an individual to grasp the big picture, but also makes it hard to debug when something goes wrong: maybe it is the model, maybe it is a feature, or maybe it is the database. 

2) A large number of interconnections: ML pipelines usually have modules connected to each other: bad features result in bad model performance, no matter how much you tune the model. And when the serving output looks odd, see whether you can easily identify the root cause in a post-hoc analysis. 

3) A large number of irregularities: in other words, exceptions, outliers, and missing values, as well as a long-tail distribution of latency. Those who appreciate the blessing and curse of sparse features understand the meaning of irregularity.

4) A long description: just try to explain your project to a coworker who is also an ML expert, but not on your project. Or explain your project to a non-tech friend, or to your parents at a family dinner. Here is what I say to my parents:

“Every time you watch a streaming video for free, there are ads showing up during the video… my job is to decide the best ad to show you so that you are more likely to click it, based on things we know about you, such as your watch history… although in my case, it is not a video app, it is a music app.” 

As you can tell, it is anything but short. 

5) A team of designers, implementers, and maintainers: unfortunately (but not surprisingly), these 3 roles are usually not the same person. The designers sketch the grand picture and framework, the implementers code it up and ship it, and the maintainers get lost in the maze and scratch their heads trying to debug. When things go well, all 3 roles communicate efficiently to make sure the MLSys evolves. When things go wrong, it can easily become a blame game, and each party can be reluctant to take responsibility. 
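The long-tail latency mentioned in point 3 is worth making concrete: a mean latency can look healthy while the 99th percentile tells a very different story. The numbers below are made up for illustration, and the percentile function is a simple nearest-rank sketch:

```python
def percentile(values, p):
    """Nearest-rank percentile: the value below which about p% of data falls."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 95 fast requests at 10 ms, 5 slow stragglers at 2000 ms.
latencies_ms = [10.0] * 95 + [2000.0] * 5

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(mean, p50, p99)  # 109.5 10.0 2000.0
```

Five stragglers out of a hundred are enough to multiply the mean by ten and push the p99 to two seconds, which is why serving dashboards report tail percentiles rather than averages.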

4. ML and system research are mutually beneficial.

As described in part 1, the integration of hardware and software systems in ML accelerates the AI revolution. Conversely, ML models have been adopted in traditional system design to build better hardware and software: machine learning for systems.

At the NeurIPS conference last month (December 2019), a workshop on ML for Systems was organized to “improve the state of the art in the areas where learning has already proven to perform better than traditional heuristics, as well as expand to new areas throughout the system stack such as hardware/circuit design and operating/runtime systems.”

5. There is a strong need for full-stack MLSys experts.

From my own experience, as well as conversations with peers in industry, one observation becomes apparent: companies are looking for employees who are not only knowledgeable about ML concepts and algorithms, but also competent at coding and deploying data pipelines in production. Various internal trainings and bootcamps are organized to teach ML to traditional software engineers, and data engineering to traditional data scientists, aiming to create the ideal full-stack ML expert who can develop end-to-end pipelines.

In addition to technical expertise, ML experts are expected to collaborate widely with backend engineers, data analysts, and product managers, communicating system-level requirements and development, to deliver a final ML product. The high complexity and unprecedented novelty of ML products are often accompanied by unexpected changes of scope, delayed timelines and milestones, constantly evolving requirements, and ad-hoc analysis and debugging. In the face of such challenges, communication and coordination are as important as technical knowledge in MLSys. 

Cover image by Gerd Altmann from Pixabay


  1. Hi Yu, thank you for this blog! I’m a DS from FB and I’m planning to take the same course, however, when visiting the course website, I cannot find video links, did you by any chance find course videos?

    1. Nope, I remember I only read the slides and recommended papers. But I was able to find some relevant videos on YouTube from the lecturer.
