Reading Notes: Infrastructure for Usable Machine Learning: The Stanford DAWN Project

Below are my reading notes on the manuscript here.


building machine learning applications remains prohibitively time-consuming and expensive for all but the best-trained, best-funded engineering organizations

new ML applications are very expensive to build

  • Every major new ML product, such as Apple Siri, Amazon Alexa, or Tesla Autopilot, requires large and costly teams of domain experts, data scientists, data engineers, and DevOps
  • Even within organizations that have successfully employed ML, ML remains a rare and expensive commodity reserved for a small subset of teams and applications
  • Many ML models require huge amounts of training data, and obtaining such training data is highly challenging in many application domains
  • Finally, once an ML product is built, it requires substantial effort to deploy, operate, and monitor at scale, especially if critical business processes will rely on it.


ML technology is at a similar stage to early digital computers, where armies of white-clad technicians labored to keep a small handful of machines operating in production: ML technology clearly has tremendous potential, but today, ML-powered applications are far too expensive to build for most domains.

Goal of DAWN

DAWN: Data Analytics for What’s Next

Goal is not to improve ML algorithms, but instead to make ML usable so that small teams of non-ML experts can apply ML to their problems, achieve high-quality results, and deploy production systems that can be used in critical applications.

Key observation

our key observation is that most of the effort in industrial ML applications is not spent in devising new learning algorithms or models but is instead spent in other areas that are especially in need of better tools and infrastructure: data preparation, feature selection and extraction, and productionization.

  • Data preparation means acquiring, producing and cleaning enough training data to feed into an ML algorithm: without this quality data, ML algorithms fall flat.
  • Feature selection and extraction means identifying the data characteristics and behaviors of interest: what aspects of data are most important, and what would a domain expert implicitly or explicitly say about a given data point?
  • Productionization means deploying, monitoring and debugging a robust product: how can an organization check that the ML algorithm deployed is working, debug issues that arise, and make the system robust to changes in data?
    • In the large teams that build ML products such as Siri, most of the individuals work on data preparation, feature selection and extraction, and productionization, as well as the distributed systems infrastructure to drive these tasks at scale, not on training ML models.
    • However, thus far, these critical steps in the ML product pipeline have received far less attention than model training and new model tweaks—both from the research community and the open source software community
    • we see substantial opportunity to greatly reduce the effort required by these tasks via the development of new software tools and systems infrastructure


  • How can we enable anyone with domain expertise to build their own production-quality data products (without requiring a team of PhDs in machine learning, big data, or distributed systems, and without understanding the latest hardware)?

3 main tenets in our design philosophy:

  • target end-to-end ML workflows
    • ML-powered application development consists of far more than model training.
    • the bulk of challenges in developing new ML-powered applications are not in model training but are instead in data preparation, feature selection/extraction, and productionization (serving, monitoring, debugging, etc.). Systems should target the entire, end-to-end ML workflow.
  • Empower domain experts
    • The highest-impact ML applications will have to be developed by domain experts, not ML experts
    • today, few systems allow these domain experts to encode their domain knowledge so it can be leveraged via automation and machine learning models.
    • Systems should empower users who are not ML experts, by providing them tools for onerous tasks such as labeling, feature engineering and data augmentation.
  • Optimize end-to-end

DAWN research directions

New interfaces to ML

  • Easing model specification via observational ML (data preparation, feature engineering)
    • Can we build ML systems that learn high-quality models simply by observing domain experts?
    • By providing simple interfaces for these users to specify their beliefs about data in rule form (e.g., regular expressions), we can combine a small number of these rules and apply them to massive datasets.
  • Explaining results to humans (feature engineering, productionization)
  • Debugging and observability (feature engineering, productionization)
    • ML model “drift,” in which phenomena evolve but models do not, can be catastrophic
    • As ML models are deployed, they must be monitored and updated
    • Subsequently surfacing and correcting for deviations from expected behavior will require advances in both interfaces and model training
  • Assessing and enriching data quality (data preparation, feature engineering)
    • if we start to explicitly model the quality of each data source, then we can automatically identify the data sources that are most in need of enrichment, thus reducing the cost of data cleaning and acquisition
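The idea above of letting domain experts state their beliefs as rules (e.g., regular expressions) and applying a handful of them to a large dataset can be sketched roughly as follows. This is only an illustration under my own assumptions: the spam/ham task, the three patterns, and the names `RULES`, `apply_rules`, `majority_label`, and `weak_label` are all hypothetical, and the simple majority vote is a stand-in I chose for combining rules, not necessarily how the DAWN systems combine them.

```python
import re
from collections import Counter

ABSTAIN = None  # a rule that does not match casts no vote

# Hypothetical rules a domain expert might write (assumed for illustration).
RULES = [
    (re.compile(r"\bfree\b|\bwinner\b", re.IGNORECASE), "spam"),
    (re.compile(r"https?://\S+"), "spam"),
    (re.compile(r"\bmeeting\b|\bagenda\b", re.IGNORECASE), "ham"),
]


def apply_rules(text):
    """Return one vote per rule: the rule's label if it matches, else ABSTAIN."""
    return [label if pattern.search(text) else ABSTAIN for pattern, label in RULES]


def majority_label(votes):
    """Combine votes by simple majority; abstain when there are no votes or a tie."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_label(texts):
    """Label every text in a collection using only the expert-written rules."""
    return [majority_label(apply_rules(t)) for t in texts]
```

The noisy labels produced this way could then serve as training data for an ordinary classifier, and the examples where the rules abstain or disagree are natural candidates for a human to review.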

End-to-End ML Systems

it is possible to design end-to-end systems that encapsulate the whole ML workflow and hide internals from users

  • Classification over massive streams (data preparation, feature engineering, productionization)
  • Personalized recommendations (feature engineering, productionization)
    • Personalization is key to many popular ML-powered applications, and the literature is replete with algorithms for personalized recommendation. However, despite the simple inputs and outputs to recommendation engines, practitioners still have to build each engine from scratch, chaining together low-level algorithms and tools
  • Combining inference and actuation (feature engineering, data preparation, productionization):
    • Today, this combination of inference/prediction (i.e., predicting what will occur) and actuation/decision-making (i.e., taking action based on a prediction) is almost always performed by separate systems (often an automated inference engine and a human “decider”), except in a small handful of applications such as autonomous vehicles
    • How do we integrate actuation and decision-making as a first-class citizen in ML pipelines?
      • e.g., send a POST request to an automated control center
  • Unifying SQL, Graphs, and Linear Algebra (productionization)
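To make the "inference plus actuation" item above more concrete, here is a minimal sketch of a pipeline step that makes a prediction and then triggers an action itself, rather than handing the prediction to a separate decider. Everything specific is assumed for illustration: the overheating scenario, the threshold, the function names, and the control-center URL are hypothetical, and `predict_overheat` is a toy stand-in for a trained model.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    score: float
    action: Optional[str]


def predict_overheat(temperature_c: float) -> float:
    """Toy stand-in for a model: map a temperature to a risk score in [0, 1]."""
    return max(0.0, min(1.0, (temperature_c - 60.0) / 40.0))


def build_post(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a control endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def infer_and_act(
    temperature_c: float,
    actuate: Callable[[dict], None],
    threshold: float = 0.8,
) -> Decision:
    """Run inference, then call the actuator only if the score crosses the threshold."""
    score = predict_overheat(temperature_c)
    if score >= threshold:
        actuate({"action": "shutdown", "score": score})
        return Decision(score, "shutdown")
    return Decision(score, None)
```

Passing the actuator in as a callable keeps the decision logic testable without a network. In a real deployment the callable might send the request with `urllib.request.urlopen(build_post("http://control.example/actions", payload))`, where the URL is a placeholder.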

New Substrates for ML

Training and deploying ML quickly and in a cost-effective manner requires the development of new computational substrates, from language support to distributed runtimes and accelerated hardware

  • Compilers for end-to-end optimization (feature engineering, productionization)
    • Modern ML applications are comprised of an increasingly diverse mix of libraries and systems such as TensorFlow, Apache Spark, scikit-learn, and Pandas.
    • Even if each of these libraries is optimized in isolation, real pipelines combine multiple libraries, so production use at scale usually requires a software engineering team to rewrite the whole application in low-level code.
  • Reduced precision and inexact processing (productionization)
    • we can design chips that are specialized for ML, operating at lower precision and allowing fabrication at high yield and execution at extremely low power.
  • Reconfigurable hardware for core kernels (feature engineering, productionization)
    • in 2017, compute is an increasingly critical bottleneck for data-hungry ML analyses, both at training time and inference time
    • Given the impending collision of CPUs and on-chip FPGAs, reconfigurable hardware with high-level programmability functionality will be increasingly important.
  • Distributed runtimes (productionization)
    • Combining ML with distributed systems is a real headache
      • is a model misbehaving because it is distributed to too many servers, or because it is poorly specified?
      • What’s the optimal amount of asynchrony?
      • What does the optimal distributed training framework really look like?
      • harnessing both intra-device (e.g., FPGA, GPU, vectorized) and inter-device (e.g., cluster compute) parallelism to consume all possible resources (i.e., automatically and dynamically offloading to different hardware within a cluster)
      • how can distributed asynchronous execution benefit us at inference time (i.e., in model serving)?
      • Can we leverage new computational substrates like serverless computing (e.g., Amazon Lambda) to further scale-out inferential procedures?
      • What is the unified programming model for distributed execution?

Our primary success metric will be usability, comprising:

  • i) the time and cost to specify an ML application (including data sources and features of interest)
  • ii) the time and cost to execute the application in production (including hardware and human resources to monitor the ML models), and
  • iii) the benefit to the end-user expert
