Click to learn more about author Ion Stoica.
Machine learning platform designers
need to meet current challenges and plan for future workloads.
As machine learning gains a foothold in more and more companies, teams are struggling with the intricacies of managing the machine learning lifecycle.
The typical starting point is to
give each data scientist a Jupyter notebook backed by a GPU instance in the
cloud and to have a separate team manage deployment and serving, but this approach
breaks down as the complexity of the applications and the number of deployments
As a result, more teams are looking for machine learning platforms. Several startups and cloud providers are beginning to offer end-to-end machine learning platforms, including AWS (SageMaker), Azure (Machine Learning Studio), Databricks (MLflow), Google (Cloud AI Platform), and others. Many other companies choose to build their own, including Uber (Michelangelo), Airbnb (BigHead), Facebook (FBLearner), Netflix, and Apple (Overton).
Many users are building ML tools and platforms. This is an active space: Several startups and cloud platforms are beginning to offer end-to-end machine learning platforms, including Amazon, Databricks, Google, and Microsoft. But conversations with people in the community keep driving users back to this tool to solve many challenges left unsolved.
Ray’s collection of libraries can be used alongside other ML platform components. And unlike monolithic platforms, users have the flexibility to use one or more of the existing libraries or can use it to build their own libraries. A partial list of machine learning platforms that already incorporate such features include:
In this post, we will share insights
derived from conversations with many ML platform builders. We will list
features that will be critical to ensuring that your ML platform is
well-positioned for modern AI applications. We will also make the case that
developers building ML platforms and ML components should consider such a
platform because it has an ecosystem of standalone libraries that can
be used to address some of the items we list below.
Developers and machine learning engineers use a variety of tools and programming languages (R, Python, Julia, SAS, etc.). But with the rise of deep learning, Python has become the dominant programming language for machine learning. So if anything, an ML platform needs to support Python and the Python ecosystem.
As a practical matter, developers and machine learning engineers rely on many different Python libraries and tools. The most widely used libraries include deep learning tools (TensorFlow, PyTorch), machine learning and statistical modeling libraries (scikit-learn, statsmodels), NLP tools (spaCy, Hugging Face, AllenNLP), and model tuning (Hyperopt, Tune).
Because it integrates seamlessly in
the Python ecosystem, many developers are leveraging Ray for building machine
learning tools. It is a general-purpose distributed computing platform that can
be used to easily scale existing Python libraries and applications. It also has a growing collection of standalone
libraries available to Python
As we noted in The Future of Computing is Distributed, the “demands of machine learning applications are increasing at a breakneck speed.” The rise of deep learning and new workloads means that distributed computing will be common for machine learning. Unfortunately, many developers have relatively little experience in distributed computing.
Scaling and distributed computation
are areas where Ray has been helpful to many users we’ve spoken with. It allows
developers to focus on their applications instead of on the intricacies of
distributed computing. Using it brings several benefits to developers needing
to scale machine learning applications:
- It’s a platform that lets you easily scale your existing applications to a cluster. This could be as simple as scaling your favorite library to a compute cluster (see this recent post on using it to scale scikit-learn). Or it could involve using the API to scale an existing program to a cluster. The latter scenario is one we’ve seen happen for applications in NLP, online learning, fraud detection, financial time-series, OCR, and many other use cases.
- RaySGD simplifies distributed training for PyTorch and TensorFlow. This is good news for the many companies and developers who struggle with training or tuning large neural networks.
- Instead of spending time on DevOps, a built-in cluster launcher makes it simple to set up a cluster.
Extensibility for New Workloads
Modern AI platforms are notoriously compute hungry. In The Future of Computing is Distributed post referenced above, we noted that model tuning is an important part of the machine learning development process:
“You don’t train a model just once. Typically, the quality of a model depends on a variety of hyperparameters, such as the number of layers, the number of hidden units, and the batch size. Finding the best model often requires searching among various hyperparameter settings. This process is called hyperparameter tuning, and it can be very expensive.” Ion Stoica
Developers can choose from several libraries for tuning models. One of the more popular tools is Tune, a scalable hyperparameter tuning library built on top of Ray. Tune runs on a single node or on a cluster and has quickly become one of the more popular libraries in this ecosystem.
Reinforcement learning (RL) is another area worth highlighting. Many of the recent articles about RL pertain to gameplay (Atari, Go, multiplayer video games) or to applications in industrial settings (e.g., data center efficiency). But, as we’ve previously noted, there are emerging applications in recommendations and personalization, simulation and optimization, financial time series, and public policy.
RL is compute-intensive, complex to implement and scale, and, as such, many developers will want to simply use libraries. Ray provides a simple, highly scalable library (RLlib) that developers and machine learning engineers across several organizations are already using in production.
Tools Designed for Teams
As companies begin to use and deploy
more machine learning models, teams of developers will need to be able to
collaborate with each other. They will need access to platforms that
enable both sharing and discovery. When considering an ML platform,
consider the key stages of model development and operations, and assume
that teams of people with different backgrounds will collaborate during each of
For example, feature stores (first introduced by Uber in 2017) are useful because they allow developers to share and discover features that they might otherwise not have thought about. Teams also need to be able to collaborate during the model development lifecycle. This includes managing, tracking, and reproducing experiments. The leading open source project in this area is MLflow, but we’ve come across users of other tools like Weights & Biases and Comet, as well as users who have built their own tools to manage ML experiments.
Enterprises will require additional
features — including security and access control. Model governance and model
catalogs (analogs of similar systems for managing data) are also going to be
required as teams of developers build and deploy more models.
First Class MLOps
MLOps is a set of practices focused on productionizing the machine learning lifecycle. It is a relatively new term that draws ideas from continuous integration (CI) and continuous deployment (CD), two widely used software development practices. A recent Thoughtworks post listed some key considerations for establishing CD for machine learning (CD4ML). Some key items for CI/CD for machine learning include reproducibility, experiment management and tracking, model monitoring and observability, and more. There are a few startups and open source projects that offer MLOps solutions, including Datatron, Verta, TFX, and MLflow.
Ray has components that would be
useful for companies moving towards CI/CD or building CI/CD tools for machine
learning. It already has libraries for key stages of the ML lifecycle: training (RaySGD), tuning (Tune),
and deployment (Serve). Having access to libraries that work
seamlessly together will allow users to more readily bring CI/CD methods into
their MLOps practice.
We listed key elements that machine
learning platforms should possess. Our baseline assumptions are that Python
will remain the language of choice for ML, distributed computing will increasingly
be needed, new workloads like hyperparameter tuning and RL will need to be
supported, and tools for enabling collaboration and MLOps need to be available.
With these assumptions in mind, we
believe that Ray will be the foundation of future ML platforms. Many developers
and engineers are already using it for their machine learning applications,
including training, optimization, and serving. It not only addresses current
challenges like scale and performance but is well-positioned to support future
workloads and data types. Ray, and the growing number of libraries built on top
of it, will help scale ML platforms for years to come. We look forward to
seeing what you build.
Credit: Source link