Ray: Core Architecture

Series Outline

Note: this is part of an ongoing series of posts about Ray. The full series is:

  1. Ray: A Distributed Computing Platform For Machine Learning
  2. Ray’s Ecosystem
  3. This post, where we start getting into technical details


Ray and Its Programming Model
Ray, an open source project which originated at UC Berkeley, provides a high level programming model and supporting infrastructure to support distributed machine learning applications.

For the application programmer, Ray provides two key abstractions: Tasks, which are stateless functions in the serveless model, and Actors, which are stateful objects. Actors can be called multiple times and references to an actor can be passed between Actors and Tasks, permitting complex and stateful communication patterns.

Deployment Options for Ray Clusters
Ray process can be spawned on a single machine (e.g. your laptop), a manually configured cluster, a dynamically provisioned cluster in the cloud, or on top of another orchestration layer, including Kubernetes, Hadoop’s Yarn, and Slurm.

Key Components

The picture above shows the key components of Ray. A Ray cluster consists of a single Head Node and multiple Worker Nodes. When running on a laptop, you will only have a Head Node. The Head Node manages the Global Cluster Store, a Redis database storing cluster metadata. The driver is the user program which launches and coordinates the user’s tasks and actors. It typically also runs on the head node.


Each node, whether the head or a worker, runs a Raylet, a process which schedules and coordinates the activities of that node. Worker processes on each node run the actual Actors and Tasks. In addition, each node has an Object Store, a shared memory space managed by the Raylet. This is used for communication between the workers of a given node and is also available to Ray programs as a distributed key/value store. 



Once the Actors and Tasks of a Ray application have been started by the Ray infrastructure, the Driver, Actors, and Tasks communicate directly with each other. This is done using the gPRC protocol and the Object Store. The details of interprocess communication are hidden from the user — via the Ray API, these asynchronous calls look like local function or method calls. Under the covers, the send code in Ray will determine the mechanism to reach the destination (shared memory or gRPC), serialize the parameters, and send the message.


Coming Soon

In my next post, I will look at how Ray improves upon Serverless Computing. Stay tuned for more!