Use Cases: AI and Machine Learning

Run Unreal Engine simulations alongside machine learning workloads in the cloud.

Overview

Generation of training data for machine learning models is the single most common use of the Unreal Engine within the context of scientific research. The algorithms underlying these models typically rely on GPU-based computation to achieve maximum performance, and GPU-accelerated containers powered by the NVIDIA Container Toolkit (formerly NVIDIA Docker) are widely used for running machine learning workloads in the cloud. Unreal Engine containers allow simulations to be packaged and deployed alongside the machine learning models that interact with them, using the same familiar technologies and deployment pipeline. Container orchestration frameworks such as Kubernetes can be used to facilitate network-based or IPC-based communication between containers and to perform training or inference at scale.

Key considerations

Implementation guidelines

Choosing a communication mechanism

There are a number of mechanisms by which Unreal Engine simulations can interact with software that encapsulates machine learning models. The choice of communication mechanism dictates the manner in which both the simulation and model can be packaged and deployed in containers, so developers should consider this carefully when designing new simulations or preparing existing simulations for containerisation.

Network-based communication

Network-based communication is by far the most flexible approach, since it allows the simulation and the model to communicate across different containers or even different underlying host systems. Socket-based network communication is supported natively by the Unreal Engine without the need to integrate additional third-party libraries. If you do decide to integrate additional communication middleware then the use of an RPC framework will allow you to design your system using a standard microservices architecture and leverage microservice-oriented features of container orchestration frameworks such as Kubernetes.
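As an illustration of the model side of a socket-based link, the following is a minimal sketch in Python, assuming a hypothetical wire format in which each message is a 4-byte big-endian length followed by a UTF-8 JSON body. The format and the function names are assumptions made for this example and are not something the Unreal Engine prescribes; the simulation would need to implement the same framing on its end of the socket.

```python
import json
import socket
import struct

# Hypothetical framing: 4-byte big-endian length prefix, then a JSON body.
LENGTH = struct.Struct("!I")


def send_message(sock: socket.socket, payload: dict) -> None:
    """Serialise a payload and send it as a single length-prefixed frame."""
    body = json.dumps(payload).encode("utf-8")
    sock.sendall(LENGTH.pack(len(body)) + body)


def _recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes, since recv() may return partial data."""
    data = bytearray()
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("socket closed before the full frame arrived")
        data.extend(chunk)
    return bytes(data)


def recv_message(sock: socket.socket) -> dict:
    """Receive one length-prefixed frame and decode its JSON body."""
    (length,) = LENGTH.unpack(_recv_exact(sock, LENGTH.size))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))
```

A model process would typically obtain the socket with socket.create_connection() and then alternate between recv_message() for observations and send_message() for actions. Explicit framing is needed because TCP is a byte stream and does not preserve message boundaries.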

IPC-based communication

IPC-based communication mechanisms such as shared memory can provide better performance than network-based communication when transmitting large quantities of data, albeit at the cost of reduced flexibility. Simulations and models communicating this way must be located on the same underlying host system, but they can still be packaged in separate containers that share an IPC namespace via a grouping mechanism such as a Kubernetes Pod. The Unreal Engine includes native support for named shared memory, but care must be taken to match the platform-specific implementation details when accessing shared memory in the model software to ensure full compatibility.
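The following sketch shows how model software written in Python might exchange a frame through named shared memory using the standard library. The region name, the 8-byte header layout (width and height as little-endian 32-bit integers), and the 4-bytes-per-pixel assumption are all hypothetical choices for this example. A real integration would need to match whatever name and layout the simulation actually uses, along with the platform-specific naming details noted above.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical layout: width and height as little-endian uint32 values,
# followed by width * height * 4 bytes of pixel data.
HEADER = struct.Struct("<II")


def write_frame(name: str, width: int, height: int, pixels: bytes):
    """Create a named region and write one frame into it (producer side)."""
    size = HEADER.size + len(pixels)
    shm = shared_memory.SharedMemory(name=name, create=True, size=size)
    HEADER.pack_into(shm.buf, 0, width, height)
    shm.buf[HEADER.size:HEADER.size + len(pixels)] = pixels
    return shm  # the caller must close() and unlink() the region when done


def read_frame(name: str):
    """Attach to an existing named region and copy one frame out of it."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        width, height = HEADER.unpack_from(shm.buf, 0)
        # Read only the bytes described by the header, because the region
        # may be rounded up to a page boundary on some platforms.
        count = width * height * 4
        pixels = bytes(shm.buf[HEADER.size:HEADER.size + count])
    finally:
        shm.close()
    return width, height, pixels
```

In this sketch both sides are Python for the sake of being self-contained. In practice the producer would be the simulation, and some form of synchronisation (not shown here) would be needed so that the reader never observes a partially written frame.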

In-process communication

In-process communication is by far the least flexible and most brittle approach. Not only does in-process communication force the simulation and the model to run inside the same container, it also introduces significant complexities surrounding the integration of the model software into the Unreal Engine. This may involve the integration of third-party libraries and frameworks or even interpreters for complete programming languages. In most cases any performance benefits associated with this approach do not provide sufficient value to outweigh the cost of the engineering effort required to implement and maintain the integration, and as such this approach is not recommended.

Deployment strategies

Separate containers, loosely coupled

In this strategy, the simulation and the machine learning model are deployed in separate containers that are not grouped together in any way. This necessitates network-based communication, since the containers may be scheduled on different underlying host systems. The containers use service discovery (for example, DNS names provided by the container orchestration framework) to identify one another and typically operate in a client-server model.
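The discovery step can be sketched as follows for a model container running under Kubernetes. The sketch assumes the simulation is exposed through a Kubernetes Service with the hypothetical name "simulation" and a hypothetical default port of 9000. It relies on the environment variables that Kubernetes injects for Services that already exist when a Pod starts, and falls back to the Service name itself, which cluster DNS resolves for Pods in the same namespace.

```python
import os


def simulation_endpoint(environ=os.environ, service="simulation", default_port=9000):
    """Return (host, port) for the simulation, as seen from the model container.

    The service name and default port are hypothetical values for this sketch.
    """
    # Kubernetes derives the variable prefix from the Service name by
    # upper-casing it and replacing dashes with underscores.
    prefix = service.upper().replace("-", "_")
    host = environ.get(f"{prefix}_SERVICE_HOST", service)
    port = int(environ.get(f"{prefix}_SERVICE_PORT", default_port))
    return host, port
```

The returned pair can be passed directly to socket.create_connection(). Outside Kubernetes the same function still works if the container runtime provides name resolution for the chosen service name.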

If the simulation is not performing rendering then its container can use a base image without support for GPU acceleration and can be run on a CPU-only host system, whilst the model container uses a CUDA or OpenCL equipped base image and runs on a host system with one or more GPUs attached. If the simulation is performing rendering then its container will need to use a base image with OpenGL or Vulkan support (recent versions of the Unreal Engine render with Vulkan under Linux) and run on a GPU-equipped host system.

This strategy is well-suited to scenarios where there exists a one-to-many relationship between a single simulation and multiple machine learning models, such as when multiple autonomous agents are interacting in a single shared virtual environment. Note that this strategy is not well-suited to scenarios where multiple agents each require rendered frames from a unique camera, since the GPUs attached to the simulation container will quickly become a bottleneck as the number of connected agents increases. In such scenarios, it is better to run an Unreal Engine dedicated server to coordinate shared state and have it communicate with multiple sets of paired containers that each tightly couple an agent with an Unreal Engine client that performs rendering on its behalf.

Separate containers, tightly coupled

In this strategy, the simulation and the machine learning model are deployed in separate containers that are grouped together using a mechanism such as a Kubernetes Pod. This ensures the containers will be scheduled on the same underlying host system and allows them to share their network and IPC namespaces, facilitating both network-based and IPC-based communication.
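As a rough illustration of such a grouping, the sketch below builds a Kubernetes Pod manifest as a Python dictionary and serialises it to JSON, which kubectl accepts in place of YAML. The Pod name, container names, and image names are hypothetical. The sketch also assumes that the cluster exposes GPUs through the NVIDIA device plugin under the resource name nvidia.com/gpu, and it mounts a memory-backed emptyDir volume at /dev/shm in both containers, which is a common way to provide a shared and adequately sized shared memory area.

```python
import json


def build_pod(simulation_image: str, model_image: str, simulation_renders: bool = False) -> dict:
    """Build a Pod manifest that tightly couples a simulation with a model."""
    gpu = {"limits": {"nvidia.com/gpu": 1}}  # assumes the NVIDIA device plugin
    shm_mount = {"name": "shm", "mountPath": "/dev/shm"}

    simulation = {"name": "simulation", "image": simulation_image, "volumeMounts": [shm_mount]}
    if simulation_renders:
        # Only request a GPU for the simulation when it performs rendering.
        simulation["resources"] = gpu

    model = {"name": "model", "image": model_image, "resources": gpu, "volumeMounts": [shm_mount]}

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "simulation-with-model"},  # hypothetical name
        "spec": {
            "containers": [simulation, model],
            "volumes": [{"name": "shm", "emptyDir": {"medium": "Memory"}}],
        },
    }


if __name__ == "__main__":
    # Hypothetical image names; pipe the output to `kubectl apply -f -`.
    print(json.dumps(build_pod("example/simulation:latest", "example/model:latest"), indent=2))
```

Note that when both containers request a GPU in this way, each is allocated its own device, so the host system would need at least two GPUs. Containers in the same Pod can also reach one another over localhost, so network-based communication needs no additional discovery step.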

The container base image requirements for this strategy are the same as those for the loosely coupled strategy described above. Although both the simulation and the model will run together on a GPU-equipped host system, the size of the container image for the simulation can still be kept to a minimum by excluding GPU acceleration support if the simulation does not perform rendering.

This strategy is well-suited to scenarios where there exists a one-to-one relationship between simulations and machine learning models, or when each model is coupled with an Unreal Engine client that communicates with a single Unreal Engine dedicated server that coordinates shared state for a simulation.

Single container

In this strategy, the simulation and the machine learning model are deployed together in a single container. Because the processes are running in the same container on the same underlying host system, all forms of communication are supported. However, this strategy also imposes a number of limitations that do not exist when using separate containers: