AI and Machine Learning | Unreal Containers

Here's what you'll need:

- An Unreal Engine runtime image with support for GPU acceleration
- An environment configured for running containers with GPU acceleration

Contents

- Overview
- Key considerations
- Implementation guidelines
  - Choosing a communication mechanism
  - Deployment strategies

Overview

Generation of training data for machine learning models is the single most common use of the Unreal Engine within the context of scientific research. The algorithms underlying these models typically rely on GPU-based computation to achieve maximum performance, and GPU-accelerated containers are widely used for running machine learning workloads in the cloud. Unreal Engine containers allow simulations to be packaged and deployed alongside the machine learning models that interact with them, using the same familiar technologies and deployment pipeline. Container orchestration frameworks such as Kubernetes can be used to facilitate network-based or IPC-based communication between containers and to perform training or inference at scale.

Key considerations

- You will always need a container image with GPU acceleration support for training machine learning models, but whether you need GPU acceleration for your Unreal Engine simulations depends on whether they perform rendering in order to transmit image data to a model. If your simulations do not perform rendering then you can start them in headless mode by specifying the -nullrhi command-line flag, which allows them to run in container images without support for GPU acceleration.
- The choice of runtime image for your containers will depend on your deployment strategy. See the Deployment strategies section for a discussion of the relevant base image requirements.
- It is strongly recommended that you use Linux containers for machine learning workloads, since container orchestration systems such as Kubernetes do not yet support GPU-accelerated Windows containers.
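
As a rough illustration of the headless option described above, the following Python sketch builds the argument list for starting a packaged simulation and adds -nullrhi only when the simulation does not render. The function name and the launcher path are hypothetical and stand in for whatever your packaged project provides.

```python
import shlex


def build_launch_command(binary, performs_rendering, extra_args=()):
    """Return the argument list for starting a packaged simulation.

    `binary` is a hypothetical path to the packaged project's launcher.
    """
    command = [binary]
    if not performs_rendering:
        # -nullrhi starts the engine in headless mode, so the container
        # image does not need GPU acceleration support.
        command.append("-nullrhi")
    command.extend(extra_args)
    return command


if __name__ == "__main__":
    # Hypothetical launcher path, used purely for illustration.
    cmd = build_launch_command("/opt/simulation/Simulation.sh", performs_rendering=False)
    print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run() or used as the command of a container entrypoint.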

Implementation guidelines

Choosing a communication mechanism

There are a number of mechanisms by which Unreal Engine simulations can interact with software that encapsulates machine learning models. The choice of communication mechanism dictates the manner in which both the simulation and model can be packaged and deployed in containers, so developers should consider this carefully when designing new simulations or preparing existing simulations for containerisation.

Network-based communication

Network-based communication is by far the most flexible approach, since it allows the simulation and the model to communicate across different containers or even different underlying host systems. Socket-based network communication is supported natively by the Unreal Engine without the need to integrate additional third-party libraries. If you do decide to integrate additional communication middleware then the use of an RPC framework will allow you to design your system using a standard microservices architecture and leverage microservice-oriented features of container orchestration frameworks such as Kubernetes.
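
To make the socket-based option more concrete, here is a minimal sketch of what the model side of such a link might look like in Python. It assumes a simple length-prefixed framing scheme (a 4-byte big-endian length followed by the payload), which is an illustrative choice rather than anything prescribed by the Unreal Engine; the simulation would need to implement the same framing on its end of the socket.

```python
import socket
import struct

# Assumed wire format: 4-byte big-endian payload length, then the payload.
HEADER = struct.Struct("!I")


def send_message(sock, payload: bytes) -> None:
    """Send one length-prefixed message."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock, count: int) -> bytes:
    """Read exactly `count` bytes, since recv() may return fewer."""
    chunks = []
    remaining = count
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return _recv_exact(sock, length)


if __name__ == "__main__":
    # A connected socket pair stands in for the simulation and model containers.
    simulation_end, model_end = socket.socketpair()
    with simulation_end, model_end:
        send_message(simulation_end, b'{"observation": [0.1, 0.2]}')
        print(recv_message(model_end))
        send_message(model_end, b'{"action": 1}')
        print(recv_message(simulation_end))
```

Explicit framing matters because TCP is a byte stream: without a length prefix or delimiter, the receiver cannot tell where one observation or action ends and the next begins.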

IPC-based communication

IPC-based communication mechanisms such as shared memory can provide better performance than network-based communication when transmitting large quantities of data, albeit at the cost of reduced flexibility. Simulations and models communicating this way must be located on the same underlying host system, but they can still be packaged in separate containers that share an IPC namespace via a grouping mechanism such as a Kubernetes Pod. The Unreal Engine includes native support for named shared memory, but care must be taken to match the platform-specific implementation details when accessing shared memory in the model software to ensure full compatibility.
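
The sketch below illustrates the general idea of exchanging a frame through named shared memory, using Python's multiprocessing.shared_memory module on both ends for simplicity. The header layout (width, height and channel count as little-endian 32-bit integers) is a hypothetical convention invented for this example. A real integration would need the Unreal Engine side to create the region under a matching name and write the same layout, which is exactly the platform-specific detail the paragraph above warns about.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical layout: width, height, channels as little-endian uint32,
# followed immediately by the raw pixel bytes.
HEADER = struct.Struct("<III")


def write_frame(shm, width, height, channels, pixels: bytes) -> None:
    """Write a frame header and its pixel data into the shared region."""
    expected = width * height * channels
    if len(pixels) != expected:
        raise ValueError("pixel buffer does not match the frame dimensions")
    if HEADER.size + expected > shm.size:
        raise ValueError("shared memory region is too small for this frame")
    HEADER.pack_into(shm.buf, 0, width, height, channels)
    shm.buf[HEADER.size:HEADER.size + expected] = pixels


def read_frame(shm):
    """Read the header and return (width, height, channels, pixel bytes)."""
    width, height, channels = HEADER.unpack_from(shm.buf, 0)
    size = width * height * channels
    if HEADER.size + size > shm.size:
        raise ValueError("header describes a frame larger than the region")
    # bytes() copies the data out so the region can be closed afterwards.
    return width, height, channels, bytes(shm.buf[HEADER.size:HEADER.size + size])


if __name__ == "__main__":
    width, height, channels = 4, 2, 3
    frame_size = width * height * channels
    producer = shared_memory.SharedMemory(create=True, size=HEADER.size + frame_size)
    try:
        # The consumer attaches by name, as a separate process would.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            write_frame(producer, width, height, channels, bytes(frame_size))
            print(read_frame(consumer)[:3])
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()
```

This example omits synchronisation entirely. In practice the two sides also need a way to signal when a frame is complete, for example a sequence counter in the header or a separate notification channel.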

In-process communication

In-process communication is by far the least flexible and most brittle approach. Not only does in-process communication force the simulation and the model to run inside the same container, it also introduces significant complexities surrounding the integration of the model software into the Unreal Engine. This may involve the integration of third-party libraries and frameworks or even interpreters for complete programming languages. In most cases any performance benefits associated with this approach do not provide sufficient value to outweigh the cost of the engineering effort required to implement and maintain the integration, and as such this approach is not recommended.

Deployment strategies

Separate containers, loosely coupled

Supported communication mechanisms: network-based

Unsupported communication mechanisms: IPC-based, in-process

In this strategy, the simulation and the machine learning model are deployed in separate containers that are not grouped together in any way. This necessitates network-based communication, since the containers may be scheduled on different underlying host systems. The containers use network discovery to identify one another and typically operate in a client-server model.

If the simulation is not performing rendering then its container can use a runtime image without support for GPU acceleration and can be run on a CPU-only host system, whilst the model container uses a CUDA- or OpenCL-equipped base image and runs on a host system with one or more GPUs attached. If the simulation is performing rendering then its container will need to use a base image with OpenGL or Vulkan support and run on a GPU-equipped host system.
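
One possible way to express this strategy in Kubernetes is sketched below as a Python script that emits manifests as JSON, which kubectl accepts alongside YAML. Every name here is an assumption made for illustration: the image names, the port, the SIMULATION_ADDRESS environment variable, and the use of the nvidia.com/gpu resource (which presumes the NVIDIA device plugin is installed in the cluster). It also assumes the simulation image's entrypoint is the packaged project launcher, so that container arguments are appended to it.

```python
import json


def simulation_deployment(image, port, performs_rendering):
    """Deployment for the simulation, with a GPU only if it renders."""
    container = {
        "name": "simulation",
        "image": image,
        "ports": [{"containerPort": port}],
    }
    if performs_rendering:
        # Assumes the NVIDIA device plugin exposes GPUs as nvidia.com/gpu.
        container["resources"] = {"limits": {"nvidia.com/gpu": 1}}
    else:
        # Headless mode: the pod can be scheduled on a CPU-only host.
        container["args"] = ["-nullrhi"]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "simulation"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": "simulation"}},
            "template": {
                "metadata": {"labels": {"app": "simulation"}},
                "spec": {"containers": [container]},
            },
        },
    }


def simulation_service(port):
    """Service that gives the simulation a stable DNS name for discovery."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "simulation"},
        "spec": {
            "selector": {"app": "simulation"},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


def model_deployment(image, port, replicas):
    """Deployment for one or more GPU-backed model containers."""
    container = {
        "name": "model",
        "image": image,
        # Hypothetical variable read by the model software at startup.
        "env": [{"name": "SIMULATION_ADDRESS", "value": f"simulation:{port}"}],
        "resources": {"limits": {"nvidia.com/gpu": 1}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "model"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "model"}},
            "template": {
                "metadata": {"labels": {"app": "model"}},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    manifests = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            simulation_deployment("example/simulation:latest", 9000, performs_rendering=False),
            simulation_service(9000),
            model_deployment("example/model:latest", 9000, replicas=3),
        ],
    }
    print(json.dumps(manifests, indent=2))
```

Here the Service plays the role of the network discovery mechanism: model containers locate the simulation through its cluster DNS name regardless of which host each pod is scheduled on.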

This strategy is well-suited to scenarios where there exists a one-to-many relationship between a single simulation and multiple machine learning models, such as when multiple autonomous agents are interacting in a single shared virtual environment. Note that this strategy is not well-suited to scenarios where multiple agents each require rendered frames from a unique camera, since the GPUs attached to the simulation container will quickly become a bottleneck as the number of connected agents increases. In such scenarios, it is better to run an Unreal Engine dedicated server to coordinate shared state and have it communicate with multiple sets of paired containers that each tightly couple an agent with an Unreal Engine client that performs rendering on its behalf.

Separate containers, tightly coupled

Supported communication mechanisms: network-based, IPC-based

Unsupported communication mechanisms: in-process

In this strategy, the simulation and the machine learning model are deployed in separate containers that are grouped together using a mechanism such as a Kubernetes Pod. This ensures the containers will be scheduled on the same underlying host system and allows them to share their network and IPC namespaces, facilitating both network-based and IPC-based communication.

The container base image requirements for this strategy are the same as those for the loosely coupled strategy described above. Although both the simulation and the model will run together on a GPU-equipped host system, the size of the container image for the simulation can still be kept to a minimum by excluding GPU acceleration support if the simulation does not perform rendering.
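
As a sketch of how this grouping might look, the script below builds a Pod manifest containing both containers and prints it as JSON. The image names are placeholders, and the memory-backed emptyDir volume mounted at /dev/shm is an assumption: it is a common way to give both containers an explicitly sized shared memory area, but your cluster may not need it. The nvidia.com/gpu resource again presumes the NVIDIA device plugin.

```python
import json


def coupled_pod(simulation_image, model_image, performs_rendering, shm_size="1Gi"):
    """Pod that schedules the simulation and the model on the same host."""
    # Assumption: both containers exchange data through /dev/shm.
    shm_mount = {"name": "shared-memory", "mountPath": "/dev/shm"}

    simulation = {
        "name": "simulation",
        "image": simulation_image,
        "volumeMounts": [shm_mount],
    }
    if performs_rendering:
        # With the NVIDIA device plugin each container that requests a GPU
        # receives its own device, so this pod then needs two GPUs on the host.
        simulation["resources"] = {"limits": {"nvidia.com/gpu": 1}}
    else:
        # Headless simulation: no GPU request and a smaller base image.
        simulation["args"] = ["-nullrhi"]

    model = {
        "name": "model",
        "image": model_image,
        "resources": {"limits": {"nvidia.com/gpu": 1}},
        "volumeMounts": [shm_mount],
    }

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "simulation-with-model"},
        "spec": {
            "containers": [simulation, model],
            "volumes": [
                {
                    "name": "shared-memory",
                    "emptyDir": {"medium": "Memory", "sizeLimit": shm_size},
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = coupled_pod(
        "example/simulation:latest",
        "example/model:latest",
        performs_rendering=False,
    )
    print(json.dumps(pod, indent=2))
```

Because containers in a Pod share a network namespace, the model can also reach the simulation on localhost, so the same manifest works for network-based communication without a separate Service.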

This strategy is well-suited to scenarios where there exists a one-to-one relationship between simulations and machine learning models, or when each model is coupled with an Unreal Engine client that communicates with a single Unreal Engine dedicated server that coordinates shared state for a simulation.

Single container

Supported communication mechanisms: network-based, IPC-based, in-process

In this strategy, the simulation and the machine learning model are deployed together in a single container. Because the processes are running in the same container on the same underlying host system, all forms of communication are supported. However, this strategy also imposes a number of limitations that do not exist when using separate containers:

- This forces a one-to-one relationship between simulation instances and model instances. One-to-many relationships are not supported.
- This violates the guideline that containers should each encapsulate a single concern, which is widely accepted as an industry best practice.
- Modifications to either the simulation or the machine learning model will necessitate a rebuild of the single shared container image.
- If the simulation is performing rendering then the container base image will need to support both OpenGL and CUDA/OpenCL. If the simulation does not perform rendering then a CUDA- or OpenCL-equipped base image will be sufficient.
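
The relationship between the three deployment strategies and the communication mechanisms they support can be summarised as a small lookup table. The sketch below encodes the lists given in this section; the type and function names are illustrative only.

```python
from enum import Enum


class Mechanism(Enum):
    NETWORK = "network-based"
    IPC = "IPC-based"
    IN_PROCESS = "in-process"


class Strategy(Enum):
    LOOSELY_COUPLED = "separate containers, loosely coupled"
    TIGHTLY_COUPLED = "separate containers, tightly coupled"
    SINGLE_CONTAINER = "single container"


# Supported communication mechanisms for each deployment strategy.
SUPPORTED = {
    Strategy.LOOSELY_COUPLED: {Mechanism.NETWORK},
    Strategy.TIGHTLY_COUPLED: {Mechanism.NETWORK, Mechanism.IPC},
    Strategy.SINGLE_CONTAINER: {Mechanism.NETWORK, Mechanism.IPC, Mechanism.IN_PROCESS},
}


def strategies_supporting(mechanism):
    """Return the deployment strategies that can use the given mechanism."""
    return [strategy for strategy in Strategy if mechanism in SUPPORTED[strategy]]


if __name__ == "__main__":
    for mechanism in Mechanism:
        names = ", ".join(s.value for s in strategies_supporting(mechanism))
        print(f"{mechanism.value}: {names}")
```

Read in the other direction, the table shows why the choice of communication mechanism should be made early: committing to in-process communication leaves the single container strategy as the only option.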
