Overview

Run:AI provides an orchestration and management platform specifically engineered for AI and machine learning workloads, aiming to address the challenges of efficiently utilizing expensive GPU resources in shared environments. The platform integrates with existing Kubernetes clusters, abstracting the underlying hardware to offer capabilities such as GPU virtualization, dynamic resource allocation, and intelligent job scheduling. This approach is designed to reduce GPU idle time, improve throughput for ML training and inference, and simplify the operational overhead for MLOps teams and data scientists.

The core offering, the Run:AI Atlas Platform, targets enterprises and research institutions that manage significant GPU infrastructure for AI development. It enables multiple teams and users to share a common pool of GPUs without manual resource partitioning, helping to prevent bottlenecks and improve collaboration. By virtualizing GPUs, Run:AI supports fractional allocation: several smaller workloads can run concurrently on a single physical GPU, while a single large job can span multiple GPUs, or fractions of them, as needed. This flexibility matters for organizations running diverse ML experiments, from hyperparameter tuning to large-scale model training.
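To make the idea of fractional GPU sharing concrete, the following is a minimal Python sketch of first-fit placement, in which each request asks for a fraction of a device. It is an illustration only and not Run:AI's actual algorithm; the names Gpu and place_fractional, and the example job names, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """One physical GPU; capacity is tracked as a fraction (1.0 = whole device)."""
    name: str
    free: float = 1.0
    jobs: list = field(default_factory=list)


def place_fractional(requests, gpus):
    """First-fit placement of (job_name, gpu_fraction) requests.

    Returns a dict mapping job name -> GPU name, or None when no GPU has room.
    """
    placement = {}
    for job_name, fraction in requests:
        placement[job_name] = None
        for gpu in gpus:
            # Small tolerance so accumulated float rounding does not reject a fit.
            if fraction <= gpu.free + 1e-9:
                gpu.free -= fraction
                gpu.jobs.append(job_name)
                placement[job_name] = gpu.name
                break
    return placement


if __name__ == "__main__":
    gpus = [Gpu("gpu-0"), Gpu("gpu-1")]
    requests = [("notebook-a", 0.5), ("notebook-b", 0.25),
                ("tuning-c", 0.5), ("infer-d", 0.25)]
    for job, gpu_name in place_fractional(requests, gpus).items():
        print(f"{job} -> {gpu_name}")
```

In this sketch three of the four small workloads end up sharing gpu-0, which is the utilization gain that fractional allocation is meant to provide. A real system would also need memory isolation between the co-located workloads, which this model does not attempt to represent.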

For developers, Run:AI aims to provide a self-service experience for requesting and managing computational resources, integrating with common ML frameworks and tools. It supports various workload types, including interactive sessions, batch jobs, and distributed training. The platform also offers visibility into resource consumption and cluster performance through dashboards and reporting, which assists MLOps engineers in monitoring and optimizing their AI infrastructure. The acquisition of Run:AI by NVIDIA in 2024 further integrates its capabilities within the NVIDIA ecosystem, aligning with NVIDIA's strategy to provide end-to-end solutions for accelerated computing and AI development.

Key features

  • GPU Virtualization: Allows multiple users or jobs to share a single physical GPU, or a single job to utilize fractions of multiple GPUs, optimizing hardware utilization.
  • Dynamic Job Scheduling: Intelligently schedules AI workloads based on resource availability, priority, and defined policies to maximize throughput and minimize queue times.
  • Resource Management Policies: Enables administrators to set quotas, priorities, and access controls for different teams, users, and projects within a shared GPU cluster.
  • Kubernetes Integration: Operates as an extension to Kubernetes, leveraging its orchestration capabilities for containerized ML workloads while adding specialized GPU management.
  • Workload Management: Supports various AI workload types, including interactive development environments, batch training jobs, and distributed model training.
  • Visibility and Monitoring: Provides dashboards and reporting tools to track GPU utilization, job status, resource consumption, and overall cluster performance.
  • Framework Agnostic: Compatible with popular deep learning frameworks such as TensorFlow, PyTorch, and JAX, as well as MLOps tools.
  • Multi-Tenancy: Facilitates secure and isolated environments for multiple teams or departments sharing the same underlying GPU infrastructure.
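
The interaction between priorities, per-project quotas, and a shared GPU pool described in the list above can be sketched in a few lines of Python. This is a simplified model under stated assumptions (strict quotas, whole GPUs, a single scheduling pass) and not the Run:AI scheduler; the schedule function, the project names, and the job names are all hypothetical.

```python
import heapq
from collections import defaultdict


def schedule(jobs, quotas, total_gpus):
    """Pick jobs to run from a queue, highest priority first.

    jobs: list of (name, project, gpus_requested, priority); larger priority wins.
    quotas: dict of project -> maximum GPUs that project may hold at once.
    Returns (running, queued) as lists of job names.
    """
    # heapq is a min-heap, so negate priority; the index preserves submission
    # order among jobs with equal priority.
    heap = [(-prio, idx, name, project, gpus)
            for idx, (name, project, gpus, prio) in enumerate(jobs)]
    heapq.heapify(heap)

    used_by_project = defaultdict(int)
    free = total_gpus
    running, queued = [], []
    while heap:
        _, _, name, project, gpus = heapq.heappop(heap)
        within_quota = used_by_project[project] + gpus <= quotas.get(project, 0)
        if within_quota and gpus <= free:
            used_by_project[project] += gpus
            free -= gpus
            running.append(name)
        else:
            queued.append(name)
    return running, queued


if __name__ == "__main__":
    jobs = [
        ("train-a", "vision", 2, 5),
        ("train-b", "nlp", 2, 3),
        ("tune-c", "vision", 1, 4),
        ("eval-d", "nlp", 1, 1),
    ]
    running, queued = schedule(jobs, {"vision": 2, "nlp": 2}, total_gpus=3)
    print("running:", running)
    print("queued:", queued)
```

With three GPUs in the pool, train-a runs first, tune-c is held back by the vision quota, train-b is held back by lack of free GPUs, and the small eval-d job fills the remaining device. The sketch deliberately omits concerns a production scheduler has to handle, such as preemption and fairness over time.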

Pricing

Run:AI offers custom enterprise pricing for its Atlas Platform, tailored to the specific needs and scale of an organization's AI infrastructure. Prospective customers typically engage directly with Run:AI's sales team to discuss their requirements and obtain a personalized quote.

Product / Service | Pricing Model | Details
Run:AI Atlas Platform | Custom Enterprise Pricing | Tailored based on factors such as the number of GPUs managed, scale of deployment, and required features. Contact sales for a quote (as of 2026-05-07).

Common integrations

  • Kubernetes: Run:AI integrates directly with existing Kubernetes clusters to manage and orchestrate GPU resources. For details, refer to the Run:AI Kubernetes prerequisites documentation.
  • ML Frameworks (TensorFlow, PyTorch, JAX): The platform supports workloads built with common deep learning frameworks, allowing developers to submit jobs without modifying their code extensively.
  • Container Registries: Integrates with container registries like Docker Hub or private registries for pulling Docker images containing ML environments and models.
  • Version Control Systems (Git): Often used in conjunction with Git for managing ML code and experiment configurations.
  • MLOps Tools: Can be integrated into broader MLOps pipelines alongside tools for experiment tracking, data versioning, and model deployment, such as Weights & Biases for experiment tracking.

Alternatives

  • Kubeflow: An open-source project dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable.
  • Weights & Biases: An MLOps platform that provides tools for experiment tracking, model versioning, and dataset management, often used alongside resource orchestrators.
  • Domino Data Lab: An enterprise MLOps platform that provides an environment for data scientists to build, train, deploy, and manage models at scale, often including resource management capabilities.

Getting started

Getting started with Run:AI typically involves deploying its components onto an existing Kubernetes cluster and then submitting an ML workload. The following example demonstrates a basic PyTorch training job submission using the Run:AI CLI, assuming the Run:AI control plane and CLI are already configured.

# 1. Log in to Run:AI (if not already logged in)
runai login

# 2. Define your PyTorch training script (e.g., train.py)
#    This script would typically be packaged into a Docker image.
#    Example train.py (simplified):
#    import torch
#    import torch.nn as nn
#    import torch.optim as optim
#    # ... model definition, data loading, training loop ...
#    print("PyTorch training job started!")
#    model = nn.Linear(10, 1)
#    optimizer = optim.SGD(model.parameters(), lr=0.01)
#    loss_fn = nn.MSELoss()
#    # Simulate some training steps
#    for i in range(5):
#        dummy_input = torch.randn(1, 10)
#        dummy_target = torch.randn(1, 1)
#        output = model(dummy_input)
#        loss = loss_fn(output, dummy_target)
#        optimizer.zero_grad()
#        loss.backward()
#        optimizer.step()
#        print(f"Step {i+1}, Loss: {loss.item():.4f}")
#    print("PyTorch training job finished!")

# 3. Submit a training job to Run:AI
#    This command assumes your training script is in a Docker image
#    (e.g., 'your_registry/your_image:latest') and requests 1 GPU.
#    Note: '--interactive' marks a workload as an interactive session
#    rather than an unattended training job; it does not stream logs,
#    so it is omitted here. Flag syntax differs between CLI versions
#    (in recent versions '--command' is a switch and the command itself
#    follows '--'), so confirm with 'runai submit --help'.

runai submit \
  --name my-pytorch-train-job \
  --image your_registry/your_image:latest \
  --gpu 1 \
  --command -- python train.py

# Illustrative result (exact CLI wording varies by version):
# the CLI reports that the job was submitted. The output of train.py
# goes to the job's logs (see step 5) and would look like:
# PyTorch training job started!
# Step 1, Loss: X.XXXX
# Step 2, Loss: X.XXXX
# ...
# PyTorch training job finished!

# 4. Check job status
runai list jobs

# 5. View logs of a specific job
runai logs my-pytorch-train-job

# 6. Delete the job when no longer needed
runai delete job my-pytorch-train-job

This example illustrates submitting a single-GPU PyTorch training job. Run:AI also supports multi-GPU and distributed training configurations, which can be specified using additional CLI parameters or through YAML manifests for more complex deployments. The --image flag points to a Docker image containing your application code and dependencies, which is a standard practice for deploying containerized ML workloads on Kubernetes-based platforms.
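
When submissions are scripted, for example to launch one job per hyperparameter setting, it can help to assemble the command in code rather than by string concatenation. The sketch below builds the argument list for the submission shown above using only the --name, --image, --gpu, and --command flags. The helper build_submit_command is hypothetical, and the sketch assumes a CLI version in which --command is a switch and the container command follows "--"; check runai submit --help for the version you have installed.

```python
import shlex


def build_submit_command(name, image, gpus, command):
    """Return the argv list for a `runai submit` call.

    `command` is the container command as a list, e.g. ["python", "train.py"].
    """
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    argv = [
        "runai", "submit",
        "--name", name,
        "--image", image,
        "--gpu", str(gpus),
    ]
    if command:
        # Assumption: '--command' is a switch and the command follows '--'.
        argv += ["--command", "--", *command]
    return argv


if __name__ == "__main__":
    argv = build_submit_command(
        "my-pytorch-train-job",
        "your_registry/your_image:latest",
        1,
        ["python", "train.py"],
    )
    # Print a copy-pasteable shell line. To actually submit, pass the list
    # to subprocess.run(argv, check=True) on a machine where the CLI is set up.
    print(shlex.join(argv))
```

Keeping the arguments as a list avoids shell-quoting mistakes when job names or commands contain spaces, and the same helper can be called in a loop with different names and commands.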