What is Determined AI used for?

Determined AI is used for distributed deep learning training, hyperparameter optimization, and managing GPU resources for machine learning workloads. It helps streamline the ML experiment lifecycle.

Is Determined AI open source?

Yes, Determined AI offers an open-source community edition that can be self-hosted. The enterprise version includes additional features and support.

What deep learning frameworks does Determined AI support?

Determined AI integrates with popular deep learning frameworks such as TensorFlow, Keras, and PyTorch.

How does Determined AI handle hyperparameter optimization?

Determined AI includes built-in hyperparameter optimization algorithms like asynchronous successive halving and population-based training to efficiently search for optimal model configurations.

Who owns Determined AI?

Hewlett Packard Enterprise (HPE) acquired Determined AI in 2021, and the platform is now part of the HPE Machine Learning Development System.

Can I deploy Determined AI on Kubernetes?

Yes, Determined AI can be deployed on Kubernetes clusters, leveraging Kubernetes for container orchestration and resource management.

Determined AI — Distributed Deep Learning Training Platform

Overview

Determined AI is a machine learning platform focused on accelerating deep learning training and model development. It provides an integrated environment for managing GPU clusters, orchestrating distributed training jobs, and performing hyperparameter optimization. The platform is designed to address challenges associated with scaling deep learning workloads, such as efficient resource utilization and experiment reproducibility.

The core of Determined AI's offering is its ability to abstract away much of the complexity involved in distributed training. Developers define experiments using a Python SDK, and the platform handles the underlying infrastructure management, including data sharding, model parallelism, and fault tolerance. This allows ML engineers to focus on model architecture and data rather than operational overhead. In a distributed training setup, for instance, the platform allocates GPUs across multiple machines and keeps the model replicas synchronized, which is the same scaling problem that PyTorch's DistributedDataParallel documentation describes for data-parallel training.
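
To make the data-parallel idea concrete, here is a minimal pure-Python sketch of what "shard the data, then synchronize" means. It is not Determined's implementation: the function names (shard_indices, local_gradient, all_reduce_mean, data_parallel_step) are hypothetical, the "workers" are simulated in a loop, and the model is a one-parameter linear fit chosen only to keep the arithmetic visible.

```python
from typing import List, Sequence


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices one worker is responsible for (strided split)."""
    return list(range(rank, num_samples, world_size))


def local_gradient(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: Sequence[float]) -> float:
    """Average one value per worker, standing in for a real all-reduce."""
    return sum(values) / len(values)


def data_parallel_step(
    w: float, xs: Sequence[float], ys: Sequence[float], world_size: int, lr: float
) -> float:
    """Run one simulated data-parallel SGD step and return the updated weight."""
    grads = []
    for rank in range(world_size):
        idx = shard_indices(len(xs), world_size, rank)
        grads.append(local_gradient(w, [xs[i] for i in idx], [ys[i] for i in idx]))
    # Every worker applies the same averaged gradient, so replicas stay identical.
    return w - lr * all_reduce_mean(grads)
```

With equally sized shards, the averaged per-worker gradient equals the full-batch gradient, which is why every replica can apply the same update and remain in sync.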

Determined AI is particularly suited for organizations and teams that conduct frequent deep learning experiments, require efficient utilization of expensive GPU resources, and need to maintain reproducibility across model iterations. Its features support the entire ML experiment lifecycle, from initial model development and hyperparameter search to monitoring and managing deployed models. The platform offers a community edition that can be self-hosted, giving developers an entry point to evaluate its capabilities before considering enterprise deployments. HPE acquired Determined AI in 2021 and integrated its capabilities into the broader HPE Machine Learning Development System, its enterprise-grade offering.

The platform aims to reduce the time spent on infrastructure setup and management, allowing researchers and engineers to iterate on models more quickly. By providing a structured approach to experiment tracking and resource scheduling, it helps teams avoid common pitfalls such as resource contention and inconsistent experiment environments. This is particularly relevant in environments where multiple data scientists or ML engineers share computational resources and need to track numerous experiments simultaneously.
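
As an illustration of how a shared scheduler can avoid resource contention, the sketch below splits a fixed pool of GPU slots among competing experiments using a simple max-min fair policy. This is an assumption-laden toy, not Determined's scheduler: the function name fair_share and the demand dictionary are hypothetical.

```python
from typing import Dict


def fair_share(total_slots: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Split integer GPU slots so no experiment gets more than it asked for
    and spare capacity is spread evenly over those that still want more."""
    allocation = {name: 0 for name in demands}
    remaining = total_slots
    while remaining > 0:
        # Experiments whose demand is not yet met, in a stable order.
        unsatisfied = [n for n in sorted(demands) if allocation[n] < demands[n]]
        if not unsatisfied:
            break  # everyone is satisfied; leftover slots stay idle
        for name in unsatisfied:
            if remaining == 0:
                break
            allocation[name] += 1
            remaining -= 1
    return allocation


# Three experiments compete for 8 slots; the small one is fully served.
print(fair_share(8, {"a": 2, "b": 8, "c": 8}))  # {'a': 2, 'b': 3, 'c': 3}
```

Handing out one slot per round keeps the example short; a production scheduler would also account for priorities, preemption, and slots that must be co-located on one machine.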

Key features

Distributed Training: Automates the distribution of deep learning workloads across multiple GPUs and machines, supporting data parallelism and model parallelism for frameworks like TensorFlow and PyTorch.
Hyperparameter Optimization (HPO): Implements various HPO algorithms, including asynchronous successive halving and population-based training, to efficiently find optimal model configurations.
Resource Management: Provides dynamic scheduling and allocation of GPU resources across multiple users and experiments, optimizing utilization and reducing idle time.
Experiment Tracking and Reproducibility: Automatically logs experiment configurations, metrics, checkpoints, and code versions, enabling full reproducibility of results.
Model Versioning and Checkpointing: Manages model checkpoints throughout training, allowing for rollback to previous states and tracking model evolution.
Deep Learning Framework Integration: Offers native integration with popular frameworks such as TensorFlow, Keras, and PyTorch, allowing developers to use their existing codebases.
Open Source Community Edition: A self-hostable version available for evaluation and use, providing access to core features for individual developers and small teams.
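
The hyperparameter optimization feature above mentions asynchronous successive halving. The sketch below shows the simpler synchronous form of successive halving, to convey the core idea of giving more training budget only to the best-performing configurations. It is an illustration, not Determined's searcher: successive_halving, toy_loss, and the eta parameter are hypothetical names, and toy_loss is a made-up stand-in for a real validation metric.

```python
import math
from typing import Callable, Dict, List

Config = Dict[str, float]


def successive_halving(
    configs: List[Config],
    evaluate: Callable[[Config, int], float],
    min_budget: int = 1,
    eta: int = 2,
) -> Config:
    """Keep the best 1/eta of the configs each round, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score each surviving config at the current budget (lower is better).
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]


def toy_loss(cfg: Config, budget: int) -> float:
    """Made-up loss: best near learning_rate = 0.01, improving with more budget."""
    return abs(math.log10(cfg["learning_rate"]) + 2.0) + 1.0 / budget


candidates = [{"learning_rate": lr} for lr in (0.0001, 0.001, 0.01, 0.1)]
best = successive_halving(candidates, toy_loss)
print(best)  # {'learning_rate': 0.01}
```

The asynchronous variant applies the same promotion rule without waiting for a whole round to finish, which keeps GPUs busy when trials complete at different times.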

Pricing

Determined AI offers a community edition for self-hosted deployments and custom enterprise pricing for its platform. The community edition provides core features suitable for individual use or smaller teams.

Product/Edition	Details	Price	As of Date
Determined AI Community Edition	Self-hosted, open-source version with core features for distributed training and HPO.	Free	2026-05-07
Determined AI Enterprise Platform	Managed service or enterprise deployment with advanced features, support, and scalability options.	Custom pricing via sales contact	2026-05-07

For specific enterprise pricing details, contact the vendor directly via their pricing page.

Common integrations

TensorFlow: Native integration for defining and training models within the Determined AI platform.
PyTorch: Support for PyTorch models and training workflows, including distributed training.
Keras: Compatibility with Keras API for model definition and execution.
NVIDIA GPUs: Optimized for NVIDIA GPU clusters for high-performance deep learning.
Kubernetes: Can be deployed on Kubernetes clusters for container orchestration and resource management.
MLflow: Integrates with MLflow for experiment tracking and model management, as described in the Determined AI MLflow integration documentation.

Alternatives

Weights & Biases: A platform for ML experiment tracking, visualization, and collaboration, focusing on MLOps.
MLflow: An open-source platform for managing the end-to-end machine learning lifecycle, including experiment tracking, reproducible runs, and model deployment.
Kubeflow: An open-source project dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable.

Getting started

To begin with Determined AI, you can start by installing the client library and defining a simple experiment. The following Python code snippet demonstrates how to define a basic PyTorch model and an associated experiment configuration for training on the Determined AI platform. This example assumes you have a Determined AI cluster running and accessible.

First, ensure you have the Determined AI client library installed:

pip install determined

Next, define your model and training logic in a Python file (e.g., model_def.py) as a class that inherits from determined.pytorch.PyTorchTrial:

import torch
import torch.nn as nn
import torch.nn.functional as F
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext

class MyMNISTTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext):
        self.context = context
        # Wrapping lets Determined handle device placement and distributed training.
        self.model = self.context.wrap_model(Mnist_Net())
        self.optimizer = self.context.wrap_optimizer(
            torch.optim.Adam(self.model.parameters(), lr=self.context.get_hparam("learning_rate"))
        )

    def build_training_data_loader(self):
        # In a real scenario, load your dataset here.
        # For simplicity, this example uses random dummy data.
        train_dataset = torch.utils.data.TensorDataset(
            torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,))
        )
        # PyTorchTrial expects determined.pytorch.DataLoader rather than the torch one.
        return DataLoader(train_dataset, batch_size=self.context.get_hparam("batch_size"))

    def build_validation_data_loader(self):
        val_dataset = torch.utils.data.TensorDataset(
            torch.randn(20, 1, 28, 28), torch.randint(0, 10, (20,))
        )
        return DataLoader(val_dataset, batch_size=self.context.get_hparam("batch_size"))

    def train_batch(self, batch, epoch_idx, batch_idx):
        data, target = batch
        output = self.model(data)
        loss = F.nll_loss(output, target)
        # Use the context helpers instead of calling loss.backward() and optimizer.step() directly.
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss.item()}

    def evaluate_batch(self, batch, batch_idx):
        data, target = batch
        output = self.model(data)
        loss = F.nll_loss(output, target, reduction="sum").item()
        pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
        correct = pred.eq(target.view_as(pred)).sum().item()
        # These per-batch values are reduced across batches to form validation metrics.
        return {"loss": loss, "correct": correct, "num_samples": len(data)}

class Mnist_Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
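
The input size of 9216 for fc1 in Mnist_Net follows from the layer shapes. The short calculation below walks through that arithmetic for a 28x28 input; conv_out is a hypothetical helper written for this note, using the standard output-size formula for unpadded convolutions and pooling.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                  # MNIST images are 28x28
side = conv_out(side, kernel=3)            # conv1: 28 -> 26
side = conv_out(side, kernel=3)            # conv2: 26 -> 24
side = conv_out(side, kernel=2, stride=2)  # max_pool2d(x, 2): 24 -> 12
flattened = 64 * side * side               # conv2 produces 64 channels
print(flattened)  # 9216
```

If you change the kernel sizes, add padding, or feed images of a different resolution, the first argument of nn.Linear in fc1 has to be recomputed the same way.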

Then, create an experiment configuration file (e.g., experiment.yaml):

name: my_mnist_experiment

entrypoint: model_def:MyMNISTTrial

max_restarts: 0

resources:
  slots_per_trial: 1

hyperparameters:
  # Sampled on a log scale between 10^-4 and 10^-1 (0.0001 to 0.1).
  learning_rate:
    type: log
    base: 10
    minval: -4
    maxval: -1
  batch_size: 64

searcher:
  name: random
  # Validation metric returned by evaluate_batch.
  metric: loss
  smaller_is_better: true
  max_trials: 1
  max_length:
    batches: 100

Finally, submit your experiment from the command line:

det experiment create experiment.yaml .

This command will submit the experiment to your Determined AI cluster, which will then handle the training process, resource allocation, and result tracking. You can monitor the experiment's progress through the Determined AI Web UI or CLI, as detailed in the Determined AI documentation.
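
The same submission can also be scripted from Python. The sketch below assumes the determined package's experimental client module and a reachable master; submit_and_wait is a hypothetical wrapper name, the master URL you pass in is your own, and the exact client behavior may differ between Determined releases, so treat this as a starting point rather than a reference.

```python
def submit_and_wait(master_url: str, config_path: str = "experiment.yaml", model_dir: str = "."):
    """Submit an experiment through the Python SDK and block until it finishes."""
    # Imported lazily so this file can be loaded without the determined package installed.
    from determined.experimental import client

    # Assumes credentials are already cached, e.g. from an earlier `det user login`.
    client.login(master=master_url)
    experiment = client.create_experiment(config=config_path, model_dir=model_dir)
    print(f"Created experiment {experiment.id}")
    # wait() polls the master until the experiment reaches a terminal state.
    return experiment.wait()
```

Calling submit_and_wait with the address of your cluster's master mirrors the det experiment create command above, with the added step of waiting for the final state.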
