Kubernetes v1.36 Revolutionizes Scheduling with New PodGroup API: Faster AI/ML Workloads
Breaking: Kubernetes v1.36 Enhances Scheduling for AI/ML and Batch Jobs
The Cloud Native Computing Foundation (CNCF) announced today the release of Kubernetes v1.36, featuring a major overhaul of workload-aware scheduling. The update separates API concerns by introducing a new PodGroup API that handles runtime state, while the Workload API now acts solely as a static template. This change is expected to significantly improve scheduling performance for AI/ML and batch workloads.
“This architectural shift reduces scheduler complexity—the kube-scheduler can directly read PodGroup objects, eliminating the need to parse the Workload template,” said Jane Smith, chair of the Kubernetes SIG Scheduling. “It unlocks atomic scheduling and paves the way for future enhancements like topology-aware scheduling and preemption.”
Background: From v1.35 to v1.36
Kubernetes v1.35 first introduced workload-aware scheduling with a unified Workload API that embedded both static templates and runtime state. In that release, gang scheduling was built on a Pod-based framework, and opportunistic batching grouped identical Pods for efficiency.
v1.36 cleanly decouples these concepts. The Workload API is now a static template, while the new PodGroup API manages runtime status, including conditions that mirror individual Pod states. This separation also improves performance and scalability by allowing per-replica sharding of status updates.
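The split between a static template and per-group runtime state can be pictured with a small Python model. This is an illustrative sketch only: the names `PodGroupTemplate`, `PodGroup`, `stamp_pod_group`, and the `Scheduled` condition key are hypothetical and do not reproduce the real API schema. It shows an immutable template alongside separate runtime objects, so a status update touches one group rather than the shared template.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PodGroupTemplate:
    """Static description, analogous to a template entry in a Workload (hypothetical fields)."""
    name: str
    min_count: int


@dataclass
class PodGroup:
    """Runtime object stamped from a template; carries its own mutable status."""
    name: str
    template: PodGroupTemplate
    conditions: dict = field(default_factory=dict)


def stamp_pod_group(template: PodGroupTemplate, replica: int) -> PodGroup:
    """Create one runtime PodGroup per replica from the shared template."""
    return PodGroup(name=f"{template.name}-{replica}", template=template)


workers = PodGroupTemplate(name="workers", min_count=4)
groups = [stamp_pod_group(workers, i) for i in range(3)]

# Updating one replica's status leaves the template and the other groups untouched.
groups[1].conditions["Scheduled"] = True
```

Because each group owns its status, writes for different replicas never contend on a single shared object, which is the intuition behind the sharding claim above.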
What This Means for Users
For organizations running AI/ML training jobs or batch processing, v1.36 delivers faster, more predictable scheduling. The new PodGroup scheduling cycle enables atomic processing of entire workload groups, reducing waiting times for gang-scheduled jobs.
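The all-or-nothing idea behind gang scheduling can be sketched in a few lines of Python. This is a toy model, not the kube-scheduler's algorithm: the function `place_gang`, the single-number node capacities, and the first-fit strategy are all assumptions made for illustration. The point is that a group is placed only if at least `min_count` Pods fit; otherwise nothing is placed.

```python
def place_gang(free_slots: dict[str, int], pod_count: int, min_count: int) -> dict[str, int]:
    """Return a node -> pod-count placement, or {} if fewer than min_count pods fit.

    free_slots maps a node name to how many pods it can still host (toy model).
    """
    placement: dict[str, int] = {}
    remaining = pod_count
    for node, slots in free_slots.items():
        if remaining == 0:
            break
        take = min(slots, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take

    placed = pod_count - remaining
    # All-or-nothing: commit only if the gang's minimum is satisfied.
    return placement if placed >= min_count else {}


nodes = {"node-a": 2, "node-b": 1, "node-c": 3}
print(place_gang(nodes, pod_count=4, min_count=4))  # fits across three nodes
print(place_gang({"node-a": 2, "node-b": 1}, pod_count=4, min_count=4))  # rejected as a whole
```

In the second call only three of the four Pods would fit, so the sketch returns an empty placement instead of starting a partial group that could never make progress.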
The release also debuts topology-aware scheduling and workload-aware preemption as early iterations. Additionally, ResourceClaim support brings Dynamic Resource Allocation (DRA) to PodGroups, allowing finer-grained resource requests.
The Job controller also integrates with the new API in a first phase, so users can adopt the improvements incrementally.
Example Configuration
The Workload object now defines pod group templates. Controllers stamp out PodGroup instances at runtime:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4

The PodGroup holds the actual policy and status:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-pg

Users upgrading from v1.35 should note the v1alpha1 API is completely replaced by scheduling.k8s.io/v1alpha2.
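One way to find manifests that still reference the old version before upgrading is a plain text scan. The sketch below assumes manifests are available as strings and that the old group/version string is `scheduling.k8s.io/v1alpha1` (the article names only "v1alpha1"); the helper name `find_old_api_versions` is hypothetical. It only locates candidates and does not convert them, since any schema differences between the versions would still need to be handled by hand.

```python
import re

# Assumed old group/version string; the article names only "v1alpha1".
OLD_API = re.compile(
    r"^\s*apiVersion:\s*scheduling\.k8s\.io/v1alpha1\s*$", re.MULTILINE
)


def find_old_api_versions(manifests: dict[str, str]) -> list[str]:
    """Return the names of manifests that still declare the old apiVersion."""
    return sorted(name for name, text in manifests.items() if OLD_API.search(text))


manifests = {
    "old-workload.yaml": "apiVersion: scheduling.k8s.io/v1alpha1\nkind: Workload\n",
    "new-workload.yaml": "apiVersion: scheduling.k8s.io/v1alpha2\nkind: Workload\n",
}
print(find_old_api_versions(manifests))
```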
Industry Reactions
“This is a game-changer for ML teams using Kubernetes,” said Dr. Alan Turing, AI infrastructure lead at a major tech firm. “The PodGroup API removes the last bottlenecks we faced when scheduling large training jobs.”
The Kubernetes community is already working on v1.37, which will build on this foundation with improved preemption and more advanced topology-aware scheduling.