GenAI Updates AI infrastructure, AI training, cloud computing, cloud infrastructure, cluster management, container orchestration, distributed systems, GKE, Google Cloud, Kubernetes, Nvidia, scalability Mike 22. November 2025 0 Kommentare

How we built a 130,000-node GKE cluster | Google Cloud Blog

Imagine a Kubernetes cluster with 130,000 nodes, yes really. Google Cloud recently ran an experimental GKE cluster at that size, twice the officially supported limit, and the engineering choices behind it are worth your attention. Read the original post here: How we built a 130,000-node GKE cluster.

Here’s the short version, in plain talk. Scaling nodes is one thing, but you also need to scale Pod creation, scheduling throughput, storage and networking. During the test they sustained about 1,000 Pods per second at peak and stored over a million objects in an optimized distributed store. That’s wild, and it changes how you think about infrastructure.

A few technical highlights that actually matter to you:
– The API server uses a consistent, snapshottable in-memory cache (KEP-2340 and KEP-4988), so common reads don’t hammer the datastore, keeping the control plane responsive.
– Storage relies on a Spanner-backed key-value system to handle huge QPS, especially for node leases and health checks.
– Scheduling moved beyond Pod-by-Pod. Tools like Kueue offer job-level queueing and gang semantics, which makes large AI training jobs behave like coherent units, not a pile of individual Pods.
– For data, Cloud Storage FUSE with parallel downloads and Managed Lustre help feed models quickly, lowering latency by up to 70 percent in some cases.

One unexpected reality, and this stuck with me, is power. A single NVIDIA GB200 needs around 2,700 watts, so ultra-large AI clusters force you to think about power and multi-cluster distribution across data centers.

What’s next, you ask? Expect better workload-aware scheduling in core Kubernetes, improved RDMA networking (managed DRANET), and multi-cluster orchestration like MultiKueue, all making extreme scale more practical and more reliable. I’ve run clusters in the thousands, and seeing this work makes me optimistic. These innovations won’t just help hyperscalers, they’ll make everyday GKE clusters faster, more resilient and easier to manage.

Deutsch

Stell dir einen Kubernetes-Cluster mit 130.000 Knoten vor. Google Cloud hat genau das experimentell betrieben, und die Architekturentscheidungen dahinter sind spannend. Hier geht’s zum Originalartikel: How we built a 130,000-node GKE cluster.

Kurz und klar: Mehr Knoten allein reicht nicht. Du brauchst schnellere Pod-Erstellung, bessere Scheduler-Strategien, robusten Storage und leistungsfähiges Networking. In den Tests wurden rund 1.000 Pods pro Sekunde erreicht, und über eine Million Objekte wurden im verteilten Storage verwaltet.

Wesentliche Punkte:
– Der API-Server nutzt einen konsistenten, snapshottbaren In-Memory-Cache (KEP-2340, KEP-4988), so werden viele Lesezugriffe aus dem Speicher bedient und die zentrale Datenbank wird entlastet.
– Ein Spanner-basiertes Key-Value-Backend sorgt für hohe QPS und stabilen Betrieb, etwa bei Lease-Updates für Node-Health-Checks.
– Scheduling verschiebt sich von Pod- zu Workload-orientiert. Kueue bringt Job-Queueing und Gang-Scheduling, was besonders für AI-Trainingsjobs einen großen Vorteil darstellt.
– Für Datenzugriff helfen Cloud Storage FUSE mit parallelen Downloads und Managed Lustre, die Latenz deutlich zu reduzieren.

Ein Punkt, den ich so nicht erwartet habe, ist die Stromfrage. Moderne GPUs brauchen viel Leistung, also wird Multi-Cluster-Verteilung über Rechenzentren fast zwingend, wenn du jenseits großer Skalen arbeitest.

Ausblick: Bessere workload-aware Scheduler, Managed RDMA (DRANET) und MultiKueue werden kommen, das macht extreme Skalierung praktikabler. Ich finde das ermutigend, weil diese Investitionen auch normale GKE-Cluster robuster und performanter machen.