The past few decades have seen almost unimaginable advances in compute performance and efficiency, enabled by Moore’s Law and underpinned by scale-out commodity hardware and loosely coupled software. This architecture has delivered online services to billions globally and put virtually all of human knowledge at our fingertips.
But the next computing revolution will demand much more. Fulfilling the promise of AI requires a step-change in capabilities far exceeding the advancements of the internet era. To achieve this, we as an industry must revisit some of the foundations that drove the previous transformation and innovate collectively to rethink the entire technology stack. Let’s explore the forces driving this upheaval and lay out what this architecture must look like.
From commodity hardware to specialized compute
For decades, the dominant trend in computing has been the democratization of compute through scale-out architectures built on nearly identical, commodity servers. This uniformity allowed for flexible workload placement and efficient resource utilization. The demands of gen AI, heavily reliant on predictable mathematical operations on massive datasets, are reversing this trend.
We are now witnessing a decisive shift towards specialized hardware — including ASICs, GPUs, and tensor processing units (TPUs) — that deliver orders of magnitude improvements in performance per dollar and per watt compared to general-purpose CPUs. This proliferation of domain-specific compute units, optimized for narrower tasks, will be critical to driving the continued rapid advances in AI.
Beyond ethernet: The rise of specialized interconnects
These specialized systems will often require “all-to-all” communication, with terabit-per-second bandwidth and nanosecond latencies that approach local memory speeds. Today’s networks, largely based on commodity Ethernet switches and TCP/IP protocols, are ill-equipped to handle these extreme demands.
As a result, to scale gen AI workloads across vast clusters of specialized accelerators, we are seeing the rise of specialized interconnects, such as the inter-chip interconnect (ICI) for TPUs and NVLink for GPUs. These purpose-built networks prioritize direct memory-to-memory transfers and use dedicated hardware to speed information sharing among processors, effectively bypassing the overhead of traditional, layered networking stacks.
This move towards tightly integrated, compute-centric networking will be essential to overcoming communication bottlenecks and scaling the next generation of AI efficiently.
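To make the communication bottleneck concrete, here is a minimal back-of-envelope sketch in Python. The model size, chip count and per-chip link speeds are purely illustrative assumptions, and real systems overlap communication with computation and use topology-aware collectives, but the scaling shows why terabit-class, memory-to-memory interconnects matter.

```python
# Back-of-envelope estimate of per-step gradient synchronization time.
# Illustrative numbers only; real systems overlap communication with compute
# and use topology-aware collectives.

def allreduce_seconds(model_params: float, bytes_per_param: int,
                      num_chips: int, link_gbps: float) -> float:
    """Ring all-reduce moves roughly 2 * (N-1)/N of the gradient bytes per chip."""
    grad_bytes = model_params * bytes_per_param
    bytes_on_wire = 2 * (num_chips - 1) / num_chips * grad_bytes
    return bytes_on_wire / (link_gbps * 1e9 / 8)  # convert Gb/s to bytes/s

params = 70e9                    # hypothetical 70B-parameter model
for gbps in (100, 800, 4800):    # commodity NIC vs. high-end accelerator links
    t = allreduce_seconds(params, 2, num_chips=256, link_gbps=gbps)
    print(f"{gbps:>5} Gb/s per chip -> ~{t:.2f} s per synchronization step")
```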
Breaking the memory wall
For decades, the performance gains in computation have outpaced the growth in memory bandwidth. While techniques like caching and stacked SRAM have partially mitigated this, the data-intensive nature of AI is only exacerbating the problem.
The insatiable need to feed increasingly powerful compute units has led to high bandwidth memory (HBM), which stacks DRAM directly on the processor package to boost bandwidth and reduce latency. However, even HBM faces fundamental limitations: The physical chip perimeter restricts total dataflow, and moving massive datasets at terabit speeds creates significant energy constraints.
These limitations highlight the critical need for higher-bandwidth connectivity and underscore the urgency for breakthroughs in processing and memory architecture. Without these innovations, our powerful compute resources will sit idle waiting for data, dramatically limiting efficiency and scale.
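One way to see the problem is a simple roofline-style calculation: attainable performance is capped either by peak compute or by how many operations a workload performs per byte it pulls from memory. The peak figures below are assumed placeholders, not any particular chip.

```python
# A minimal roofline-style sketch: is a kernel compute-bound or memory-bound?
# The peak numbers below are hypothetical placeholders, not any specific chip.

PEAK_FLOPS = 1.0e15        # 1 PFLOP/s of matrix math (assumed)
HBM_BANDWIDTH = 3.0e12     # 3 TB/s of memory bandwidth (assumed)

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(PEAK_FLOPS, arithmetic_intensity * HBM_BANDWIDTH)

# Arithmetic intensity = FLOPs performed per byte moved from memory.
for ai in (2, 50, 500):
    util = attainable_flops(ai) / PEAK_FLOPS
    print(f"intensity {ai:>3} FLOPs/byte -> {util:.0%} of peak compute usable")
```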
From server farms to high-density systems
Today’s advanced machine learning (ML) models often rely on carefully orchestrated calculations across tens to hundreds of thousands of identical compute elements, consuming immense power. This tight coupling and fine-grained synchronization at the microsecond level impose new demands. Unlike systems that embrace heterogeneity, ML computations require homogeneous elements; mixing generations would bottleneck faster units. Communication pathways must also be pre-planned and highly efficient, since a delay in a single element can stall an entire process.
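A toy simulation (all numbers invented for illustration) shows how even rare per-chip hiccups become step-level slowdowns once tens of thousands of elements must synchronize, because a synchronous step finishes only when the slowest participant does.

```python
# A toy illustration of why one slow element stalls a synchronous step:
# every chip must finish before the collective completes, so step time is
# the maximum (not the mean) across all participants.

import random

def synchronous_step_time(num_chips: int, base_ms: float,
                          straggler_ms: float, p_straggle: float) -> float:
    times = [base_ms + (straggler_ms if random.random() < p_straggle else 0.0)
             for _ in range(num_chips)]
    return max(times)

random.seed(0)
steps = [synchronous_step_time(50_000, base_ms=10.0,
                               straggler_ms=5.0, p_straggle=1e-4)
         for _ in range(100)]
print(f"steps slowed by at least one straggler: {sum(t > 10.0 for t in steps)}/100")
```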
These extreme demands for coordination and power are driving the need for unprecedented compute density. Minimizing the physical distance between processors becomes essential to reduce latency and power consumption, paving the way for a new class of ultra-dense AI systems.
This drive for extreme density and tightly coordinated computation fundamentally alters the optimal design for infrastructure, demanding a radical rethinking of physical layouts and dynamic power management to prevent performance bottlenecks and maximize efficiency.
A new approach to fault tolerance
Traditional fault tolerance relies on redundancy among loosely connected systems to achieve high uptime. ML computing demands a different approach.
First, the sheer scale of computation makes over-provisioning too costly. Second, model training is a tightly synchronized process, where a single failure can cascade to thousands of processors. Finally, advanced ML hardware often pushes to the boundary of current technology, potentially leading to higher failure rates.
Instead, the emerging strategy involves frequent checkpointing — saving computation state — coupled with real-time monitoring, rapid allocation of spare resources and quick restarts. The underlying hardware and network design must enable swift failure detection and seamless component replacement to maintain performance.
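As a sketch of that control flow, assuming hypothetical `train_step`, `save_state` and `load_latest` helpers rather than any particular framework's API:

```python
# A minimal sketch of checkpoint-and-restart fault tolerance. The helper
# functions passed in are placeholders, not any specific library's API.

CHECKPOINT_EVERY = 100   # steps between snapshots (workload-dependent)

def run_training(total_steps, state, train_step, save_state, load_latest):
    step = state["step"]
    while step < total_steps:
        try:
            state = train_step(state)                 # one synchronized step
            step = state["step"]
            if step % CHECKPOINT_EVERY == 0:
                save_state(state)                     # durable snapshot
        except RuntimeError:
            # On hardware failure: reschedule onto spare capacity, reload the
            # most recent snapshot, and lose at most CHECKPOINT_EVERY steps.
            state = load_latest()
            step = state["step"]
    return state
```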
A more sustainable approach to power
Today and looking forward, access to power is a key bottleneck for scaling AI compute. While traditional system design focuses on maximum performance per chip, we must shift to an end-to-end design focused on delivered, at-scale performance per watt. This approach is vital because it considers all system components — compute, network, memory, power delivery, cooling and fault tolerance — working together seamlessly to sustain performance. Optimizing components in isolation severely limits overall system efficiency.
As we push for greater performance, individual chips require more power, often exceeding the cooling capacity of traditional air-cooled data centers. This necessitates a shift towards more infrastructure-intensive, but ultimately more efficient, liquid cooling solutions, and a fundamental redesign of data center cooling infrastructure.
Beyond cooling, conventional redundant power sources, like dual utility feeds and diesel generators, create substantial financial costs and slow capacity delivery. Instead, we must combine diverse power sources and storage at multi-gigawatt scale, managed by real-time microgrid controllers. By leveraging AI workload flexibility and geographic distribution, we can deliver more capability without expensive backup systems needed only a few hours per year.
This evolving power model enables real-time response to power availability — from shutting down computations during shortages to advanced techniques like frequency scaling for workloads that can tolerate reduced performance. All of this requires real-time telemetry and actuation at levels not currently available.
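As a rough sketch of what such actuation might look like, with thresholds, units and the policy itself invented purely for illustration:

```python
# A minimal sketch of power-aware actuation: given real-time telemetry about
# available power, choose between full speed, frequency scaling, and pausing
# behind a checkpoint. Thresholds and hooks are hypothetical placeholders.

def power_policy(available_mw: float, demand_mw: float) -> str:
    if available_mw >= demand_mw:
        return "run_full_speed"
    if available_mw >= 0.6 * demand_mw:
        # Frequency/voltage scaling trades throughput for power for
        # workloads that can tolerate reduced performance.
        return "scale_frequency"
    return "checkpoint_and_pause"

for available in (1200.0, 800.0, 300.0):       # megawatts of grid + storage
    print(available, "->", power_policy(available, demand_mw=1000.0))
```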
Security and privacy: Baked in, not bolted on
A critical lesson from the internet era is that security and privacy cannot be effectively bolted onto an existing architecture. Threats from bad actors will only grow more sophisticated, so protections for user data and proprietary intellectual property must be built into the fabric of the ML infrastructure. AI will ultimately enhance attacker capabilities, which means we must ensure it simultaneously supercharges our defenses.
This includes end-to-end data encryption, robust data lineage tracking with verifiable access logs, hardware-enforced security boundaries to protect sensitive computations and sophisticated key management systems. Integrating these safeguards from the ground up will be essential for protecting users and maintaining their trust. Real-time monitoring of what will likely be petabits/sec of telemetry and logging will be key to identifying and neutralizing needle-in-the-haystack attack vectors, including those coming from insider threats.
Speed as a strategic imperative
The rhythm of hardware upgrades has shifted dramatically. Unlike the incremental rack-by-rack evolution of traditional infrastructure, deploying ML supercomputers requires a fundamentally different approach. This is because ML compute does not easily run on heterogeneous deployments; the compute code, algorithms and compiler must be specifically tuned to each new hardware generation to fully leverage its capabilities. The rate of innovation is also unprecedented, often delivering a factor of two or more in performance year over year from new hardware.
Therefore, instead of incremental upgrades, a massive and simultaneous rollout of homogeneous hardware, often across entire data centers, is now required. With annual hardware refreshes delivering integer-factor performance improvements, the ability to rapidly stand up these colossal AI engines is paramount.
The goal must be to compress timelines from design to fully operational 100,000-plus chip deployments, enabling efficiency improvements while supporting algorithmic breakthroughs. This necessitates radical acceleration and automation of every stage, demanding a manufacturing-like model for these infrastructures. From architecture to monitoring and repair, every step must be streamlined and automated to leverage each hardware generation at unprecedented scale.
Meeting the moment: A collective effort for next-gen AI infrastructure
The rise of gen AI marks not just an evolution, but a revolution that requires a radical reimagining of our computing infrastructure. The challenges ahead — in specialized hardware, interconnected networks and sustainable operations — are significant, but so too is the transformative potential of the AI this infrastructure will enable.
Our compute infrastructure will be unrecognizable within a few years, which means we cannot simply improve on the blueprints we have already designed. Instead, we must collectively, from research to industry, embark on an effort to re-examine the requirements of AI compute from first principles, building a new blueprint for the underlying global infrastructure. This in turn will result in fundamentally new capabilities, from medicine to education to business, at unprecedented scale and efficiency.
Amin Vahdat is VP and GM for machine learning, systems and cloud AI at Google Cloud.