DCI and the 10x Opportunity in AI Capex

BedRock
May 18

The role of DCI (data center interconnect) has changed. Once computing power no longer fits inside a single data center, demand stops tracking user count alone and is amplified by the number of nodes, the bandwidth of each connection, and the number of topology layers.

GPUs, HBM, liquid cooling, and power are easy to see because they all sit inside the computer room. As models become infrastructure, bottlenecks surface one after another: when one computer room is not enough, it becomes a campus; when one campus is not enough, it becomes multiple data centers; capacity, power, heat dissipation, land, latency, and data compliance all become hard constraints. At that point the connection itself is no longer a supporting facility but the base layer that determines whether computing power can be organized at all.

Cloud 1.0: DCI as a pipe

In the Cloud 1.0 era, data center interconnection did a relatively simple job: connecting a few large data centers, disaster recovery centers, CDN nodes, and enterprise access points.

Users accessed web pages, video, and SaaS, so traffic was mainly downstream, from cloud to user. Computing was concentrated in a few super hubs, such as the large data center clusters in Northern Virginia, Oregon, Silicon Valley, and Dallas. The topology was close to hub-and-spoke: strong central nodes, edge nodes serving users, and synchronization, backup, and disaster recovery between data centers. Most workloads, however, did not need multiple data centers to act as one real-time computing system.

DCI growth then looked more like:

DCI Traffic ≈ Users × Average Consumption × Replication

With more users, higher-definition video, and more widely used SaaS, traffic grew naturally, but the logic stayed roughly linear. More users usually meant more egress bandwidth, more cache, and more regional replicas, not a rewrite of the network hierarchy every time a batch of nodes was added.

Don’t use simple analogies with Internet traffic

The most natural analogy is, of course, the Internet. The Internet also went through traffic explosions: web pages, pictures, video, live streaming, and short video. Each round of application upgrades pushed up bandwidth demand.

But there is a hard difference between token traffic and Internet traffic: Internet traffic ends mainly in human attention, while tokens increasingly end in machine loops. People have limited time to watch videos, browse feeds, and type questions; an agent can read documents, search, call tools, write code, run tests, and fix errors, and then keep going.

Internet traffic is closer to content distribution. The same video can be cached, and the same picture can be reused by a CDN. Popular content generates a lot of traffic, but it is relatively predictable. Tokens are closer to a computation process. If a user changes how they write prompts, turns on thinking, pastes in a long document, or turns a chat into a multi-step agent, the computation behind the request can immediately grow several times, ten times, or more.

The bigger problem is not an "AI price increase" but a sudden jump in token usage. A supplier price increase at least comes with a contract period and advance notice; a surge in token usage usually comes from changes in user behavior: employees learn a new prompting technique, start using an agent, put a 100-page PDF into context, or raise the search depth from a few items to dozens.

Figure: A token surge is not a single switch, but several usage changes stacked together.

• Agentization: from a single round of chat to multi-step execution and self-correction.

• Thinking: the inference process is longer, so the computation per request increases.

• Long documents: context grows from a few paragraphs of text to an entire PDF or repo.

• Deeper RAG: more chunks are retrieved, and recall and reranking become more complex.

• Free entry point: once users come to rely on the assistant, usage frequency opens up again.

These multiples are not intended as precise forecasts. The trouble is in the stacking: each change is manageable on its own, but stacked together they distort budget, network, and computing power planning.

Behind one request is a chain of work

Once models enter the picture, the arithmetic changes.

On the surface, the user just asked a question. Behind the scenes, that question may have turned into a whole chain of work: model routing, RAG retrieval, tool invocation, code execution, result verification, multi-agent collaboration, context compression, KV cache management, and even multiple models calling each other.

In the past, tokens were produced for people to read. People read slowly and type slowly, so consumption had a natural speed limit. Now tokens are increasingly consumed by agents. An agent can keep reading files, writing code, running tests, fixing errors, rereading context, and then executing the next step. The ceiling on token consumption is no longer human speed but the machine loop.

Network pressure has also changed form: from "users accessing services" to "continuous collaboration between machines". The original formula is no longer sufficient:

DCI Pressure ≈ Users × Models × Agents × Context/Memory × Coordination × Geography

Growth in the number of users is the first layer. Calling multiple models behind each user is the second. Splitting a task across multiple agents is the third. Long context, KV cache, external memory, and parallel storage are the fourth. Training, inference, model synchronization, checkpoint transfer, cache migration, and cross-region disaster recovery are the fifth. Add geographic distribution on top, and because these factors multiply rather than add, DCI pressure is no longer a linear curve.
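To make the compounding concrete, here is a minimal Python sketch of the relation above. The function name `dci_pressure`, the factor keys, and the 2x multipliers are illustrative assumptions, not figures from this article; the only thing taken from the text is that the six factors multiply.

```python
from math import prod

# The six factors of the DCI Pressure relation (key names are illustrative).
FACTORS = ["users", "models", "agents", "context_memory", "coordination", "geography"]


def dci_pressure(multipliers: dict[str, float]) -> float:
    """Relative DCI pressure versus a baseline of 1.0.

    Each entry is a growth multiplier for one factor; factors that are
    not listed are assumed unchanged (multiplier 1.0).
    """
    return prod(multipliers.get(name, 1.0) for name in FACTORS)


# Hypothetical scenarios: only users double, versus every factor doubling.
only_users = dci_pressure({"users": 2.0})
all_factors = dci_pressure({name: 2.0 for name in FACTORS})

print(f"users double:        {only_users:.0f}x")
print(f"all six double:      {all_factors:.0f}x")
```

In this sketch, doubling users alone doubles pressure, while doubling all six factors gives 2^6 = 64 times the baseline. That gap is the difference between the Cloud 1.0 curve and the current one.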

Layer 1: More nodes

This round of data center buildout will not end with just a few very large centers.

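Training and inference must be separated first. Training is better suited to centralization because it does not care how close it is to end users. Training does, of course, care a great deal about low latency and high bandwidth between GPUs inside the cluster, but it is not an online service, and there is no need to deploy the model in every city to wait for user requests. As long as power, cooling, land, networking, and operations can support it, training will naturally concentrate in a small number of very large clusters.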

Inference is different. Inference is an online service, and latency, time to first token, data residency, enterprise private network access, and regional availability all affect the experience. The further workloads move toward agentic AI, the longer the requests, the more continuous the interactions, the more tool calls, and the closer inference needs to sit to users and enterprise data. The number of inference nodes will therefore be significantly larger than the number of training nodes, and it is the inference side that drives up DCI node complexity.

Training: Centralized

Distance from the user does not matter. What matters is GPU density, power, cooling, low-latency cluster networking, and stable job completion.

Inference: Edge/Regional

Inference sits closer to users and enterprise data, which reduces latency, improves availability, and makes data compliance and private network access easier to satisfy.

So the network gains more levels:

• Edge inference nodes near the user;

• City-level or regional inference hubs;

• Large AI regions;

• Very large training centers;

• Scale-across interconnects between multiple data centers.

With many nodes, the problem is not just a matter of pulling a few more lines. If N nodes are fully interconnected, the number of links is N(N-1)/2, which grows quadratically. The real world will not be fully meshed because the cost would be too high. Cloud vendors use hierarchical topologies to reduce complexity: edge to regional hub, regional hub to hyperscale hub, and then horizontal links between AI hubs.

But layering does not make demand disappear; it only reorganizes it. When the number of nodes crosses a certain threshold, the network does not expand smoothly. It gains an extra layer.

This is the topology jump: not a few more lines, but an extra layer of network.
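A rough sketch of both effects, assuming a simple tree in which every hub can serve at most `fan_out` children. That limit, and the function names, are hypothetical choices for illustration; real topologies add redundancy and horizontal links that this ignores.

```python
def full_mesh_links(n_nodes: int) -> int:
    """Links needed to connect every node directly to every other node."""
    return n_nodes * (n_nodes - 1) // 2


def tiers_needed(n_nodes: int, fan_out: int) -> int:
    """Aggregation tiers needed above n_nodes sites when each hub
    serves at most fan_out children (assumed limit)."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    tiers = 0
    remaining = n_nodes
    while remaining > 1:
        remaining = -(-remaining // fan_out)  # ceiling division
        tiers += 1
    return tiers


def hierarchical_links(n_nodes: int, fan_out: int) -> int:
    """Uplinks in the same tree: one uplink per node at every tier below the root."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    links = 0
    remaining = n_nodes
    while remaining > 1:
        links += remaining
        remaining = -(-remaining // fan_out)
    return links


for n in (8, 9, 64, 65):
    print(n, full_mesh_links(n), hierarchical_links(n, 8), tiers_needed(n, 8))
```

With a fan-out of 8, 64 sites fit under two aggregation tiers with 72 uplinks, against 2,016 links for a full mesh. The 65th site forces a third tier, which is the jump described above: the link count barely moves, but the network gains a layer.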

Layer 2: Each connection becomes fatter

The number of nodes is one issue; the traffic on each connection is another.

In the traditional Internet, one user request usually corresponds to one content access. In AI, one user request may correspond to multiple rounds of model computation. Especially for workloads such as reasoning, MoE, and multi-agent inference, the network does not just send results back to the user; it also continuously moves intermediate state between models, caches, storage, tools, and different computing resources.

The market used to say that training needs the network most, while inference mainly needs local computing power. That claim is breaking down. If inference is just a single round of question and answer, it is indeed closer to local computing; but once inference involves long context, multiple agents, external memory, and tool calls, communication becomes a core variable.

A more accurate statement: training eats the network, and complex inference is starting to eat it too. They do not stress exactly the same kind of network, but both push optical interconnect, switching, routing, cache, and storage systems toward higher bandwidth, lower tail latency, and more stable job completion.

That is why 400G, 800G, 1.6T, coherent optics, CPO (co-packaged optics), and OCS (optical circuit switching) all show up together in this round of infrastructure discussion. They are not isolated technical terms but different outlets for the same pressure: data needs to cross a larger system in less time.

Layer 3: Topology will jump

Strictly speaking, DCI refers to interconnection between data centers; leaf-spine and super-spine describe network architecture inside a data center or campus. This distinction needs to be clear, or it becomes easy to mix up segments of the supply chain.

But AI pressure flows from the internal network all the way out to DCI.

The first step is scale-up and scale-out within a single cluster. As GPU counts and east-west traffic grow, leaf-spine fabrics need higher radix, higher port speeds, and stronger congestion control.

The second step is interconnection across a campus and between nearby buildings. Multiple computer rooms in a campus need to be scheduled as one resource pool, and optical connections extend from racks, rows, and halls to whole buildings.

The third step is cross-data-center DCI. Power, land, cooling, supply chain, and regional capacity all limit single-site expansion, and an AI factory cannot be crammed into one building forever. Multiple data centers need to be connected into a larger computing system, and the network starts to carry the burden of distance, latency, jitter, congestion control, and predictable throughput.

NVIDIA's Spectrum-XGS, introduced in 2025, calls this direction scale-across: in addition to scale-up and scale-out, data centers in different locations are connected into a larger AI factory. The change behind this is huge: the network problem has extended from "how to connect the GPUs in a computer room" to "whether multiple data centers can jointly complete a task."

At this point, DCI is not just carrying a little more traffic each year; the role of the network itself has changed.

Let’s look at the total amount first

The most important signal in this round of DCI is not any single news item but the budget pool itself. AI capex has become a very large pool of physical capital expenditure.

GPUs are the most visible expenditure; power, liquid cooling, and computer rooms are the second layer. Networking and optical interconnect are usually noticed by the market later. Once computing power outgrows a single computer room, DCI changes from a "good enough" connection into the base layer that decides whether the computing power can be organized.

If the share of DCI in AI capex rises from 1% to more than 2%, the absolute amount becomes very large once multiplied by today's AI capex pool. This is why we care more about the aggregate slope than about single-point noise in any one quarter.

This could be a 10x opportunity

As nodes multiply, each connection gets fatter, and the topology jumps, everything eventually lands in the budget pool.

Under our current rough framework, total market revenue for DCI systems goes from approximately US$5 billion in 2025 to US$11.7 billion in 2026, US$21.5 billion in 2027, US$33.9 billion in 2028, US$48.7 billion in 2029, and US$82.4 billion in 2038. 2025 to 2029 is the steepest stretch, followed by a longer period of compound growth.

The share of AI capex matters more. DCI is only about 1% of AI capex in 2025-2026, rising to close to 1.8% in 2028, 2.2% in 2029, and 3.0% in 2038. The ratio rises by only about two percentage points, but the underlying AI capex pool is itself growing. Those two points are a budget migration from billions of dollars to tens of billions of dollars.

Note: growth rates for 2026-2029 are year over year; the 2038 figure is the 2029-2038 CAGR.
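The growth rates implied by the revenue figures above can be recomputed directly. This is a sketch that uses only the numbers quoted in the text; the helper names `yoy` and `cagr` are illustrative.

```python
# DCI system revenue from the text, in US$ billions.
revenue_busd = {2025: 5.0, 2026: 11.7, 2027: 21.5, 2028: 33.9, 2029: 48.7, 2038: 82.4}


def yoy(series: dict[int, float], year: int) -> float:
    """Year-over-year growth rate into `year`, as a fraction."""
    return series[year] / series[year - 1] - 1.0


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate over `years` years, as a fraction."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


yoy_growth = {year: yoy(revenue_busd, year) for year in range(2026, 2030)}
long_run_cagr = cagr(revenue_busd[2029], revenue_busd[2038], 2038 - 2029)

for year, rate in yoy_growth.items():
    print(f"{year} YoY: {rate:.1%}")
print(f"2029-2038 CAGR: {long_run_cagr:.1%}")
```

On these figures, year-over-year growth slows from roughly 134% in 2026 to about 84%, 58%, and 44% in 2027 through 2029, and the 2029-2038 CAGR is about 6%. That matches the shape described in the text: a steep ramp followed by a long, slower compounding phase.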

This brings us back to the title: DCI is no longer an accessory; it could be a nearly 10x opportunity within AI capex, going from US$5 billion in 2025 to US$48.7 billion in 2029, or from 1.0% to 2.2% of AI capex. That magnitude is large enough on its own, without needing to break it down to a single company to prove its importance.
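The near-10x can be split into two parts: growth in DCI's share of AI capex and growth in the capex pool itself. The sketch below backs out the implied pool from the revenue and share figures quoted in the text. Because the shares are rounded, the implied capex values are approximations derived here, not numbers stated in this article.

```python
# Figures quoted in the text: revenue in US$ billions, share as a fraction.
dci_revenue_busd = {2025: 5.0, 2029: 48.7}
dci_share_of_capex = {2025: 0.010, 2029: 0.022}

# Implied AI capex pool = DCI revenue / DCI share (derived, not quoted).
implied_capex_busd = {
    year: dci_revenue_busd[year] / dci_share_of_capex[year]
    for year in dci_revenue_busd
}

revenue_multiple = dci_revenue_busd[2029] / dci_revenue_busd[2025]
share_multiple = dci_share_of_capex[2029] / dci_share_of_capex[2025]
pool_multiple = implied_capex_busd[2029] / implied_capex_busd[2025]

print(f"revenue multiple: {revenue_multiple:.2f}x")
print(f"  from share:     {share_multiple:.2f}x")
print(f"  from pool:      {pool_multiple:.2f}x")
```

On these inputs the 9.7x revenue multiple breaks down into about 2.2x from share and about 4.4x from the implied pool, which grows from roughly US$500 billion to roughly US$2.2 trillion. The share shift alone is modest; it is the product of the two that approaches 10x.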

This is not a bet on a price increase for one device, but on networking and optical interconnect gaining weight in the AI capex structure. GPUs were the first wave, power and cooling the second, and networking and optical interconnect are becoming the next wave of constraints. If this budget migration holds, it is not only revenue that gets revalued later, but also ASP, production capacity, and profit margins.

Don’t underestimate this round of demand

There is no shortage of DCI demand in this round. The key question is which layer takes off first.

The first wave will most likely not be intercontinental long-haul backbones, but short-distance, high-capacity connections at the campus, metro, and regional level. The reason is simple: cloud vendors will do their best to place computing, data, and cache in the right locations to cut pointless cross-region transfers. This is not a bearish view on DCI. On the contrary, incremental DCI demand will grow first in the places closest to the AI factory.

Don’t use “efficiency improvement” to erase demand. In the Internet era, the cost per bit kept declining. The result was not that bandwidth demand disappeared, but that video, live streaming, and short video kept adding traffic. The same goes for AI. The cheaper models and networks get, the more tasks get handed to agents. Efficiency improvements reduce the cost per call, but they also unlock calls that would otherwise never have happened.

What deserves attention is not "whether the price of a certain optical module will rise today", but whether the underlying slopes keep steepening:

• Whether the intensity of complex reasoning and agent token usage continues to rise;

• Whether training and inference increasingly require cross-cluster, cross-campus, and cross-region collaboration;

• Whether cloud vendor capex continues to spread from GPUs to networking, storage, power and optical interconnects;

• Whether the rollout pace of 800G, 1.6T, coherent, CPO, and OCS is confirmed by real orders;

• Whether long lead times and shortages of key components are a short-term disturbance or a new round of supply bottlenecks.

Finally, look at the boundary

Don't think of DCI as an ordinary traffic curve.

The underlying change is that computing is becoming a distributed system. Models keep getting larger, tasks longer, agents more numerous, and inference more complex, while data centers are increasingly constrained by power and physical space. For these distributed resources to work as one system, the network must move from a back-end pipe to a front-line capability.

The next thing to watch is not whether DCI will grow, but when computing power crosses the physical boundary of a single data center. Once it does, the new demand is not a few more connections but an entire layer of network.

This is the most underestimated part of this round of DCI.

BR Partners
