Abstract
Data-driven applications are becoming increasingly important, fueled by the rapid rise of the Internet of Things (IoT) and Artificial Intelligence (AI). Systems must now be able to store, process and act swiftly on increasingly large amounts of data, while consuming the minimum possible power. This shifts the focus to system-level integration and optimization – especially as Moore's Law slows down and technology development at 5nm and beyond becomes harder and more expensive. SEMI has built a cross-supply-chain collaborative platform specifically to enable early assessment of trade-offs and future technologies (5–8 years out). The first project focused on interconnect strategies, which are critical to most computing systems. We examined the performance limits of the best available options for on-chip interconnects at technology nodes <= 20 nm. These limits highlight the need for system-level strategies, which we studied by comparing a two-dimensional (2D) system with an interposer-based (2.5D) system to quantify the impact of the latter on the energy-delay product for various applications, especially data-driven ones.
I. Introduction
The microelectronics industry is facing inflection points in both technology and the market. Technology challenges are becoming more difficult and expensive as Moore's Law slows, and there are growing requirements for integrating multiple functions involving processors, memory, sensors and other devices in systems. In parallel, applications driven by the Internet of Things (IoT) and Artificial Intelligence (AI) require systems to handle ever-increasing amounts of data. This calls for increased collaboration to reduce cost and improve innovation efficiency by breaking down traditional silos, particularly for system-level optimization. However, some important collaborative industry mechanisms, such as the International Technology Roadmap for Semiconductors (ITRS), have gone away, creating a gap in the global microelectronics ecosystem. SEMI (www.semi.org) has built a collaborative, cross-supply-chain platform to bridge this gap and provide technology stewardship: identifying challenges early and addressing them efficiently at the industry level where possible. As systems become more complex, the ability to optimize interconnects becomes at least as important as the ability to optimize individual components, if not more so. In this paper, we first examine the limitations of on-chip interconnects, and then propose a framework for system-level optimization.
II. Limitations of On-Chip Interconnects
A. Wire Resistivity Limits
The resistivity of Cu, the industry workhorse for on-chip interconnects, increases as we continue scaling wire-widths to narrower dimensions. There are two reasons for this: (1) the carrier mean free path becomes longer than the narrow wire-width, and (2) carrier scattering from surfaces and grain boundaries increases as the wire-width and volume decrease. Additionally, to prevent migration of Cu atoms into the dielectric, most technologies employ a barrier layer made of Ta/TaN or other materials. This barrier does not scale with the wire-width, and makes the resistivity increase even steeper [1].
Fig. 1 shows the rapid increase in resistivity with narrowing wire-widths: copper resistivity almost quadruples from the 28/22nm node to the 5nm node [1,2,3]. There is ongoing research into alternative materials such as Ru and Co [4,5,6], but the absolute resistivity of narrower lines increases dramatically for all the materials being studied. In 2017, 80% of wafer production was at technology nodes >= 22 nm [7], where the narrowest wire-width is 40–45 nm. Using this as a baseline, the equivalent baseline resistivity is about 3 micro-ohm-cm (Fig. 2). At the 10/7 nm node, with wire-widths of about 20 nm, the resistivity triples over the baseline to about 10 micro-ohm-cm [8]. At the 5 nm node, with wire-widths of about 10 nm, the resistivity increases almost 7-fold over the baseline to 20 micro-ohm-cm [9,10,11].
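To illustrate how the non-scaling barrier compounds the size effect, the sketch below combines a simple scattering multiplier on bulk Cu with a fixed-thickness barrier that consumes a growing fraction of the wire cross-section. The barrier thickness, wire aspect ratio and scattering multipliers are illustrative assumptions chosen so that the totals land near the values quoted above; they are not the measurements behind Fig. 1.

```python
# Illustrative sketch of the size effect plus a non-scaling barrier.
# Barrier thickness, aspect ratio and scattering multipliers are assumptions
# chosen to land near the values quoted in the text, not fitted measurements.

RHO_CU_BULK = 1.7   # micro-ohm-cm, bulk copper resistivity
BARRIER_T = 2.0     # nm, assumed Ta/TaN barrier thickness (fixed, non-scaling)

def effective_resistivity(width_nm, height_nm, scattering_multiplier):
    """Resistivity referred to the full drawn cross-section of the wire.

    The barrier carries negligible current but still occupies area, so the
    conducting Cu core is smaller than the drawn wire; surface and grain-
    boundary scattering is folded into one multiplier on bulk Cu resistivity.
    """
    drawn_area = width_nm * height_nm
    cu_area = (width_nm - 2 * BARRIER_T) * (height_nm - BARRIER_T)
    rho_cu = RHO_CU_BULK * scattering_multiplier
    return rho_cu * drawn_area / cu_area   # micro-ohm-cm

# (node label, wire width nm, wire height nm, assumed scattering multiplier)
for node, w, h, k in [("28/22 nm node, 40 nm wires", 40, 80, 1.55),
                      ("10/7 nm node, 20 nm wires", 20, 40, 4.5),
                      ("5 nm node, 10 nm wires", 10, 20, 6.3)]:
    print(f"{node}: ~{effective_resistivity(w, h, k):.0f} micro-ohm-cm")
```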
B. Circuit Resistance Limits
Fig. 2 shows the increase in circuit delay with increasing wire length, for a fixed via resistance of 50 ohms. As the wire resistivity gets higher, the delay increases rapidly with wire length. For example, the circuit delay – for a typical wire length of 15 micrometers with a wire resistivity of 10 micro-ohm-cm at the 10/7 nm node – is twice as large as the delay for a baseline wire with resistivity of 3 micro-ohm-cm. At the 5nm node and beyond, where wire resistivity will be >= 20 micro-ohm-cm, the delay for a 15 micrometer long wire is >= 3 times that of the baseline.
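The trend in Fig. 2 can be reproduced qualitatively with a simple Elmore-delay estimate, sketched below. The driver resistance, load capacitance, wire capacitance per unit length and wire cross-section are illustrative assumptions, not the parameters used to generate the figure; only the 50-ohm via resistance, the 15 micrometer wire length and the resistivity values come from the text.

```python
# Back-of-the-envelope Elmore-delay estimate in the spirit of Fig. 2.
# Driver resistance, load capacitance, wire capacitance per unit length and
# the wire cross-section are illustrative assumptions; only the 50-ohm via
# resistance and the resistivity values come from the text.

R_VIA = 50.0         # ohm, fixed via resistance
R_DRIVER = 500.0     # ohm, assumed driver output resistance
C_LOAD = 1e-15       # F, assumed receiver load capacitance
C_PER_UM = 0.2e-15   # F/um, assumed wire capacitance per unit length
AREA_UM2 = 0.020 * 0.040   # um^2, assumed 20 nm x 40 nm wire cross-section

def wire_delay_s(length_um, resistivity_uohm_cm):
    """Elmore delay of driver -> via -> distributed wire -> load."""
    rho_ohm_um = resistivity_uohm_cm * 1e-6 * 1e4   # micro-ohm-cm -> ohm-um
    r_wire = rho_ohm_um * length_um / AREA_UM2      # total wire resistance
    c_wire = C_PER_UM * length_um                   # total wire capacitance
    return ((R_DRIVER + R_VIA) * (c_wire + C_LOAD)
            + 0.5 * r_wire * c_wire                 # distributed-wire term
            + r_wire * C_LOAD)

L = 15.0  # um, the "typical" wire length discussed above
baseline = wire_delay_s(L, 3.0)
print(f"10 micro-ohm-cm vs baseline: {wire_delay_s(L, 10.0) / baseline:.1f}x")
print(f"20 micro-ohm-cm vs baseline: {wire_delay_s(L, 20.0) / baseline:.1f}x")
```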
In addition to performance degradation, the complexity of metal process technology for on-chip interconnects has increased the back-end-of-line (BEOL) cost to the point where it now accounts for >= 50% of total cost [12].
III. System-Level Strategies
Section II highlights the limits of front-end wires and circuits alone in achieving the performance required by future data-driven systems, particularly IoT and AI-based ones that may require up to 40 GB of data to be processed and exchanged between processor and memory chips. The energy required to move data between the chips varies over two orders of magnitude, from 10 pJ/bit in traditional 2D multi-chip systems on printed circuit boards (PCBs) to 0.1 pJ/bit in monolithic system-on-chip (SoC) implementations.
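To make this 100X gap concrete, the short calculation below applies the per-bit energies quoted above to an illustrative 40 GB processor-memory exchange.

```python
# Illustrative cost of moving 40 GB between processor and memory at the
# per-bit energies quoted above.
DATA_BYTES = 40e9   # 40 GB exchanged between processor and memory chips
bits = DATA_BYTES * 8

for label, pj_per_bit in [("2D multi-chip system on a PCB", 10.0),
                          ("monolithic SoC", 0.1)]:
    joules = bits * pj_per_bit * 1e-12
    print(f"{label}: {joules:.2f} J at {pj_per_bit} pJ/bit")
```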
The complexity and cost of building monolithic semiconductor devices are rapidly increasing, stretching design cycles to several years at the latest technology nodes. At the same time, System in Package (SiP) solutions are emerging that mitigate this increase in complexity. Such SiP technologies can improve system-level performance and reduce energy compared to PCB-based technologies, addressing the 100X energy gap between PCB and SoC implementations. In addition, the SiP era will unlock new opportunities for combining heterogeneous technologies, leading to a new design paradigm. The dynamics and challenges of this new paradigm in the context of data-driven applications require system-level optimization for figures of merit such as throughput and latency.
Emerging SiP solutions span a wide range of architectural directions, from multi-chip modules (MCM) for low-cost packaging to interposer-based (2.5D) technologies for high-density connectivity. This highlights the need for a system-level strategy that includes optimization of die-to-die interconnects for trade-offs such as cost, power, throughput, and latency. In this work we focus on modeling the latest high-density interposer-based packaging solutions in order to clarify the system-level trade-offs for a number of representative applications.
A. Modeling Framework
At a high level, there is a trade-off between the density of chip-to-chip interconnects and their manufacturing cost. The highly dense interconnects of a 2.5D interposer-based system move data at higher bandwidth and with lower energy and latency, but they cost significantly more than traditional 2D PCB technologies. Solutions such as Fan-Out Wafer Level Packaging (FOWLP), which are gaining increasing prominence, fall between these two extremes in both performance and cost.
We have developed a modeling framework to assess and quantify the improvement of the 2.5D SiP scheme over traditional 2D PCB technology. We modeled the performance of these systems for a number of workloads, covering both conventional computing and emerging data-driven applications. We chose the energy-delay product (EDP), the product of the energy consumed and the execution time, as the figure of merit because it captures both performance and power, which are critical to most systems. Figs. 3 and 4 show high-level schematics of the 2D baseline and 2.5D interposer architectures that we compared with our modeling framework. The 2D baseline is common in many systems, and the 2.5D interposer-based architecture is representative of systems being used today for most AI applications.
B. Modeling Details
Working with our industry partners, we selected input parameters calibrated against current practice in the industry. We used the following input parameters for each of the 2D and 2.5D architectures:
Memory: 64 GB DRAM for off-chip memory (split into 8 high-bandwidth memory chips for the 2.5D architecture, with a DDR interface), 32 MB silicon SRAM for the L2 cache, and 32 KB silicon SRAM for the L1 data and instruction caches. [15]
Read/Write cycle time: 60 ns for off-chip memory, 11 ns for L2 cache, 2 ns for L1 data cache and 1.5 ns for L1 instruction cache. [16]
Energy expended per bit: 52 pJ for off-chip memory for the 2D architecture [17] and 8 pJ for the 2.5D architecture, 2.2 pJ for the L2 cache, 1.25 pJ for the L1 data cache and 0.31 pJ for the L1 instruction cache.
Processor: 64 in-order cores operating at 2 GHz frequency with 0.5 nJ dynamic energy and 0.3 W leakage per core.
Thus we have an apples-to-apples comparison of the processor and memory for the 2D and 2.5D systems; the main difference lies in the off-chip memory interconnect and the energy required to move data off-chip, as the sketch below illustrates.
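The energy side of this comparison can be captured in a deliberately simplified sketch using the per-bit energies listed above. The traffic volumes are hypothetical placeholders for a real workload trace, and the latency benefit of the wider interposer interface, which dominates the overall EDP gain (see Fig. 7), is captured by the full framework but not by this sketch.

```python
# Simplified view of the energy side of the 2D vs 2.5D comparison, using the
# per-bit energies listed above. The latency side (reduced processor stall
# time with the wider interposer interface) is modeled by the full framework
# but omitted here. Traffic volumes are hypothetical placeholders.

ENERGY_PJ_PER_BIT = {
    # memory level: (2D architecture, 2.5D architecture)
    "off_chip": (52.0, 8.0),
    "L2":       (2.2,  2.2),
    "L1d":      (1.25, 1.25),
    "L1i":      (0.31, 0.31),
}

# Hypothetical bits moved at each level for a memory-bound workload.
BITS_MOVED = {"off_chip": 1e11, "L2": 5e11, "L1d": 2e12, "L1i": 2e12}

def data_movement_energy_joules(arch):
    """Total data-movement energy for architecture index 0 (2D) or 1 (2.5D)."""
    pj = sum(ENERGY_PJ_PER_BIT[level][arch] * bits
             for level, bits in BITS_MOVED.items())
    return pj * 1e-12

e_2d = data_movement_energy_joules(0)
e_25d = data_movement_energy_joules(1)
print(f"2D data-movement energy:   {e_2d:.1f} J")
print(f"2.5D data-movement energy: {e_25d:.1f} J")
print(f"ratio (2D / 2.5D): {e_2d / e_25d:.1f}x")
```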
Fig. 5 shows further details of the 2.5D interposer architecture, including our assumptions for resistance, capacitance, wire & microbump pitch, and microbump diameter.
C. Modeling Results
We applied the modeling framework to two application classes: conventional computing applications and emerging data-driven applications.
Fig. 6 shows the ratio of improvement in EDP for 2.5D over traditional 2D systems – 1X corresponds to the performance of the baseline 2D system. Conventional computing applications improve by only 10–20%, with a single application showing a 90% improvement. Data-driven applications, on the other hand, show an average improvement of almost 4X. The underlying reason is that far more data must move between processor and memory in data-driven applications, so they derive a larger benefit from the shorter data paths of the 2.5D architecture.
Fig. 7 shows the details of how time and energy are expended in executing a specific application, PageRank. 88% of the energy and 91% of the time are spent with the processor stalled and/or waiting for memory access, and the 2.5D interposer-based architecture provides its improvement by reducing this time significantly. Note that the processor performance itself is unchanged, as expected, since our model uses identical processors in both cases.
D. Sensitivity Analysis
We also performed a sensitivity analysis as shown in Table 1 to determine the effect of tweaking technology “knobs” on the system performance. Specifically, we examined the impact of doubling the number of microbumps, and changing the substrate to a hypothetical material with zero energy loss. As Table 1 shows, doubling the number of connections offers a 10% improvement in the EDP, and further, eliminating the energy loss in the substrate provides an additional 3% improvement.
E. Analysis and Insights
Several important insights emerge from our analysis.
On-chip interconnects are becoming a critical bottleneck for performance, power and cost, and despite much materials research, circuit performance will be increasingly challenged as wire-widths narrow to 10 nm at the 5nm node.
Future complex systems will require optimal integration of multiple components to provide diverse functionality and improved system-level performance.
We have demonstrated a modeling framework applied to one such heterogeneous system, combining processors and memory for data-driven applications.
Our modeling results demonstrate that 2.5D-based interposer architectures offer limited advantage (average 10–20%) over traditional 2D systems for conventional computing applications.
Our modeling results further demonstrate that 2.5D-based interposer architectures benefit data-driven applications by a factor of 4X.
For future data-driven applications that require greater performance improvements, new innovations will be required. Since memory connectivity can be made faster with new 2.5D and 3D constructions, overall system performance can keep improving at a consistent rate even though processor improvement has slowed.
IV. Future 3D Systems
Future AI and IoT applications may require a 1000X improvement in EDP. Two of our co-authors, Prof. Subhasish Mitra and Prof. Philip Wong, have been working with a team to develop revolutionary innovations that can provide improvements on this scale. Fig. 9 shows such a system using carbon-nanotube-based transistors, magnetic and resistive RAM, and ultra-dense, fine-grained vias to connect the components [18]. This system has been demonstrated to provide a 1950X improvement over a traditional PCB-based 2D system for a language-model application.
V. Conclusion
Increasing technology complexity, the requirement for functional diversity, and the emerging market drivers from AI and IoT applications are creating new opportunities and challenges across the entire microelectronics supply chain. System-level optimization will be critical to achieving the required performance while minimizing power and cost. There is a wide spectrum of technology “knobs” available to improve system-level performance, ranging from 2D PCB-based systems to 2.5D interposer-based systems and on to revolutionary 3D systems. We have demonstrated a modeling framework here and applied it to quantify the benefits of 2.5D over traditional 2D systems, particularly for data-driven applications. The key for business and technology decision-makers is to have a framework such as the one presented here to compare the cost/benefit trade-offs of various technology options across applications and determine the optimal fit.