Advantages of networks-on-chip over traditional architectures

Paul Boughton

Several trends have forced evolutions of systems architectures, in turn driving evolutions of required system busses. Application convergence means various traffic types such as video, communication and computing in the same SoC design must now share resources that were assumed to be 'private' and handcrafted to the particular traffic in previous designs.

Moore's law is also driving the integration of many IP Blocks in a single chip. This is an enabler to application convergence, but also allows entirely new approaches (parallel processing on a chip using many small processors) or simply allows SoCs to process more data streams (such as communication channels).
Silicon process evolutions between generations now mean that gates cost relatively less than wires, in both area and performance, than they did a few years ago. At the same time, time-to-market pressures are driving most designs to make heavy use of synthesisable RTL rather than manual layout, which in turn restricts the implementation solutions available for fitting a bus architecture into a design flow.
These trends have driven the evolution of many new bus architectures. These include the introduction of split and retry techniques; removal of tri-state buffers and multi-phase clocks; pipelining; and various attempts to define standard communication sockets.
However, history has shown that there are conflicting tradeoffs between compatibility requirements, driven by IP block reuse strategies, and the introduction of the necessary bus evolutions driven by technology changes. In many cases, introducing new features has required many changes in the bus implementation, but more importantly in the bus interfaces (for example, the evolution from AMBA ASB to AHB2.0, then AMBA AHB-Lite, then AMBA AXI), with major impacts on IP reusability and new IP design.
Busses do not decouple the activities generally classified as transaction, transport and physical layer behaviours. This is the key reason they cannot adapt to changes in the system architecture or take advantage of the rapid advances in silicon process technology.
Consequently, changes to bus physical implementation can have serious ripple effects upon the implementations of higher-level bus behaviour. Replacing tri-state techniques with multiplexers has had little effect upon the transaction levels. Conversely, the introduction of flexible pipelining to ease timing closure has massive effects on all bus architectures up through the transaction level.
Similarly, system architecture changes may require new transaction types or transaction characteristics. Recently, new transaction types such as exclusive accesses have been introduced near simultaneously within the OCP2.0 and AMBA AXI socket standards. Out-of-order response capability is another example. Unfortunately, such evolutions typically impact the intended bus architectures down to the physical layer, if only by the addition of new wires or op-codes. Thus, the bus implementation must be redesigned. As a consequence, bus architectures cannot closely follow process evolution, nor system architecture evolution. The bus architects must always make compromises between the various driving forces, and resist change as much as possible.
In the data communications space, LANs and WANs have successfully dealt with similar problems by employing a layered architecture. By relying on the OSI model, upper and lower layer protocols have independently evolved in response to advancing transmission technology and transaction level services.
The decoupling of communication layers using the OSI model has successfully driven commercial network architectures, and enabled networks to follow very closely both physical layer evolutions (from the Ethernet multi-master coaxial cable to twisted pairs, ADSL, fibre optics, wireless) and transaction level evolutions (TCP, UDP, streaming voice/video data). This has produced incredible flexibility at the application level (web browsing, peer-to-peer, secure web commerce, instant messaging, etc), while maintaining upward compatibility (old-style 10Mb/s or even 1Mb/s Ethernet devices are still commonly connected to LANs).
Following the same trends, networks have started to replace busses in much smaller systems: PCI-Express is a network-on-a-board, replacing the PCI board-level bus.
Replacement of SoC busses by NoCs will follow the same path, when the economics prove that the NoC either: reduces SoC manufacturing cost; increases SoC performance; reduces SoC time to market and/or NRE; reduces SoC time to volume; reduces SoC design risk.
In each case, if all other criteria are equal or better, the NoC will replace SoC busses.

NoC architecture

The Network-on-Chip (NoC) architecture developed by Arteris employs system-level network techniques to solve on-chip traffic transport and management challenges. Synchronous bus limitations lead to system segmentation and tiered or layered bus architectures (Fig.1). With the Arteris approach in Fig. 1, the NoC is a homogeneous, scalable switch fabric network. This switch fabric forms the core of the NoC technology and transports multi-purpose data packets within complex, IP-laden SoCs. Key characteristics of this architecture are: a layered and scalable architecture; a flexible, user-defined network topology; and point-to-point connections with a Globally Asynchronous Locally Synchronous (GALS) implementation that decouples the IP blocks.

NoC layers

IP blocks communicate over the NoC using a three-layered communication scheme (Fig.3), referred to as the Transaction, Transport, and Physical layers. The Transaction layer defines the communication primitives available to interconnected IP blocks. Special NoC Interface Units (NIUs), located at the NoC periphery, provide transaction-layer services to the IP blocks with which they are paired. This is analogous, in data communications networks, to Network Interface Cards that source/sink information to the LAN/WAN media. The Transaction layer also defines how information is exchanged between NIUs to implement a particular transaction.
For compatibility with existing bus protocols, the NoC implements traditional address-based Load/Store transactions, with their usual variants including incrementing, streaming, wrapping bursts, and so forth. It also implements special transactions that allow sideband communication between IP blocks.
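To make the burst variants concrete, here is a minimal Python sketch of the address sequence each one produces. The article does not define the exact semantics, so the conventions used here are assumptions borrowed from common bus practice: an incrementing burst steps through consecutive addresses, a streaming burst repeats a single address (as for a FIFO port), and a wrapping burst wraps at a boundary equal to the total burst size. The function name and parameters are hypothetical.

```python
from typing import List


def burst_addresses(start: int, beats: int, beat_bytes: int, kind: str) -> List[int]:
    """Return the byte address of every beat in a burst (assumed semantics)."""
    if beats <= 0 or beat_bytes <= 0:
        raise ValueError("beats and beat_bytes must be positive")
    if kind == "incrementing":
        # Consecutive addresses, one beat apart.
        return [start + i * beat_bytes for i in range(beats)]
    if kind == "streaming":
        # Every beat targets the same address, e.g. a FIFO port.
        return [start] * beats
    if kind == "wrapping":
        # Addresses wrap within an aligned window of beats * beat_bytes.
        span = beats * beat_bytes
        base = (start // span) * span
        return [base + (start - base + i * beat_bytes) % span for i in range(beats)]
    raise ValueError(f"unknown burst kind: {kind}")


if __name__ == "__main__":
    for kind in ("incrementing", "streaming", "wrapping"):
        addrs = burst_addresses(0x108, 4, 4, kind)
        print(kind, [hex(a) for a in addrs])
```

In a layered scheme, only the NIU would need to understand these variants; the switch fabric simply carries the resulting packets.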
The Transport layer defines rules that apply as packets are routed through the switch fabric. Only a small part of the information contained within the packet, typically held in its first cell (the header cell), is needed to actually transport the packet.
The packet format is very flexible and easily accommodates changes at transaction level without impacting transport level.
A single NoC typically uses a fixed packet format that matches the complete set of application requirements. However, multiple NoCs using different packet formats can be bridged together using translation units.
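The separation between transport and transaction information can be sketched in a few lines of Python. This is an illustrative model only, not the Arteris packet format: the `HeaderCell`, `Packet` and `Switch` names, the `dest` field and the routing table are all hypothetical. The point it shows is that the switch reads the header cell alone, so the payload cells can change with the transaction layer without touching the transport logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HeaderCell:
    dest: int  # hypothetical identifier of the destination NIU


@dataclass
class Packet:
    header: HeaderCell
    payload: List[bytes]  # transaction-layer cells, opaque to the transport layer


class Switch:
    """Toy switch that picks an output port from the header cell only."""

    def __init__(self, routes: Dict[int, int]) -> None:
        self.routes = routes  # destination NIU id -> output port number

    def output_port(self, packet: Packet) -> int:
        # The payload is never inspected here.
        return self.routes[packet.header.dest]


if __name__ == "__main__":
    switch = Switch({0: 1, 5: 2})
    load = Packet(HeaderCell(dest=5), [b"LOAD", b"\x00\x10"])
    sideband = Packet(HeaderCell(dest=5), [b"SIDEBAND", b"irq", b"extra"])
    # Different transaction types, same routing decision.
    print(switch.output_port(load), switch.output_port(sideband))
```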
The Transport layer may also be optimised to application needs. For example, wormhole packet handling decreases latency and storage but might lead to lower system performance when crossing local throughput boundaries, while store-and-forward handling has the opposite characteristics.
The Arteris architecture allows optimisations to be made locally. Wormhole routing is typically used within synchronous domains in order to minimise latency, but some amount of store-and-forward is used when crossing clock domains.
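The latency side of this trade-off can be illustrated with a simplified textbook model. It is a sketch under stated assumptions, not a description of the Arteris implementation: one cell crosses one link per clock cycle, there is no contention, and switches add no extra delay. The function names are hypothetical.

```python
def store_and_forward_cycles(hops: int, cells: int) -> int:
    """Each switch receives the whole packet before forwarding it."""
    if hops < 1 or cells < 1:
        raise ValueError("hops and cells must be at least 1")
    return hops * cells


def wormhole_cycles(hops: int, cells: int) -> int:
    """The header cell advances immediately; the remaining cells follow in a pipeline."""
    if hops < 1 or cells < 1:
        raise ValueError("hops and cells must be at least 1")
    return hops + (cells - 1)


if __name__ == "__main__":
    hops, cells = 3, 4  # illustrative values only
    print("store-and-forward:", store_and_forward_cycles(hops, cells), "cycles")
    print("wormhole:", wormhole_cycles(hops, cells), "cycles")
```

Under the same assumptions, a store-and-forward switch must buffer a complete packet, whereas a wormhole switch can buffer as little as one cell. The cost is that a stalled wormhole packet keeps holding the links it already occupies, which is one way system performance can drop at throughput boundaries.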
The Physical layer defines how packets are physically transmitted over an interface, much as Ethernet defines 10Mb/s, 1Gb/s, etc, physical interfaces. Protocol layering allows multiple physical interface types to coexist without compromising the upper layers. Thus, NoC links between switches can be optimised with respect to bandwidth, cost, data integrity, and even off-chip capabilities, without impacting the transport and transaction layers. Arteris has also defined a special physical interface that allows independent hardening of physical cores, and then connection of those cores together, regardless of each core's clock speed and of the physical distance within the cores (within reasonable limits guaranteeing signal integrity). This enables true hierarchical physical design practices (Fig. 3).
The first implementation of NoC uses a GALS (globally asynchronous, locally synchronous) approach. NoC units are implemented in traditional synchronous design style (a unit being for example a switch or an NIU), and sets of units can either share a common clock or have independent clocks. In the latter case, special links between clock domains provide clock resynchronisation at the physical layer, without impacting transport or transaction layers. This approach enables the NoC to span an SoC containing many IP Blocks or groups of blocks with completely independent clock domains, reducing the timing convergence constraints during back-end physical design steps.
This layering fits naturally into a divide-and-conquer design and verification strategy. For example, major portions of the verification effort need only concern themselves with transport level rules, since most switch fabric behaviour may be verified independently of transaction states. Complex, state-rich verification problems are simplified to the verification of single NIUs; the layered protocol ensures interoperability between the NIUs and transport units.
In spite of the obvious advantages, a layered strategy for on-chip communication must not model itself too closely on data communications networks. In data communication networks the transport medium (eg, optical fibre) is much more costly than the transmitter and receiver hardware and often employs 'wave pipelining' (ie, multiple symbols on the same wire in the case of fibre optics or controlled impedance wires). Inside the SoC the relative cost and performance of wires and gates is different and wave pipelining is too difficult to control. As a consequence, NoCs will not for the foreseeable future serialise data over single wires, but will find an optimal trade-off between clock rate (100MHz to 1GHz) and number of data wires (16, 32, 64...) for a given throughput.
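The clock rate versus wire count trade-off is simple arithmetic, sketched below in Python. The model assumes each data wire carries one bit per clock cycle (no wave pipelining, no protocol overhead), and the 6.4Gb/s target is a hypothetical figure chosen only so that the results land on the widths mentioned above. The function name is also hypothetical.

```python
def data_wires_needed(throughput_bps: int, clock_hz: int) -> int:
    """Minimum number of data wires, assuming one bit per wire per clock cycle."""
    if throughput_bps <= 0 or clock_hz <= 0:
        raise ValueError("throughput and clock rate must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-throughput_bps // clock_hz)


if __name__ == "__main__":
    target_bps = 6_400_000_000  # hypothetical link throughput target
    for clock_mhz in (100, 200, 400):
        wires = data_wires_needed(target_bps, clock_mhz * 1_000_000)
        print(f"{clock_mhz}MHz -> {wires} data wires")
```

Doubling the clock rate halves the wire count for the same throughput, which is why the best operating point depends on the relative cost of wires and gates in a given process.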
The Table shows the results of a simulation of an SoC with 72 IP blocks: 36 masters and 36 slaves (the ratio between slaves and masters does not really matter, but the slaves usually define the upper limit of system throughput). The total number of IP blocks implies a hierarchical interconnect scheme; we assume that the IP blocks are divided into 9 clusters of 8 IP blocks each.
Within each cluster, IP blocks are locally connected using a local bus or a switch, and the local busses or switches are themselves connected together at the SoC level. The results show that, for designs of the complexity level used in the comparison, the NoC approach has a clear advantage over traditional busses for nearly all criteria, most notably system throughput.
Hierarchies of crossbars or multilayered busses have characteristics somewhere in between traditional busses and the NoC; however, they fall far short of the NoC with respect to performance and complexity.
Detailed comparison results depend on the SoC application, but with increasing SoC complexity and performance, the NoC is the best IP block integration solution for high-end SoC designs.

Alain Fanet is CEO and one of three founders of Arteris, Paris, France. www.arteris.com