In this first of a multi-part series on the excellent Hot Interconnects 2015 conference, held in Santa Clara, CA, August 26-28, 2015, we highlight presentations from three major IT infrastructure vendors: Oracle, Intel, and Facebook. A follow-on article will examine Huawei’s open source activities, Deep Machine Learning and its networking applications (from Brocade Research), and Software Defined WANs (which is a much broader topic than SDN for WANs).
Oracle Opening Keynote – Commercial Computing Trends and Their Impact on Interconnect Technology by Rick Hetherington:
Capability, cost, reliability, power consumption, and footprint are the key considerations for all IT equipment, but especially for cloud Data Centers (DCs).
Author’s Note: Oracle maintains they have a competitive edge in compute servers via their SPARC microprocessors, seven of which have been commercially released in the last six years. We’ve wondered if SPARC-propelled compute servers can be competitive with those built using x86 or ARM processors. Evidently, the Oracle hardware division (formerly Sun Microsystems) says YES!
The latest SPARC chip, code-named Sonoma, will be released in 2016. Focused on “cores and caches,” Sonoma is a low-cost version of the SPARC M7 processor (announced last year) and is designed for highly dense data centers. The 20nm processor offers eight fourth-generation SPARC S4 cores, with PCIe Gen 3 and InfiniBand integrated into the silicon. Connectivity was said to be optimized for “scale-out.” Sonoma’s value add was said to be in “cost and convergence, cloud computing/storage, and software on silicon acceleration,” but none of those characteristics were explained.
Oracle has evolved from a non-believing cloud company to the #2 world-wide provider of Software as a Service (SaaS). They also offer Data (Storage) as a Service, Platform as a Service, and Infrastructure as a Service. The company claims to have deployed over 120K virtual machines in 19 DCs. The edge/core server connection speeds are 10G/40G today and will be upgraded to 25G-40G/100G in the near future. Rick said that growth in cloud DC traffic is driving the need for higher interconnect speeds. He noted that Oracle’s “Super Cluster” uses an “InfiniBand network in a full rack.”
In conclusion, Rick said that commercial computing was converging on the following:
- 2 socket scale-out technology
- Microprocessors with multiple cores and many threads per core
- Enterprise DC (premises resident) microprocessors have bandwidth needs which are being met
- Cloud DC (resident) microprocessors require efficiency in cost, power, packaging and speed
Intel Omni-Path Architecture: Enabling Scalable, High Performance Fabrics, by Todd Rimmer:
Unveiled for the first time at Hot Interconnects 2015, Intel’s Omni-Path Architecture (OPA) was said to be a “multi-generational interconnection fabric, designed specifically for both high-performance computing (HPC) and compute/data servers.” OPA is capable of scaling to tens of thousands of nodes—and eventually more—at a price competitive with today’s fabrics, according to the company’s website description.
“The Intel OPA 100 Series product line is an end-to-end solution of PCIe* adapters, silicon, switches, cables, and management software.” Given the time and space limitations of this article, we leave the details of the product line for the reader to investigate.
To obtain the whitepaper titled: Intel® Omni-Path Architecture: Enabling Scalable, High Performance Fabrics, you must fill out a form at this link.
From the above referenced Intel whitepaper:
“The Intel OPA is designed for the integration of fabric components with CPU and memory components to enable the low latency, high bandwidth, dense systems required for the next generation of data center. Fabric integration takes advantage of locality between processing, cache and memory subsystems, and communication infrastructure to enable more rapid innovation. Near term enhancements include higher overall bandwidth from the processor, lower latency, and denser form factor systems.”
Facebook Panel Participation & Intra-DC invited talk by Katharine Schmidtke:
A 90-minute panel session on “HPC vs. Data Center Networks” raised more questions than it answered. While comprehensively covering that panel is beyond the scope of this article, we highlight a few takeaways and the comments and observations made by Facebook’s Katharine Schmidtke, PhD.
- According to Mellanox and Intel panelists, InfiniBand is used to interconnect equipment in HPC environments, but ALSO in large DC networks where extremely low latency is required. We had thought that 100% of DCs used 1G/10G/40G/100G Ethernet to connect compute servers to switches and switches to each other. That might be closer to 90 or 95%, with InfiniBand and proprietary connections making up the rest.
- Another takeaway was that ~80 to 90% of cloud DC traffic is now East-West (server to server within the DC, via a switch/router) instead of North-South (traffic entering or leaving the DC, e.g. between external clients and servers) as it had been for many years.
- Katharine Schmidtke, PhD talked about Facebook’s intra DC optical network strategy. Katharine is responsible for Optical Technology strategy at Facebook. [She received a PhD in non-linear optics from Southampton University in the UK and did postdoctoral work at Stanford University.]
- There are multiple FB DCs within each region.
- Approximately 83% of active daily FB users reside outside the US and Canada.
- Connections between DCs are called Data Center Interconnects (DCIs). There’s more traffic within a FB DC than in a DCI.
- Fabric, first revealed last November, is the next-generation Facebook DC network. It’s a single high-performance network, instead of a hierarchically oversubscribed system of clusters.
- Wedge, also introduced in 2014, is a Top of Rack (ToR) Switch with 16 to 32 40G Ethernet ports. It was described as the first building block for FB’s disaggregated switching technology. Its design was the first “open hardware switch” spec contribution to the Open Compute Project (OCP) at their 2015 annual meeting. Facebook also announced at that same OCP meeting that it’s opening its central library of FBOSS – the software behind Wedge.
- Katharine said FB was in the process of moving from Multi-Mode Fiber (MMF) to Single Mode Fiber (SMF) for use within its DC networks, even though SMF has been used almost exclusively for telco networks with much larger reach/distance requirements. She said CWDM4 over duplex SMF was being implemented in FB’s DC networks (more details in next section).
- In answer to a question, Katharine said FB had no need for (photonic) optical switching.
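To make the phrase “hierarchically oversubscribed” from the Fabric bullet above more concrete, here is a minimal sketch of how an oversubscription ratio is usually computed: the total server-facing (downlink) capacity of a switch divided by its total uplink capacity. The port counts below are purely hypothetical illustrations and were not given in the talk.

```python
def oversubscription_ratio(down_ports: int, down_gbps: int,
                           up_ports: int, up_gbps: int) -> float:
    """Downlink capacity divided by uplink capacity for one switch."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Hypothetical oversubscribed ToR: 48 x 10G to servers, 4 x 40G uplinks.
oversubscribed = oversubscription_ratio(48, 10, 4, 40)

# Hypothetical non-oversubscribed switch: 16 x 40G down, 16 x 40G up.
non_blocking = oversubscription_ratio(16, 40, 16, 40)

print(f"Oversubscribed ToR: {oversubscribed}:1")
print(f"Non-blocking switch: {non_blocking}:1")
```

A ratio above 1 means servers can collectively offer more traffic than the uplinks can carry, which matters more as East-West traffic grows.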
Facebook Network Architecture & Impact on Interconnects:
FB’s newest DC, which went on-line Nov 14, 2014, is in Altoona, IA, which is just north of Interstate Highway 80. It’s a huge nondescript building which is 476K square feet in area. It’s cooled using outside air, uses 100% renewable energy and is very energy-efficient in terms of overall power consumption (more on “power as pain point” below). Connectivity between DC switches is via 40G Ethernet over MMF in the “data hall.”
Fabric (see above description) has been deployed in the Altoona DC. Because it “dis-aggregates” (i.e. breaks down) functional blocks into smaller modules or components, Fabric results in MORE INTERCONNECTS than in previous DC architectures.
As noted in the previous section, FB has DCs in five (soon to be seven) geographic regions, with multiple DCs per region.
100G Ethernet switching, using QSFP28 (Quad Small Form-factor Pluggable) optical transceivers [see Note 1], will be deployed in 2016, according to Katharine. The regions or DCs to be upgraded to 100G speeds were not disclosed.
Note 1. The QSFP28 form factor has the same footprint as the 40G QSFP+. The “Q” stands for “Quad.” Just as the 40G QSFP+ is implemented using four 10-Gbps lanes or paths, the 100G QSFP28 is implemented with four 25-Gbps lanes or paths.
Cost-efficient SMF optics is expected to drive the price down to $1/Gbit/sec very soon. SMF was said to be “future proofing” FB’s intra DC network [see Note 2], in terms of both future cost and ease of installation. The company only needs a maximum reach of 500m within any given DC, even though SMF is spec’d at 2km. Besides reach, FB relaxed other optical module requirements like temperature and lifetime/reliability. A “very rapid innovation cycle” is expected, Katharine said.
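The lane arithmetic in Note 1 can be written out as a small sketch. The function and variable names are our own and are used only for illustration; the lane counts and per-lane rates are the nominal figures given in the note.

```python
LANES = 4  # the "Q" (Quad) in QSFP

def aggregate_gbps(lane_rate_gbps: int, lanes: int = LANES) -> int:
    """Nominal aggregate rate of a multi-lane module: lanes x per-lane rate."""
    return lanes * lane_rate_gbps

qsfp_plus = aggregate_gbps(10)  # 40G QSFP+: four 10-Gbps lanes
qsfp28 = aggregate_gbps(25)     # 100G QSFP28: four 25-Gbps lanes

print(f"QSFP+ : {qsfp_plus} Gbps")
print(f"QSFP28: {qsfp28} Gbps")
print(f"Gain in the same footprint: {qsfp28 / qsfp_plus}x")
```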
Note 2. Facebook’s decision to use SMF was the result of an internal optical interconnects study. The FB study considered multiple options to deliver greater bandwidth at the lowest possible cost for its rapidly growing DCs. The 100G SMF spec is primarily for telcos as it supports both 10 km and 2 km distances between optical transceivers. That’s certainly greater reach than needed within any given DC. FB will use the 2 km variant of the SMF spec, but only up to 500m. “If you are at the edge of optical technology, relaxing just a little brings down your cost considerably,” Dr. Schmidtke said.
A graph presented by Dr. Schmidtke, and shown in EE Times, illustrates that SMF cost is expected to drop sharply from 2016 to 2022. Facebook intends to move the optical industry to new cost points using SMF with compatible optical transceivers within its DCs. The SMF can also be depreciated over many years, Katharine said.
FB’s deployed optical transceivers will support Coarse Wavelength Division Multiplexing 4 (CWDM4) Multi-Source Agreement over duplex SMF. CWDM4 is a spec for 4 x 25G Ethernet modules and is supported by vendors such as Avago, Finisar, JDSU, Oclaro and Sumitomo Electric.
CWDM4 over duplex SMF was positioned by Katharine as “a new design and business approach” that drives innovation, not iteration. “Networking at scale drives high volume, 100s of thousands of fast (optical) transceivers per DC,” she said.
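To see why the $1/Gbit/sec target matters at this volume, here is a rough back-of-the-envelope sketch. The price target and the 100G module rate come from the talk; the transceiver count of 200,000 is a hypothetical stand-in for “100s of thousands,” not a figure Facebook disclosed.

```python
PRICE_PER_GBPS_USD = 1.0       # target price cited in the talk ($1/Gbit/sec)
MODULE_RATE_GBPS = 100         # one 100G (4 x 25G) CWDM4 module
TRANSCEIVERS_PER_DC = 200_000  # hypothetical; the talk said "100s of thousands"

def module_cost_usd(rate_gbps: float, price_per_gbps: float) -> float:
    """Cost of one optical module at a given price per Gbit/sec."""
    return rate_gbps * price_per_gbps

per_module = module_cost_usd(MODULE_RATE_GBPS, PRICE_PER_GBPS_USD)
per_dc = per_module * TRANSCEIVERS_PER_DC

print(f"Cost per 100G module: ${per_module:,.0f}")
print(f"Optics cost per DC (hypothetical volume): ${per_dc:,.0f}")
```

Under these assumptions, every additional dollar per Gbit/sec would add roughly another $20 million per DC, which is why relaxing the module spec to lower its price is attractive at this scale.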
Other interesting points in answer to audience questions:
- Patch panels (which interconnect the fibers) make up a large part of Intra DC optical network system cost. For more on this topic, here’s a useful guide to fiber optics and premises cabling.
- Power consumed in switches and servers can’t keep scaling up with bandwidth consumption. For example, if you double the bandwidth, you CAN’T double the power consumed! Therefore, it’s critically important to hold the power footprint constant as the bandwidth is increased.
- More power is consumed by the Ethernet switch chip than by an optical transceiver module.
- Supplying large amounts of power into a mega DC is the main pain point for the DC owner (in addition to the cost of electricity/power there are significant cooling costs as well).
- FB is planning to move fast to 100G (in 2016) and to 400G Ethernet networks beyond that time-frame. There may be a “stop over” at 200G before 400G is ready for commercial deployment, Katharine said in answer to a question from this author.
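The power point above can be restated as an energy-per-bit budget: if the power footprint is held constant while bandwidth doubles, the energy spent per bit must halve. The sketch below illustrates this for the 100G, 200G and 400G steps mentioned in the talk. The 3.5 W power budget is a hypothetical number chosen only for the arithmetic, not one given by Facebook.

```python
def picojoules_per_bit(power_watts: float, bandwidth_gbps: float) -> float:
    """Energy per bit in pJ. 1 W at 1 Gbps is 1 nJ/bit, i.e. 1000 pJ/bit."""
    return power_watts * 1000.0 / bandwidth_gbps

POWER_BUDGET_W = 3.5  # hypothetical fixed power budget per port

# Energy-per-bit budget at each Ethernet speed, with power held constant.
budgets = {rate: picojoules_per_bit(POWER_BUDGET_W, rate)
           for rate in (100, 200, 400)}

for rate, pj in budgets.items():
    print(f"{rate}G at {POWER_BUDGET_W} W -> {pj} pJ/bit")
```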
End Note: 2015 Hot Interconnects Part II has been published. It covers Huawei’s open source activities, Machine Learning and its applications (especially to networking), and highlights of an excellent tutorial on Software Defined WANs.