Today’s IT organizations must maximize resource utilization to deliver the computing capabilities the business needs, when and where they’re needed. That pressure has led many organizations to build multi-purpose clusters, which often compromises performance for individual workloads.
Even worse from an ROI perspective, once resources are no longer required for a particular project, they often cannot be redeployed to another workload with precision and efficiency. Composable disaggregated infrastructure (CDI) can hold the key to solving this optimization problem while also delivering bare-metal performance.
At its core, CDI is the concept of connecting a set of disaggregated resources with an NVMe-over-Fabrics (NVMe-oF) solution so that you can dynamically provision hardware, regardless of scale. This infrastructure design provides the flexibility of the cloud and the value of virtualization with the performance of bare metal. Because it decouples applications and workloads from the underlying hardware, CDI lets you run diverse workloads on a single cluster while still optimizing for each workload, and it even supports multi-tenant environments.
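To make the composition model concrete, here is a minimal, vendor-neutral Python sketch of the idea: devices live in shared pools, a "composed" server is simply a reservation against those pools, and releasing the server returns its devices for the next workload. The class and pool names are illustrative assumptions, not any vendor's API.

```python
# Toy model of composable infrastructure (illustrative only).
# Devices sit in shared pools; a "server" is composed by reserving
# devices from those pools and decomposed to return them for reuse.

class ResourcePool:
    def __init__(self, name, count):
        self.name = name
        self.free = count

    def allocate(self, n):
        if n > self.free:
            raise RuntimeError(f"pool '{self.name}' has only {self.free} free")
        self.free -= n
        return n

    def release(self, n):
        self.free += n

class ComposedServer:
    def __init__(self, pools, **request):
        # e.g. request = {"gpu": 4, "nvme": 8}
        self.pools = pools
        self.granted = {k: pools[k].allocate(v) for k, v in request.items()}

    def decompose(self):
        # Return every device to its pool for the next workload.
        for kind, count in self.granted.items():
            self.pools[kind].release(count)
        self.granted = {}

pools = {"gpu": ResourcePool("gpu", 16), "nvme": ResourcePool("nvme", 32)}
node = ComposedServer(pools, gpu=4, nvme=8)   # compose for an AI training job
node.decompose()                              # hardware goes back to the pools
```

The point of the sketch is the lifecycle: nothing is permanently wired to one server, so the same GPUs and drives can serve a different workload an hour later.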
Software providers often used in CDI-based clusters include Liqid and GigaIO. Liqid Command Center™ is a powerful management software platform that dynamically composes physical servers on demand from pools of bare-metal resources. GigaIO FabreX is an enterprise-class, open-standard solution that enables complete disaggregation and composition of all resources in the rack.
The disaggregated resources in CDI allow you to dynamically provision clusters with best-fit hardware, without the performance penalty you would see in a cloud-based environment. For HPC and AI, the value of CDI comes from the flexibility to match the underlying hardware to different workloads and environments. This makes CDI more cost-effective and scalable than comparable cloud services, improving ROI and lowering costs.
For AI and HPC workloads, performance is still the top priority, and on-premises hardware delivers better performance, with the ability to burst to the cloud on an as-needed basis. A well-designed cluster built from commercial off-the-shelf (COTS) hardware and connected with PCIe, Ethernet, and InfiniBand can increase the utilization, flexibility, and effective use of valuable data center assets. Organizations that implement CDI realize, on average, a 2x to 4x increase in data center resource utilization.
Beyond optimizing resource allocation, CDI provides several additional benefits for a dynamically configured system, and a wide variety of technology areas can benefit from it, deep learning among them.
For deep learning, it is best to keep clusters on-premises because on-premises computing can be more cost-effective than cloud-based computing when highly utilized. It’s also advisable to keep primary storage close to on-premises compute resources to maximize network bandwidth while limiting latency.
There are two critical factors in deploying a successful CDI-based cluster. The first is a design that properly integrates leading-edge CDI software.
As mentioned above, two software platforms often used in CDI clusters are Liqid Command Center and GigaIO FabreX. Both are technologies Silicon Mechanics has worked with before and uses in its CDI-based clusters.
Liqid Command Center is fabric management software for bare-metal machine orchestration. Command Center provides:
Policy-based automation and dynamic provisioning of resources
Advanced cluster, machine, and device statistics and monitoring
Scalable architecture supporting high availability (HA)
Multiple control methods, including GUI and RESTful API
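Because composition can be driven through a RESTful API, it is straightforward to automate. The sketch below shows what programmatic composition could look like against a generic fabric manager; the endpoint URL, paths, and JSON fields are hypothetical placeholders for illustration, not Liqid's actual API.

```python
# Hypothetical sketch of driving composition through a REST control path.
# The host, paths, and payload fields are illustrative placeholders,
# not the actual Liqid Command Center API.
import requests

FABRIC_API = "https://fabric-mgmt.example.com/api/v1"  # placeholder endpoint

def compose_machine(name, gpus, nvme_drives):
    """Request a bare-metal machine composed from pooled resources."""
    spec = {
        "name": name,
        "resources": {"gpu": gpus, "nvme": nvme_drives},
    }
    resp = requests.post(f"{FABRIC_API}/machines", json=spec, timeout=30)
    resp.raise_for_status()
    return resp.json()["machine_id"]

def decompose_machine(machine_id):
    """Release the machine's devices back to the shared pools."""
    resp = requests.delete(f"{FABRIC_API}/machines/{machine_id}", timeout=30)
    resp.raise_for_status()

# Compose a GPU node for a training run, then return the hardware afterward.
machine = compose_machine("dl-train-01", gpus=4, nvme_drives=8)
decompose_machine(machine)
```

A policy engine or job scheduler could call endpoints like these to compose machines as jobs arrive and decompose them as jobs finish, which is what "policy-based automation" amounts to in practice.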
GigaIO FabreX is an open-standard solution that allows you to use your preferred vendor and model for servers, GPUs, FPGAs, storage, and any other PCIe resource in your rack. In addition to composing resources to servers, FabreX can compose servers over PCIe. FabreX enables true server-to-server communication across PCIe and makes cluster-scale compute possible, with direct memory access by an individual server to the system memory of all other servers in the cluster fabric.
High-performance, low-latency networking, like InfiniBand from NVIDIA Networking, is the second critical element in the way CDI operates. It’s possible to disaggregate just about everything: compute (Intel, AMD, FPGAs), data storage (NVMe, SSD, Intel Optane, etc.), and GPU accelerators (NVIDIA GPUs). You can rearrange these components however you see fit, but the network underneath them stays the same. Think of networking as a fixed resource with a fixed effect on performance, as opposed to the resources that are disaggregated.
It is important to plan an optimal network strategy for a CDI deployment. InfiniBand is ideal for large-scale or high-performance environments; Ethernet is a strong choice for smaller clusters. If you expand over time, that underlying network must support everything that comes up over the system’s lifecycle, as the rough rule-of-thumb sketch below illustrates.
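As a small illustration of that planning step, here is a rule-of-thumb decision sketch. The node-count threshold and the weighting of growth are assumptions made for the example, not fixed guidance.

```python
# Rule-of-thumb fabric selection for a CDI deployment (illustrative only).
# The 32-node threshold below is an assumption for the sketch, not a standard.

def choose_fabric(node_count, needs_ultra_low_latency, growth_expected):
    """Suggest an interconnect for the cluster's fixed network layer."""
    if needs_ultra_low_latency or node_count >= 32:
        return "InfiniBand"   # large scale / tightly coupled HPC and AI
    if growth_expected:
        return "InfiniBand"   # headroom for the system's full lifecycle
    return "Ethernet"         # strong, economical fit for smaller clusters

print(choose_fabric(node_count=8, needs_ultra_low_latency=False,
                    growth_expected=False))   # -> Ethernet
```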
Today, many organizations run demanding and complex workflows, such as HPC and AI, that require massive levels of costly resources. This drives IT departments to find agile solutions that effectively manage the on-premises data center while delivering the flexibility typically provided by the cloud. CDI is quickly emerging as a compelling option for deploying applications that incorporate these advanced technologies.
Silicon Mechanics is an engineering firm providing custom, best-in-class solutions for HPC/AI, storage, and networking, based on open standards. The Silicon Mechanics Miranda CDI Cluster is a Linux-based reference architecture that provides a strong foundation for building disaggregated environments.
Get a comprehensive understanding of CDI clusters and what they can do for your organization by downloading the Inside HPC white paper on CDI.
Silicon Mechanics, Inc. is one of the world’s largest private providers of high-performance computing (HPC), artificial intelligence (AI), and enterprise storage solutions. Since 2001, Silicon Mechanics’ clients have relied on its custom-tailored open-source systems and professional services expertise to overcome the world’s most complex computing challenges. With thousands of clients across the aerospace and defense, education/research, financial services, government, life sciences/healthcare, and oil and gas sectors, Silicon Mechanics solutions always come with “Expert Included”℠.