
Selecting A Deep Learning Infrastructure

Most people know machine learning (ML) as the recommendations you see on shopping sites and the song suggestions you see on music services. It has moved so far into the mainstream that it's often just called AI, as if that were all there is to AI.

The new frontier of ML is deep learning (DL), the advanced realm of self-driving cars. However, the infrastructure needs of DL differ from those of traditional ML. Implementations can and do vary, but there are some core factors to weigh as you build out your system.

Fortunately, modern open-source technology allows highly qualified engineering teams to design highly effective clusters that are also very strong on ROI.

AI Terms

  • Machine Learning refers to the use of AI algorithms to parse data, learn from that data, and then apply what they’ve learned to make informed decisions (or to infer that new data has certain attributes based on what has been learned – see “inference” below).
  • Deep Learning is a subset of machine learning that structures algorithms in layers to create an artificial neural network which can learn and make intelligent decisions on its own.
  • Inference is the application of what has been learned to new data (usually via an application or service) and making an informed decision regarding the data and its attributes.
  • Artificial Neural Networks are computing systems inspired by the organic neural networks found in human and other animal brains, where nodes (artificial neurons) are connected (artificial synapses) to work together.
  • Training is the process by which a model learns a new capability through exposure to existing, related data, usually in very large quantities.
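
To make these terms concrete, here is a minimal sketch of training and inference on a tiny artificial neural network. PyTorch is an assumption on our part (the same ideas apply in any DL framework), and the network size and synthetic data are purely illustrative:

```python
import torch
import torch.nn as nn

# An artificial neural network: layers of nodes (artificial neurons)
# connected by learnable weights (artificial synapses).
model = nn.Sequential(
    nn.Linear(4, 16),   # input features -> hidden layer
    nn.ReLU(),
    nn.Linear(16, 2),   # hidden layer -> two output classes
)

# Training: learn from existing labeled data, usually in large quantities.
x = torch.randn(256, 4)            # 256 synthetic samples, 4 features each
y = torch.randint(0, 2, (256,))    # synthetic labels for two classes
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)    # how wrong are the current weights?
    loss.backward()                # errors propagate back through the layers
    optimizer.step()               # ...and the weights adjust on their own

# Inference: apply what was learned to new, unseen data.
model.eval()
with torch.no_grad():
    new_sample = torch.randn(1, 4)
    decision = model(new_sample).argmax(dim=1)  # informed decision on new data
    print(f"inferred class for the new sample: {decision.item()}")
```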

Process

Choosing the AI model determines what data you need to ingest, what tools you use, which components are required, and how those components are connected. Once you’ve selected your AI model, picked your processing framework, and structured your data, you typically want to run a proof of concept (POC) for the learning phase of the project and, where cost allows, a separate one for the inference portion, since the requirements for each are different (see the sketch below). If POC testing works out, you move to production. Often, a successful program then means scaling to gain even more value from your data.
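
As a rough illustration of why the two POCs are separate, here is a hedged sketch, again assuming PyTorch: a training POC cares about sustained throughput on large batches, while an inference POC cares about per-request latency. The toy model, batch sizes, and iteration counts are illustrative assumptions, not benchmark guidance:

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Training POC: sustained throughput on large batches (samples/second).
x = torch.randn(1024, 4)
y = torch.randint(0, 2, (1024,))
steps = 50
start = time.perf_counter()
for _ in range(steps):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
elapsed = time.perf_counter() - start
print(f"training throughput: {steps * 1024 / elapsed:,.0f} samples/sec")

# Inference POC: per-request latency on single samples (milliseconds).
model.eval()
n_requests = 1000
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(n_requests):
        model(torch.randn(1, 4))
elapsed = time.perf_counter() - start
print(f"inference latency: {elapsed / n_requests * 1000:.3f} ms/request")
```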

Challenges

  • The hardware-related steps required to stand up a DL cluster each have unique challenges. Momentum for complex, resource-intensive initiatives can dissipate quickly, so fast time to proof of concept is critical. But moving from POC to production often fails due to added scale, complexity, user-adoption hurdles, and other issues. Want to scale? You need to design that capability into the hardware from the start.

  • Specific workloads need specific customizations. You can run ML on a non-GPU-accelerated cluster, but DL typically requires GPU-based systems (see the sketch after this list). And training requires the ability to adequately support ingest, egress, and processing of massive datasets.

  • One of the most important benefits of using a hardware designer that specializes in custom clusters is the ability to optimize performance for your workload. And, with some of the great new technologies out there, your AI cluster can do a lot more today than even just a few years ago, without busting the budget. However, you don’t want your optimization efforts to exhaust your hardware resources.
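
As a minimal illustration of the GPU point in the second challenge above (assuming PyTorch with CUDA), the sketch below shows the device selection and host-to-GPU data movement that make ingest, egress, and processing bandwidth so important at scale; the model and batch are toy placeholders:

```python
import torch
import torch.nn as nn

# DL code usually selects an accelerator if one is present; at cluster
# scale, falling back to CPU is rarely acceptable for training workloads.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"running on: {device}")

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)

# Ingest: data typically lands in host memory (from storage or the network)
# and must be moved onto the GPU, so sustained I/O bandwidth matters.
batch = torch.randn(1024, 4).to(device)

# Processing happens on the accelerator; egress (results, checkpoints)
# moves back to host memory and storage.
with torch.no_grad():
    results = model(batch).cpu()
print(f"processed {results.shape[0]} samples on {device}")
```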

Resources


AI Infrastructure Decision Guide

Not every organization is at the same stage of AI adoption. Read this document to better understand the stages of AI-readiness and what level of investment is best suited to where you are in your AI journey.

Read Document



Guide to Infrastructure Requirements for AI Inference vs Training

Inference and training both fall under the category of AI, but what they are, and the hardware they require, are very different. Read this document to learn key differences between an inference solution and a training solution.

Read Document

Featured Reference Architectures

Silicon Mechanics Atlas AI Cluster

A powerful system architecture designed from the ground up for large AI datasets. This Linux-based cluster design features best-of-breed technology, including NVIDIA® HGX™ and AMD EPYC™.

Learn More

Expert Included

Our engineers are not only experts in traditional HPC and AI technologies; they also routinely build complex rack-scale solutions with today's newest innovations, so we can design and build the best solution for your unique needs.

Talk to an engineer and see how we can help solve your computing challenges today.