Dan Danciu

Many discussions on AI over the past year have combined enthusiasm for the technology's capabilities with apprehension about its ethical and societal implications. Business executives, employees and, in some cases, customers have pressed companies to integrate AI into their solutions. As an IT service provider, we have engaged in numerous conversations with clients on the topic of AI over the past eight-plus years. While many recent discussions have revolved around Generative AI (GenAI), established approaches such as machine learning and deep learning have also experienced significant growth thanks to the widespread interest in AI.

 

Decision makers have expressed concerns about a lack of understanding of the landscape and of the specific requirements for running AI workloads efficiently, commonly referred to as AI Infrastructure. To address these concerns, reports like this one from Forrester provide insights into the various players and their offerings. However, they offer limited actionable information on how companies adopt, and often combine, multiple providers to deliver business value.

 

Drawing from our extensive experience working with diverse companies across multiple industries, we have identified a consistent trend in how organisations approach AI Infrastructure. 

 

SaaS and cloud offerings

 

The majority of companies, especially those lacking in-house experience and aiming for efficiency, opt for Software as a Service (SaaS) solutions. Typically, they turn to the major hyperscale cloud providers, as this is a natural extension of services they already consume. It is worth noting, however, that there are other players in the market, such as DataRobot or Dataiku.

 

This approach offers several advantages, including predictable costs and shielding from the complexities of setting up and maintaining your own AI Infrastructure. Anyone who has attempted to install CUDA drivers (for NVIDIA GPUs) and align driver versions with the specific versions required by ML frameworks can attest to the challenges involved.
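
For illustration, here is a minimal sketch, assuming PyTorch as the framework, of the version checks this alignment typically involves; the driver version itself is usually confirmed separately with nvidia-smi:

```python
# Quick sanity checks when aligning CUDA with an ML framework (PyTorch here).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # The CUDA runtime version this PyTorch build was compiled against;
    # it must be supported by the installed NVIDIA driver.
    print("CUDA (build) version:", torch.version.cuda)
    print("cuDNN version:", torch.backends.cudnn.version())
    print("GPU:", torch.cuda.get_device_name(0))
```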

 

Furthermore, this approach ensures that best practices are already in place for crucial aspects like data asset versioning and experiment tracking, allowing companies to learn and adapt as they progress. 
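
As a concrete illustration, here is a minimal sketch of experiment tracking, assuming MLflow (managed SaaS trackers expose the same pattern); the experiment name, parameters and metric are hypothetical:

```python
# Log each training run's parameters and metrics so experiments are reproducible.
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("data_version", "v2024-01-15")  # tie the run to a data snapshot
    # ... train the model here ...
    mlflow.log_metric("val_auc", 0.91)
```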

 

On-premises solutions

 

As experience accumulates and AI workloads expand, often driven by cost considerations, companies may transition to on-premises resources. Initially, they may offload round-the-clock tasks, or simply allocate dedicated GPU compute for data scientists' work to reduce wait times. Given that a quality GPU in the cloud starts at approximately $4 per hour, the financial benefits of this shift are evident.
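
A back-of-envelope calculation makes the point; the cloud rate is the ~$4/hour figure above, and the on-premises purchase price is a hypothetical figure for a comparable card:

```python
# Always-on cloud GPU cost per year versus a rough on-prem purchase price.
hourly_rate = 4.0                       # USD/hour, cloud on-demand
annual_cloud = hourly_rate * 24 * 365   # ~= $35,040 per GPU per year
on_prem_gpu = 30_000                    # hypothetical purchase price (hardware only)

print(f"Cloud, always-on: ${annual_cloud:,.0f}/year")
print(f"On-prem breaks even in ~{on_prem_gpu / annual_cloud:.1f} years")
```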

 

However, managing the new infrastructure introduces additional costs and requires expertise that organisations may need to acquire through specialised hiring or by engaging service providers. Such resources are usually shared across two or three teams to optimise spending, but data governance (especially access control) is almost always non-existent: from a data perspective, everybody sees everything.

 

As workloads continue to grow and the expense of dedicated infrastructure per team or department becomes substantial, decision-makers typically invest in cross-company solutions. This allows them to strike a balance between their GPU-time needs and resource availability. In many cases, on-premises resources are supplemented with cloud, and sometimes even multi-cloud, to absorb demand spikes and balance cost against immediate availability. While such a solution offers significant benefits, it is crucial to recognise its much higher complexity; points worth mentioning include:

 

  • Making large volumes of data available to the GPU efficiently during training (see the sketch after this list).
  • Treating infrastructure as code as mandatory rather than optional.
  • Adapting MLOps practices and tools to the needs of different departments.
  • Implementing observability and traceability for maintainability, AIOps and even cost allocation.
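
On the first point, here is a minimal sketch, assuming PyTorch, of keeping a GPU fed during training; the dataset is a hypothetical stand-in for a large on-premises corpus:

```python
# Overlap data loading with GPU compute via worker processes and pinned memory.
import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):  # hypothetical stand-in for a large dataset
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    RandomDataset(),
    batch_size=64,
    num_workers=8,       # parallel workers keep the GPU fed
    pin_memory=True,     # page-locked host memory enables fast async copies
    prefetch_factor=4,   # each worker queues batches ahead of the GPU
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # overlaps copy with compute
    # ... forward/backward pass ...
    break
```
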
Although this journey mainly focuses on building and evaluating ML models, there are other complexities to consider. For large enterprises, as these models become part of the decision-making side of the business process, model deployment and inference services become considerable matters to solve; inference has its own set of particularities. For consumer electronics companies or providers of original equipment manufacturer (OEM) solutions, where each target device carries one or more models, the complexities go far beyond that.
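
As a simple illustration of the deployment side, here is a minimal serving sketch, assuming FastAPI and a scikit-learn model serialised with joblib; the model path and feature schema are hypothetical:

```python
# Expose a trained model behind an HTTP inference endpoint.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical serialised model

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```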

 

Large language models

 

Companies are under pressure to leverage Large Language Models (LLMs) and other generative solutions. Some of that pressure is driven by the excitement of new possibilities, some by the fear of being left behind. Interest in AI Infrastructure providers and associated practices has surged. Even for companies that have been active in this space for the last half-decade, the move to incorporating generative solutions has come with difficulties, because existing tools and practices do not cope well with the particularities of LLMs and other generative solutions. For companies with little or no experience in building and running AI, it is extremely difficult to get up to speed quickly, given the lack of established practices and patterns and the multitude of solutions that are natural in such a novel field.
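
One of those particularities is sheer model size. As a back-of-envelope sketch (the parameter counts are illustrative), the weights alone of a large model can exceed the memory of any single common GPU, which breaks tooling built around single-device training and serving:

```python
# Approximate GPU memory needed just to hold model weights.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """fp16/bf16 stores 2 bytes per parameter; weights only, no activations."""
    return n_params * bytes_per_param / 1e9

for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name} model: ~{weight_memory_gb(n_params):.0f} GB in fp16")
# 7B: ~14 GB (fits on one 24 GB card); 70B: ~140 GB (must be sharded across GPUs)
```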

 

This is why IT service providers with a demonstrated history of successful AI projects and experience across various industries (including those that quickly embrace new technologies) are in a favourable position to help organisations use AI and generative AI as integral components of their business. If you need help with your AI infrastructure or wish to speak to experts in the field, please feel free to contact us to discuss your challenges further.

 
