I wrote about this just six months ago, in August 2023, and the field has advanced so much that many of the assumptions made at that time may no longer apply. The world has seen LLaMA 2 released as an API via Amazon Bedrock and Amazon SageMaker, newer models from Anthropic, the general availability (GA) of Amazon Bedrock, and the launch of Mistral's Mixtral-8x7B model, which has completely changed the economics around model size, among other developments.
Let's revisit that question in 2024: should large enterprises build their own AI or buy it? This is a two-part series. In this first part, I will cover whether to buy, build, or customize.
Before we dive deep into this, let’s baseline our understanding of build vs. buy.
Buy: Like a SaaS offering, buy API services on a subscription/pay-as-you-go basis (services such as Amazon Bedrock or the ChatGPT APIs).
Build: Build from scratch, or leverage an open architecture and build on top of it (on your own GPUs, on Amazon EC2, or via a managed option such as Amazon SageMaker, etc.).
While companies have traditionally made decisions around build vs. buy, a third category is emerging for AI, particularly in the enterprise segment: “tune” or “customize”.
When making decisions about whether to build, buy, or customize, consider the following factors:
$$$: Every enterprise is in the business of making money. How you can make money using generative AI should be the first question you answer. What investment can you make, and what is the ROI, short term and long term? Will it improve your top line (e.g., new revenue, new products/services) or at least your bottom line (e.g., productivity gains, operational efficiencies)?
Time-to-market: how soon do you need the functionality deployed to production (not just a proof of concept)? Is your CEO or board asking you to be ready next week, or next month?
Skills: does your team have the required science skills, or can you quickly hire or outsource them? You don't need hundreds of scientists, but you do need a few. If BloombergGPT and the Falcon LLMs are any clue, their teams included fewer than 10 people (including leaders!).
IP: Do you want to own the intellectual property of the AI model? For most enterprise customers I talk to, this is the biggest factor in the decision.
Security and privacy: You may not want to expose any of your data to third-party model providers for security and privacy reasons. (Bias alert: Amazon Bedrock and Amazon SageMaker do not use your data to train models, and it is always isolated per customer, so your data privacy and security remain intact.)
The chart above provides a very high-level overview: as you move from zero-shot/few-shot prompting to fine-tuning to pre-training a model, accuracy gets higher, and so does cost.
Let's first quickly review what it takes to buy and build in today's market; then we will jump into “tune” or “customize”.
Buy:
Let's take the scenario where you buy a service on a subscription, pay-as-you-go basis.
A word of caution before you go this route: when you use a third-party service to perform generative AI tasks (e.g., summarization, text generation), you are required to send certain information (e.g., your organization's data) to the provider in the form of a prompt. Not all providers keep your data private by default; some may use the data you send to re-train their foundation models. As an enterprise, you want to be careful about what you expose, as it can contain your own proprietary information. Review the EULA (end-user license agreement) to ensure that your data remains yours!
OK, with security covered, let's look at how the subscription model typically works. Pricing is based on the variety of tasks you want to perform. For example, suppose your users want to summarize a large article because they only want to read the summary, not the whole thing. You build a mechanism for your users (or expose them to the third party directly) to send the article to the model/service, and within a few seconds the model returns a summarized version. This technique is known as “prompting”.
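To make “prompting” concrete, here is a minimal sketch of what such a summarization call looks like. The `call_llm` helper is a placeholder for whichever provider SDK you subscribe to (Amazon Bedrock, OpenAI, etc.), not a real API:

```python
# Minimal sketch of task-based prompting (summarization).
# `call_llm` is a placeholder for your provider's SDK call, not a real API.

def call_llm(prompt: str) -> str:
    """Send the prompt to your subscribed LLM service and return its output."""
    raise NotImplementedError("wire up Amazon Bedrock, OpenAI, etc. here")

article = open("annual_report.txt").read()

prompt = (
    "Summarize the following document in about 20% of its original length, "
    "preserving key figures and conclusions:\n\n" + article
)

print(call_llm(prompt))
```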
Now assume your users are required to review various financial documents, such as public companies' quarterly or annual reports. According to the American Accounting Association, the mean annual report runs 55,000 words (~75K tokens). Assuming the model returns a summary at 20% of the original length (11,000 words, or ~15K tokens), we are looking at roughly $0.06 to summarize one annual report. There are 58,200 publicly listed companies in the world; if your users need to summarize all of their annual reports, it takes approximately $3,492. Do they also need quarterly reports? Let's add another $2,500 for the three additional quarterly reports (assuming each is about 1/4 the size of an annual report).
Your users may also need earnings call summarization, plus sentiment analysis of the management discussion to understand its emotional tenor and key points. Typical earnings calls run 45 to 60 minutes, with roughly 7,500-10,000 words spoken. Applying the same math across all publicly listed companies, that is approximately $625 per quarter for summarization and another $625 for sentiment analysis.
So for approximately $7,000, your users can summarize the entire corpus of the world's annual reports and summarize/analyze sentiment on earnings call transcripts, just by using generative AI as a service, without building any models. Remember, though, that this is a one-time activity: if more users repeat the work, or you need to redo it a few times a year, multiply that $7,000 accordingly.
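Here is the arithmetic above as a small script, so you can swap in your own assumptions; all the inputs come from the estimates in this section, not from any vendor's published price list:

```python
# Back-of-the-envelope cost model for the "buy" (SaaS/API) scenario.
# All inputs are the assumptions stated above, not quoted vendor prices.

COMPANIES = 58_200                 # publicly listed companies worldwide
ANNUAL_REPORT_COST = 0.06          # ~75K tokens in, ~15K tokens out

annual = COMPANIES * ANNUAL_REPORT_COST    # ≈ $3,492
quarterly = 2_500                          # 3 reports at ~1/4 size (≈ $2,619, rounded)
call_summaries = 625                       # one quarter of earnings calls
call_sentiment = 625                       # sentiment analysis on the same calls

total = annual + quarterly + call_summaries + call_sentiment
print(f"one-time 'buy' cost: ${total:,.0f}")   # ≈ $7,242, i.e. roughly $7,000
```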
Side bar: You may point out that a company in the financial domain is most likely using a Bloomberg Terminal, and Bloomberg now offers AI-powered earnings call summaries at no extra cost to Terminal users. Yes, I am aware of it, but that is a solution only for call transcripts, not for summarizing annual reports. Let's get back to the main story.
Build:
Now, let’s take a scenario of you building a foundation model.
Building or pre-training a large-scale foundation model from scratch requires oceans of data, computing power, and financial resources.
When it comes to the price of building/pre-training a model, we need to consider multiple factors: 1) fixed cost (to train the model), 2) variable cost (to serve the model), 3) the time and value of the skilled scientists and engineers who build and evaluate such models, and 4) the ability to acquire and process the data.
Building a custom foundation model similar to LLaMA 2 for summarization and sentiment analysis of public companies could cost around $3.8 million in fixed expenses, $4,000 in variable inference costs, and $1 million for staffing, totaling approximately $4.8 million. (The calculations and assumptions on building LLaMA 2 are still largely valid from August 2023, so if you are looking for an in-depth analysis of how I derived these numbers, please review my earlier article.)
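For quick reference, the build-scenario totals reduce to a three-line sum (the derivations are in the August 2023 article):

```python
# Build-scenario totals from the text; see the August 2023 article for derivations.
fixed = 3_800_000      # pre-training compute (LLaMA 2-class model)
variable = 4_000       # inference for the summarization/sentiment use cases
staffing = 1_000_000   # scientists and engineers
print(f"total: ${fixed + variable + staffing:,}")  # $4,804,000, i.e. ~$4.8M
```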
However, the benefit of building a powerful general-purpose model like LLaMA 2 is that it enables numerous additional tasks beyond those initial use cases, requiring only incremental variable costs for more inferences. In this scenario, periodic retraining of the model would likely be necessary to keep it current and may incur additional expenses. The substantial upfront investment for developing a custom foundation model can be justified if the model is versatile enough to amortize the fixed costs across many production applications.
(The above estimates do not account for any data processing requirements, as those can vary widely depending on each organization's unique data and workflows.)
Customization:
Let's understand why a company would want to customize a model instead of buying one (via a SaaS/API offering).
Most companies possess their own data, which is (hopefully) never used to train any open-source or proprietary models. When they aim to apply that domain-specific knowledge in a generative AI workflow, merely using a SaaS offering does not yield the accuracy they desire. Even after multiple iterations of prompt engineering, and even with context lengths of 128K or 200K tokens, accuracy remains a question mark for many of their tasks.
While generative AI enhances productivity, many companies have realized that it can improve not only the bottom line but also the top line, by generating new revenue or enabling new products. Often, to own the end-to-end value chain, they prefer to own the intellectual property (IP) of the core components, LLMs being one of them. But by consuming LLMs through a SaaS model, they are nowhere near owning that IP.
Companies operating in competitive environments prefer to provide best-in-class service and differentiate themselves. If every company uses one or two LLM providers, there is very little to differentiate them and stay competitive. Customizing the LLMs with their own data can lead to differentiation, but to remain competitive and best-in-class, they must adopt varying degrees of customization.
They are worried about potential bias, hallucinations, and other aspects of responsible AI when they leverage open-source or proprietary models. While many of these model providers are leveling up their game in terms of LLM safety and trust, there is still a lot of room for improvement.
RAG:
To overcome some of these challenges, many companies have started adopting Retrieval Augmented Generation (RAG), where they can still leverage the power of large LLMs but augment their knowledge with their own data. RAG is one way to provide customization for these generative AI applications. Other advanced RAG techniques, such as agents and hybrid search, have come to fruition, though adoption has been slower.
Let's dive into the very basic cost primitives associated with RAG.
The picture above shows that to perform Retrieval Augmented Generation (RAG), we first pass our proprietary data through an embedding model. Once the embeddings are created, we store them in a specialized datastore called a vector database. In production, when a user asks a question, relevant context is retrieved from the vector store; that context, combined with the query, forms a prompt that is passed to an LLM, which uses its generative skills to produce a response. There are pros and cons to this approach, but we won't delve into those details here. Returning to cost, operating this kind of pipeline involves: 1) the cost of creating the embeddings via the embedding model, 2) the storage cost for the vector database, and 3) the cost of the LLM (here we assume we are still using the SaaS/API approach and purchasing that service).
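Here is a skeleton of that pipeline. `embed` and `generate` are placeholders for your chosen embedding model and LLM (no specific vendor SDK is assumed), and the in-memory store stands in for a real vector database; the comments mark where each of the three cost components lands:

```python
# Skeleton of the RAG pipeline described above. `embed` and `generate` are
# placeholders for your embedding model and LLM; the in-memory list stands
# in for a real vector database (Pinecone, OpenSearch, pgvector, etc.).

from typing import List, Tuple

def embed(text: str) -> List[float]:
    """Call your embedding model here (e.g. an ada-002-class model)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Call your LLM here (still the SaaS/API approach in this scenario)."""
    raise NotImplementedError

def dot(a: List[float], b: List[float]) -> float:
    # Dot product works as a similarity score for normalized embeddings.
    return sum(x * y for x, y in zip(a, b))

class VectorStore:
    def __init__(self) -> None:
        self.items: List[Tuple[List[float], str]] = []

    def add(self, chunk: str) -> None:
        # 1) Embedding cost is paid here, once per chunk of proprietary data.
        self.items.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 3) -> List[str]:
        # A real vector DB replaces this with approximate nearest-neighbor search.
        qv = embed(query)
        ranked = sorted(self.items, key=lambda item: -dot(qv, item[0]))
        return [chunk for _, chunk in ranked[:k]]

def answer(store: VectorStore, question: str) -> str:
    # 2) Retrieved context plus the user's query forms the prompt...
    context = "\n---\n".join(store.search(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3) ...and the LLM (the third cost component) generates the response.
    return generate(prompt)
```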
In the earlier example, a single annual report is roughly 75K tokens, so let's assume a working set of reports totaling approximately 5 million tokens for our RAG workload. It costs about $0.50 to generate the embeddings (using OpenAI's ada-002 model) and $120/month to store them in a vector DB (like Pinecone). We still need to account for LLM usage based on query volume; let's stick with our earlier case, where we spent $7,000 on prompt workloads, and assume the RAG approach generates an equal number of prompts. Summing it up: for approximately $8,500/year, we can apply our own dataset and ask questions with higher accuracy. Even if we update our data nightly, requiring the embeddings to be recreated, it adds only another ~$180 to the bill.
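And the RAG arithmetic as a script, again using only the assumptions stated above:

```python
# RAG cost sketch, using the assumptions in the text.

TOKENS = 5_000_000              # working set assumed above
EMBED_PRICE = 0.10 / 1_000_000  # ada-002-class price per token (assumed)
VECTOR_DB_MONTHLY = 120         # Pinecone-class storage (assumed)
LLM_USAGE = 7_000               # same prompt volume as the "buy" scenario

embeddings = TOKENS * EMBED_PRICE                    # $0.50, one-time
first_year = embeddings + 12 * VECTOR_DB_MONTHLY + LLM_USAGE
print(f"RAG, first year: ${first_year:,.0f}")        # ≈ $8,440, call it $8,500

# Refreshing the data nightly means re-creating the embeddings 365 times.
print(f"nightly refresh adds: ${365 * embeddings:,.0f}")  # ≈ $183, i.e. ~$180
```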
To summarize:
simply using SaaS/API services via prompting: ~$7,000 (accuracy may be lower)
adding a RAG mechanism: ~$8,500 (higher accuracy than prompting alone, since we expose information the model has never seen)
While RAG is the simpler way to customize, it suits a limited set of applications. If a company wants to own the IP or improve its top line, it will want to fine-tune or, in some cases, completely retrain the model.
Fine-tuning:
So what does it take to fine-tune? Before we dive deeper, let's set a basic definition:
Fine-tuning a large language model (LLM) such as LLaMA 2 is the process of taking a pre-trained model and further training it on a smaller, specific dataset to adapt it to a particular task or to improve its performance/accuracy on tasks related to that dataset. In our case, we would fine-tune the model on our company's datasets. This process leverages the general knowledge the model acquired during its initial, extensive training phase (remember, the LLaMA 2 model was trained for 1.7M GPU-hours on 2 trillion tokens) and narrows its focus to become more proficient in a particular domain, or more aligned with our specific requirements. The scaling properties of LLM fine-tuning are highly task- and data-dependent, which makes selecting the optimal fine-tuning method for a downstream task non-trivial; the most commonly used are parameter-efficient fine-tuning (PEFT) strategies, such as Low-Rank Adapters (LoRA).
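To make PEFT concrete, here is a minimal LoRA setup using Hugging Face's transformers and peft libraries. The rank, alpha, and target modules are illustrative defaults, not a tuned recipe, and in practice you would load Mixtral with quantization or a multi-GPU device map given its size:

```python
# Minimal LoRA setup with Hugging Face transformers + peft.
# Hyperparameters are illustrative defaults, not a tuned recipe.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "mistralai/Mixtral-8x7B-v0.1"
base = AutoModelForCausalLM.from_pretrained(MODEL)  # in practice: quantize or shard
tokenizer = AutoTokenizer.from_pretrained(MODEL)

lora = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling factor for the adapter updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```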
Let's get back to the math. While LLaMA 2 was the highlight of the last quarter of 2023, its thunder was stolen by Mixtral-8x7B. This model is a Mixture of Experts (read more about MoE in my past article): a relatively smaller model that requires less compute for inference while still meeting or exceeding LLaMA 2 70B's benchmarks on a multitude of tasks. So, let's find out how much it costs to fine-tune Mixtral-8x7B.
Our friends at Hugging Face have published a baseline: it takes 48 GPU-hours (A100 80GB) to fine-tune Mixtral-8x7B on approximately 5M tokens. The cheapest A100 I could find is from Lambda Labs at $1.79/hr, so 48 GPU-hours costs ~$86. Let me reiterate: ~$86 to fine-tune a model on 5M tokens (the same size as the working set in our earlier scenario). Since this is a fine-tuned model, it can't be consumed as a SaaS service; we need to host it ourselves so our application can use it. This is called model inference.
Side bar: Both OpenAI and Amazon Bedrock offer fine-tuning and hosting, but only for the models they support. For our Mixtral-8x7B example, I would prefer Amazon SageMaker for both fine-tuning and hosting, as it takes away a lot of the pain of creating and scaling the infrastructure.
While MoE models can run inference very fast, they require a large amount of VRAM. Mixtral-8x7B needs about 90GB of VRAM in half precision, so we need two A100-80GB GPUs to serve it. At the same GPU price, simple math puts inference at $31,360/year.
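The fine-tuning and hosting math in one place:

```python
# Fine-tuning and hosting cost sketch, using the numbers cited above.

GPU_HOURLY = 1.79                      # Lambda Labs A100-80G rate

fine_tune = 48 * GPU_HOURLY            # 48 GPU-hours on ~5M tokens
hosting = 2 * GPU_HOURLY * 24 * 365    # two A100-80Gs, serving year-round

print(f"fine-tune (one-time): ${fine_tune:,.0f}")   # ≈ $86
print(f"hosting (per year):  ${hosting:,.0f}")      # ≈ $31,360
```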
Let’s summarize it:
fine-tuning on 5M tokens: ~$86 (mostly a one-time cost)
hosting the model so our applications can use it: $31,360 for the entire year; since we host the model ourselves, we can perform the various tasks it supports (summarization, sentiment analysis, translation, etc.) any number of times without paying anything extra per request.
Now, let's compare those three options:
While the above is a back-of-the-envelope calculation that leaves out many other costs, such as data scientist/ML engineer time and data preparation, it is still compelling: if you are looking for higher accuracy, fine-tuning is not that costly (it doesn't cost you millions). Also, if you use an open-source model with a permissive license (like MIT or Apache 2.0), you can literally own the IP of the fine-tuned model.
That said, most enterprises will not have just 5M tokens to customize or fine-tune a model with; their datasets will be in the range of tens to hundreds of billions of tokens. What would it take to use datasets of that size? Does prompting still make sense, or should they consider fine-tuning, or even building a model from scratch? Let's review that in the next article. Stay tuned!
Shameless plug:
Do you like what I write on this topic? In my “AI for Leaders” course, I cover topics exactly like this: what leaders have to focus on when adopting AI in an enterprise. Along with cost, the course also covers pertinent topics like:
How to assess the current state of AI
AI strategy framework
Risk and Regulation
Defining AI vision for your company
Organization and talent strategy
Implementation guidelines
and much more...
My February cohort is scheduled to begin on 14th Feb, 2024. Please sign up and immerse yourself in the world of AI. Whether you're an executive, a manager, or an aspiring leader, this course will empower you to be at the cutting edge of AI leadership.
Don't just watch the AI transformation unfold – be a part of it!