Crunching numbers: What busy executives need to know about the cost and performance of GenAI models

Learn what you need to kickstart your own GenAI efforts

Digital Transformation
Machine learning
Tools & techs
8 min read

Unless you’ve spent the past year under a rock, you are undoubtedly aware of the buzz around Generative AI, or GenAI. Today it’s no longer a mere point of discussion; it’s a hands-on exploration of the new value streams GenAI can open, allowing people to do things faster and smarter than ever before.

I am Ilya Mokin, Head of R&D at Symfa, where we are experimenting with integrating a ChatGPT bot into our in-house ERP system to get custom reports and analytics on demand. The use case is promising, but the path is not as straightforward as one might think. In this article I delve into the major roadblocks along the way and how we solved them.

Although OpenAI’s ChatGPT is perhaps the most famous GenAI chatbot, there are hundreds of Generative AI models out there. Choosing the right one is no easy feat, since multiple factors must be taken into account: cost, IP ownership, risk tolerance, and more. This article offers a practical approach to navigating this increasingly complex landscape and will help you kickstart your own efforts with GenAI.

Table of Contents

  • A quick primer on GenAI
  • Key factors affecting the choice of GenAI implementation approach
    • 1. Paying for tokens or for infrastructure
    • 2. Protecting data privacy
    • 3. Choosing the best model vs finding the perfect match for the task at hand
  • Summing it up

A quick primer on GenAI

Generative AI is a subset of artificial intelligence that leverages deep learning techniques like neural networks to generate statistically probable outputs when prompted. Simply put, GenAI models can produce new content across key modalities such as text, image, code, video, and audio, based on patterns learned from existing data.

[Image: Key generative AI modalities]

Harnessing the power of Generative AI could revolutionize productivity, unlocking trillions of dollars in value for the global economy. McKinsey’s latest research suggests that this transformative technology has the potential to inject between $2.6 trillion and $4.4 trillion annually into various sectors.

Key factors affecting the choice of GenAI implementation approach

1. Paying for tokens or for infrastructure

Once again, we find ourselves grappling with the good old open-source vs proprietary dilemma. Many powerful GenAI models like Meta’s Llama and TII’s Falcon are open-source, and the number is growing by the day. Just a few weeks ago Elon Musk announced his plan to open up the source code behind Grok, a massive 314B parameter language model.

Having access to the source code and model weights, you can not only fine-tune a GenAI model to your needs but also deploy it locally on your own servers. This way you avoid per-token pricing and save on inference costs. However, if you opt for an open-source model, you will need to invest a pretty penny in infrastructure: to run a model like Llama, you will need a GPU with at least 24 GB of memory (60 GB is more realistic), which can cost around $2,000 per month.
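
For illustration, here is a minimal sketch of what self-hosting can look like with Hugging Face’s transformers library. The model id, hardware, and generation settings are assumptions for the example; Llama weights are gated, so you need to accept Meta’s license on Hugging Face first.

```python
# Minimal self-hosting sketch: load an open-source chat model locally and run
# a single inference. Assumes a CUDA GPU with enough memory, the accelerate
# package for device_map="auto", and that the (illustrative) model id is one
# you have access to.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative; pick a size your GPU can hold

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision roughly halves memory usage
    device_map="auto",          # place layers on the available GPU(s) automatically
)

prompt = "Summarize last month's sales figures in three bullet points."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```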

To put this number in perspective, we pay about $100 per month for ChatGPT API calls made by our Symfa GPT bot to generate ad-hoc reports, which works great at our current, limited usage level. The caveat is that if your usage ramps up and you are dealing with large amounts of data, the per-token bill can quickly outgrow that figure, and the upfront investment in your own infrastructure could pay off within about two years.
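
To make the trade-off tangible, here is a back-of-the-envelope comparison using the approximate figures above; the numbers are illustrative, not quotes.

```python
# Back-of-the-envelope comparison: flat infrastructure cost vs. pay-per-token API.
# All figures are rough approximations from this article, not real quotes.
api_cost_per_month = 100      # current ChatGPT API spend at our usage level, USD
gpu_cost_per_month = 2_000    # dedicated GPU server for a self-hosted model, USD

# Self-hosting starts to pay off once monthly API spend exceeds the GPU bill.
growth_needed = gpu_cost_per_month / api_cost_per_month
print(f"API usage would need to grow about {growth_needed:.0f}x before self-hosting breaks even")
```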

2. Protecting data privacy

Another crucial point is data security and privacy. OpenAI, for instance, claims not to use your data for training purposes, but ChatGPT still saves account information and user content, including prompts and uploaded files. Earlier this year, the Italian data protection authority concluded, based on its investigation, that ChatGPT violated the EU’s GDPR. So it’s no wonder that large-scale enterprises, particularly those in industries flooded with sensitive data like healthcare, banking, insurance, and finance, can be reluctant to put their trust in proprietary models.

3. Choosing the best model vs finding the perfect match for the task at hand

Everyone wants to use the best GenAI model, but “best” is subjective and depends on the task at hand. Opting for a bulky, complex LLM for a simple task might be overkill. The right approach is to compare models on type, performance, and latency, and then select the one that best aligns with the requirements at hand.

Hugging Face, the AI community's open-source repository, has introduced a collection of different leaderboards to benchmark and rank the newest and shiniest GenAI models, serving as a sanity check against hype and buzz. Have a look at their most popular Open LLM leaderboard:

[Screenshot: Hugging Face Open LLM Leaderboard]

As you can see, Hugging Face tracks different metrics, calculated with EleutherAI’s Language Model Evaluation Harness. Simply choosing the model at the top of the leaderboard may seem like the obvious choice, but it isn’t always a wise one: the ranking is a straightforward average across all tasks, which can mask weaknesses in the tasks that matter most for your unique requirements (a small re-weighting sketch follows the list below). Here’s a quick rundown of the most important performance metrics:

  • ARC (AI2 Reasoning Challenge) consists of grade-school level, multiple-choice science questions and assesses the model’s ability to apply logical inference, deduction, and the integration of concepts.
  • HellaSwag evaluates commonsense reasoning and the ability to predict or complete a sentence in a way that makes most sense.
  • MMLU (Massive Multitask Language Understanding) evaluates the knowledge a model acquired during pretraining across a wide array of subjects and disciplines, from STEM to the humanities.
  • TruthfulQA evaluates a model’s ability to provide truthful, factual answers. The test focuses on questions that some humans answer incorrectly due to common misconceptions and biases.
  • Winogrande is based on the original Winograd Schema Challenge (WSC) and assesses a model’s reasoning capabilities by challenging it to resolve ambiguous pronouns from context and underlying meaning.
  • GSM8K (Grade School Math 8K) assesses a model’s ability to apply mathematical reasoning and understand a textual description to solve grade-school math problems.
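
As promised above, here is a small sketch of re-ranking leaderboard entries with task weights that reflect your own priorities rather than the plain average. The model names, scores, and weights are made up for illustration.

```python
# Re-rank models using task weights that match your use case instead of the
# leaderboard's plain average. Model names, scores, and weights are made up.
scores = {
    "model-a": {"ARC": 0.68, "HellaSwag": 0.85, "MMLU": 0.70, "GSM8K": 0.55},
    "model-b": {"ARC": 0.72, "HellaSwag": 0.80, "MMLU": 0.62, "GSM8K": 0.75},
}
# A hypothetical use case that cares mostly about math and knowledge recall.
weights = {"ARC": 0.1, "HellaSwag": 0.1, "MMLU": 0.3, "GSM8K": 0.5}

def weighted_score(task_scores: dict) -> float:
    return sum(task_scores[task] * weight for task, weight in weights.items())

for name, task_scores in sorted(scores.items(), key=lambda item: weighted_score(item[1]), reverse=True):
    plain_avg = sum(task_scores.values()) / len(task_scores)
    print(f"{name}: plain average {plain_avg:.2f}, weighted score {weighted_score(task_scores):.2f}")
```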

In addition to the Open LLM Leaderboard, Hugging Face maintains other leaderboards with fine-grained rankings for different task types. This is, for instance, a snippet of their Massive Text Embedding Benchmark (MTEB) Leaderboard, which ranks models on a variety of text embedding tasks:

[Screenshot: MTEB Leaderboard]
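
If text embeddings are what you need, trying an entry from the MTEB leaderboard takes only a few lines. Here is a minimal sketch with the sentence-transformers library; the model id is an assumption, so swap in whichever leaderboard entry you want to evaluate.

```python
# Minimal text-embedding sketch with a model picked from the MTEB leaderboard.
# The model id is illustrative; any sentence-transformers-compatible model works.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

documents = [
    "Quarterly revenue grew 12% year over year.",
    "The office coffee machine is broken again.",
]
query = "How did revenue change last quarter?"

doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

# Cosine similarity shows which document best matches the query.
print(util.cos_sim(query_embedding, doc_embeddings))
```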

Are you looking to build an app with speech recognition capabilities? The Open ASR Leaderboard ranks speech recognition models evaluated on multiple datasets to assess performance under a range of conditions. That’s where you can see that although OpenAI/Whisper has the best word error rate, Nvidia/Canary has much better real-time performance.

[Screenshot: Open ASR Leaderboard]
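
Trying out one of those speech models yourself is just as quick with the transformers pipeline API. A minimal sketch, assuming you have a local audio file and the illustrative model id suits your needs:

```python
# Minimal speech-to-text sketch using the transformers ASR pipeline.
# The model id and audio path are illustrative; pick any leaderboard entry instead.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_recording.wav")  # path to a local audio file (assumed)
print(result["text"])
```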

Ok, these rankings are a great starting point. But if you want to move from theory to practice and get your hands on some models to see their capabilities, there are services that let you do that.

Check out OpenRouter, for example. It is a sort of hub that gives you access to different models, both proprietary and open-source, without having to handle their deployment on your own servers. OpenRouter claims that its standardized API lets users switch models without changing the request. That sounds great, although you might still need to tweak requests and add new conditions, since different models behave differently. To keep up with the dynamic tech landscape, such services also keep adding functionality: OpenRouter, for instance, introduced multi-modal, OpenAI-like function calling, which allows developers to describe functions and lets the model decide where those functions are needed and with which parameters.
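
Because OpenRouter exposes an OpenAI-compatible API, switching models mostly comes down to changing one string. Here is a minimal sketch with the official openai Python client; the model slugs are examples, so check OpenRouter’s catalog for the current ids.

```python
# Minimal OpenRouter sketch: the API is OpenAI-compatible, so the official
# openai client works once it is pointed at OpenRouter's base URL.
# The model slugs below are examples; consult openrouter.ai for current ids.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

for model in ("openai/gpt-4o", "meta-llama/llama-3-70b-instruct"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Give me a one-line status report template."}],
    )
    print(model, "->", response.choices[0].message.content)
```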

Long story short, services like OpenRouter are a good and rather cheap way to test your idea for a GenAI-powered solution. Once a hosted model, like one accessed via OpenRouter in our example, has proven it can deliver the functionality you need, you can deploy that model within your own infrastructure and fine-tune it or add new functionality to suit your needs.

Summing it up

Having worked on our own GPT-powered bot, I can say with confidence that starting with a well-maintained proprietary model like OpenAI’s GPT-4 is a cost-effective way to test your MVP and make sure that the solution actually brings value. Otherwise, you risk investing too much upfront only to find out that the results don’t meet your expectations. Here’s a quick summary: 

[Table: Proprietary models make sense when…]

P.S. Don’t worry, this article wasn’t written by ChatGPT. The banner, however, was generated by Envato’s AI ImageGen.

Stay tuned! Follow us on LinkedIn and X to be the first to know about our updates. We post regularly about the latest tech trends and share honest software development backstage stories.
