The State of AI LLMs in Q2 2024

The artificial intelligence revolution is in full swing, with large language models (LLMs) leading the charge and creating new, transformative possibilities for startups and enterprises. As we navigate Q2 of 2024, the LLM landscape is vast and rapidly evolving, offering a wide array of models tailored to specific use cases, budgets and integration needs. From industry giants to emerging players and open-source options, selecting the right LLM can be a game-changer for automating processes, enhancing customer experiences, or opening up new business opportunities. This comprehensive guide explores the capabilities, strengths, and limitations of leading models, hosting options, and key considerations for making an informed decision. Let’s explore what’s on offer.

Evaluating the Top LLM Contenders

OpenAI’s GPT-4: The Powerhouse for Prompt Engineering and AI Assistants

GPT-4 by OpenAI remains a powerhouse that excels in prompt engineering, role-playing scenarios, and specialised assistance tasks. Its ability to follow instructions via system prompts and provide structured output (JSON, XML) makes it a compelling choice for AI Assistant scenarios and AI-assisted workflow automations. However, its predecessor GPT-3.5, is becoming less attractive due to its pricing and output quality compared to newer models.

Anthropic’s Claude 3 Series: Balancing Quality and Cost-Effectiveness

Anthropic’s Claude 3 series has emerged as a strong contender. The Opus model delivers exceptional, human-like content generation, but it comes with a premium price tag (avg €41.3 per 1M tokens vs €19 for GPT-4 Turbo). From our experience, the Claude 3 Sonnet model offers a sweet spot – similar quality to GPT-4 at less than half the cost. Sonnet excels in content generation, making its way through follow-up messages, and can provide structured XML output when instructed properly. It’s worth noting that Sonnet only allows one system prompt and requires a strict sequence of human and assistant messages, creating some challenges when used in AI-workflows.

If cost is a concern but quality is still of great importance, the Claude 3 Haiku model is a standout. It delivers impressive human-like content at an amazing price point and lightning-fast speeds. While it may not match Sonnet’s level in every scenario, Haiku shines in process automation tasks like summarization, key point extraction and triaging notes. Its 200k context window is a game-changer, and it works best when provided with enough examples of input and output, a kind of “on-demand fine-tuning” that allows much better results.

Google’s Gemini Models: Massive Context Windows enabling new Use Cases

Google’s Gemini models are worth considering too. The Gemini 1.0 Pro offers GPT-3.5-level quality with the added benefit of Google service integration. But the real star is the Gemini 1.5 Pro and Gemini 1.5 Flash, with its massive 1M (soon 2M) context window, enabling new use cases like retrieval-augmented generation (RAG) and style transfer using examples. Gemini models have recently added the option to provide system instructions (similar to a system prompt), enabling more use cases and user-facing applications. Early experiments show it is not as easy to shape into a persona as GPT-4 or Claude 3, but it may change as we learn more how to prompt it correctly.

Cohere’s Command Model: Grounded and Sourced Outputs for Factual Accuracy

For those seeking grounded and sourced outputs, Cohere’s Command model, trained specifically for retrieval-augmented generation (RAG) use cases, provides that out of the box. This model is particularly useful for scenarios where factual accuracy and transparency are paramount, as it can cite the sources it draws upon. While open-source versions of Command are available for local inference, they have minimum hardware requirements, which make the hosted options a better fit for most scenarios.

It’s also worth noting that Cohere models are currently the only LLMs with straightforward API access and hosting that have native multi-language support. While GPT-4 and Claude 3 can speak many languages, the models were built with English in mind, so queries in languages that use different alphabets can cost 3 to 4 times more than the English equivalent. In contrast, Cohere Command only costs ~1.5 times more in the same languages.

Open-Source Models for Local Inference: Llama 3 and Fine-Tuned Variants

In the realm of open-source models for local inference, Llama 3 has emerged as a standout choice for private RAG tasks, such as reading over notes, summarising, and answering questions. As it is open-source, there are already fine-tuned versions offering up to a 4M context window. This large context window allows the model to process and understand more information, making it well suited for tasks that require in-depth analysis or synthesis of multiple sources.

The Quest for On-Device Inference: Balancing Size, Speed, and Quality

As the demand for on-device inference grows, several models are looking to strike the perfect balance between size, speed, quality, and low hardware requirements. Models like Phi-3-mini (3.8B), Gemma (2B), Qwen (1.8B and 0.5B), StableLM (1.6B), and Gemini Nano (1.8B and 3.25B) are all contenders in this space. While achieving this balance is no easy feat, the sweet spot seems to be around the 2B token range, as models of this size can run smoothly on devices with as little as 4GB of RAM.

It’s worth noting that while smaller models may be more resource-efficient, they often sacrifice language understanding and knowledge capabilities. Striking the right balance between model size and performance is an ongoing challenge, and organisations must carefully weigh their specific needs and constraints when selecting a model for on-device inference.

Hosting Options for LLMs

Major Cloud Providers: Azure AI, AWS Bedrock, and Google Cloud Vertex AI

When it comes to hosting LLMs, organisations have a wide range of options from major cloud providers and specialty platforms. Besides the original model providers (OpenAI, Anthropic, and Mistral) which offer shared infrastructure hosting, it is now very easy to get a private instance of a model to ensure control over your data.

Azure AI offers private access to OpenAI models such as GPT-3.5, GPT-4, Whisper, and DALL·E, as well as open-source models: Llama (v2 and v3), Mistral, and Cohere, all with token-based pricing. Additionally, Azure AI provides other models with provisioned hosting on a time-based pricing model. 

AWS Bedrock is another popular choice, offering private instances of Anthropic’s Claude models (Opus, Sonnet, Haiku), along with open-source options like Cohere, Mistral, and Llama, all with token-based pricing.

Google Cloud Vertex AI is also in the picture, with access to Google’s Gemini models (Gemini 1.0 Pro and Gemini 1.5 Pro), Anthropic’s Claude 3 series (Opus, Sonnet, Haiku), and other models with provisioned, time-based hosting.

Specialty Providers: Groq and

In addition to the major cloud providers, specialty providers like Groq ( and ( have emerged to cater to specific needs. Groq, for instance, focuses on delivering high-speed, low-latency AI inference, which can be particularly valuable for time-sensitive or compute-intensive applications., on the other hand, specialises in hosting open-source models with token-based pricing, making it an attractive option for organisations looking to leverage the power of open-source LLMs while benefiting from a managed hosting solution.

As the LLM landscape evolves super fast, startups and enterprises must carefully evaluate models to ensure innovation, streamline processes, and gain a competitive advantage. However, navigating AI’s complexities can be daunting without dedicated expertise.