Introduction to Groq
A fresh face has emerged in the AI hardware arena: Groq, ready to shake up the status quo with its innovative AI chip, the Language Processing Unit (LPU). Boasting inference speeds of up to a staggering 500 tokens per second, this new player aims to redefine performance standards in AI.
The brains behind Groq are seasoned veterans of the AI/machine learning field. Notably, the company is led by Jonathan Ross, who co-founded the Tensor Processing Unit (TPU) project at Google, a cornerstone of machine learning acceleration.
Identifying the Problem
But what problem is Groq trying to solve? Offerings like OpenAI’s GPT models, Google Gemini, and Meta’s Llama series have set high standards for inference capabilities, focusing primarily on accuracy, problem-solving, and reasoning. Yet one crucial piece has been missing from the puzzle: the speed of inference. Groq steps in with its groundbreaking chip architecture to address this very issue.
Don’t just take my word for it; Groq has unveiled a cloud service coupled with AI tools for a hands-on experience, including a free trial that allows us to explore its capabilities. That is what we will do today, so let’s take a look at Groq’s ecosystem.
Groq’s Ecosystem
Groq recently debuted a suite of Python & JavaScript libraries that simplify integration with its services, mirroring OpenAI’s API definition so you can start testing within existing LLM apps with minimal changes.
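Because the API surface mirrors OpenAI’s, existing OpenAI SDK code can often be pointed at Groq with nothing more than a different base URL and API key. Here’s a minimal sketch; the base URL is the OpenAI-compatible endpoint Groq documents at the time of writing, and it assumes you already have a GROQ_API_KEY (we set one up below):

"""Sketch: reusing existing OpenAI SDK code against Groq."""
import os

from openai import OpenAI

# Point the standard OpenAI client at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)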
Integration tools like Langchain & LlamaIndex are the icing on the cake, providing a smooth transition for building and integrating LLM applications with RAG capabilities.
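If you build with LangChain, for example, the dedicated integration package keeps things equally terse. A minimal sketch, assuming pip install langchain-groq and a GROQ_API_KEY in your environment (parameter names can shift between versions, so check the package docs):

"""Sketch: calling Groq through LangChain's langchain-groq integration."""
from langchain_groq import ChatGroq

# ChatGroq reads GROQ_API_KEY from the environment by default.
llm = ChatGroq(model="mixtral-8x7b-32768", temperature=0)

message = llm.invoke("In one sentence, what is a Language Processing Unit?")
print(message.content)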
Diving In: Setting Up
Time to roll up our sleeves. We’ll kick off by setting up our environment. I’m partial to venv for these smaller tutorials, though feel free to opt for Conda or Poetry. Here’s how to get started:
python -m venv venv
source ./venv/bin/activate  # macOS/Linux
.\venv\Scripts\activate     # Windows
Now let’s install the dependencies needed to run our experiments.
pip install groq openai python-dotenv
API Keys and Secret Management
For this example we will need two API keys. If you haven’t done so already, sign up for OpenAI and Groq accounts.
Once your accounts are created, you can generate API keys from the following links:
- Groq – https://console.groq.com/keys
- OpenAI – https://platform.openai.com/api-keys
I advocate for using python-dotenv to manage secrets within a .env file, a safeguard against inadvertent leaks. Go ahead and create your .env file at the root of your project and add two entries: GROQ_API_KEY and OPENAI_API_KEY.
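For reference, the .env file is just two key/value lines (replace the placeholder values with your own keys):

GROQ_API_KEY=your-groq-api-key
OPENAI_API_KEY=your-openai-api-key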
The Litmus Test: Basic Inference
Before I do a side-by-side comparison, let’s engage with a light-hearted example. Here’s how you can prompt the model to tell us a joke:
"""Quick Prompt Module
"""
import os
import time
from groq import Groq
from dotenv import load_dotenv
load_dotenv()
client = Groq(
api_key=os.environ.get("GROQ_API_KEY")
)
def main():
prompt = "Tell me a joke"
start_time = time.time()
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": f"{prompt}"
}
],
model="mixtral-8x7b-32768"
)
print(f"Prompt:\n {prompt}\n")
print(f"Response:\n {chat_completion.choices[0].message.content}\n")
end_time = time.time()
print(f"Total Inference Time Plus Network: {end_time-start_time} seconds")
if __name__ == "__main__":
main()
You should also have noticed that the response came back in under a second. That’s fast, but let’s move on to a more intricate challenge.
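The printed time includes network overhead, but it’s enough to produce a rough tokens-per-second figure like the ones quoted later. The response object carries OpenAI-style usage counts; assuming your SDK version populates chat_completion.usage.completion_tokens, you can append something like this at the end of main():

    # Rough throughput: completion tokens divided by wall-clock time (network included).
    # Assumes the response exposes OpenAI-style usage counts.
    completion_tokens = chat_completion.usage.completion_tokens
    print(f"~{completion_tokens / (end_time - start_time):.1f} tokens/s")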
The Face-off: Groq (Mixtral) vs OpenAI (GPT-4)
Our setup is primed to conduct a side-by-side analysis of Groq’s Mixtral against OpenAI’s GPT-4 on an identical prompt. This is where the rubber meets the road:
"""Quick Prompt Module for Groq
"""
import os
import time
from openai import OpenAI
from groq import Groq
from dotenv import load_dotenv
def main(platform, client, model):
print(f"Performing inference using {platform}")
system_prompt = """
You are a helpful building contractor with years of experience.
You will tailor your responses to someone who is proficient with DIY but not a professional contractor.
You will use layman's terms when describing construction concepts & tools
"""
prompt = """
I am planning to renovate my garage and am looking for some tips.
Step by step provide me with the list of tasks and materials I will need to complete this project
"""
start_time = time.time()
chat_completion = client.chat.completions.create(
messages=[
{
"role": "system",
"content": f"{system_prompt}"
},
{
"role": "user",
"content": f"{prompt}"
}
],
model=model,
)
print(f"Prompt:\n {prompt}\n")
print(f"Response:\n {chat_completion.choices[0].message.content}\n")
end_time = time.time()
print(f"Total Inference Time Plus Network: {end_time-start_time} seconds")
if __name__ == "__main__":
load_dotenv()
groq_client = Groq(
api_key=os.environ.get("GROQ_API_KEY")
)
main("Groq", groq_client, "mixtral-8x7b-32768")
openai_client = OpenAI(
api_key=os.environ.get("OPENAI_API_KEY")
)
main("OpenAI", openai_client, "gpt-4-0125-preview")
The Results – Is It Really That Fast?
| Platform | Model | Response Tokens | Time (seconds) | Tokens/s |
| --- | --- | --- | --- | --- |
| Groq | mixtral-8x7b-32768 | 696 | 1.887 | 368.84 |
| OpenAI | gpt-4-0125-preview | 713 | 31.167 | 22.87 |
The showdown doesn’t reveal drastic differences in response token counts, but the difference in inference speed is obvious. Groq’s performance in these tests, nearly 370 tokens/s (696 tokens in 1.887 seconds), starkly contrasts with GPT-4’s 23 tokens/s, showcasing Groq’s formidable advantage. It’s not the 500 tokens/s reported, but it’s still blazingly fast at roughly 16x GPT-4’s throughput.
Understanding Pricing
Groq’s pricing strategy caters to various needs, offering a free tier and three paid tiers influenced by model complexity and speed. The free tier is rate limited to 20 requests per minute and 25 requests per 10 minutes; both limits apply simultaneously, and whichever is reached first triggers throttling. It’s worth noting that at the time of writing Groq only supports a small number of open-source models, but offers to discuss hosting other models on request.
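If you hit those limits in a script, a simple retry with backoff goes a long way. Here’s a minimal sketch; it assumes the Groq SDK raises a groq.RateLimitError the way the OpenAI SDK does, so check the error types exposed by your SDK version:

"""Sketch: retrying chat completions when the free tier's rate limit kicks in."""
import time

import groq


def complete_with_retry(client, messages, model, max_retries=5):
    """Retry a chat completion with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(messages=messages, model=model)
        except groq.RateLimitError:  # assumed error type, mirroring the OpenAI SDK
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited, retrying in {wait}s...")
            time.sleep(wait)
    raise RuntimeError("Still rate limited after retries")

With that in place, here is the per-model speed and pricing Groq lists at the time of writing: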
| Model | Current Speed | Price per 1M Tokens (Input/Output) |
| --- | --- | --- |
| Llama 2 70B (4096 Context Length) | ~300 tokens/s | $0.70/$0.80 |
| Llama 2 7B (2048 Context Length) | ~750 tokens/s | $0.10/$0.10 |
| Mixtral 8x7B SMoE (32K Context Length) | ~480 tokens/s | $0.27/$0.27 |
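To put those per-token prices in perspective, here is a quick back-of-the-envelope calculation for a single Mixtral request, using the rates from the table above (the token counts are just example figures):

# Rough cost estimate for one Mixtral request at the table's rates.
PRICE_PER_1M_INPUT = 0.27   # USD per 1M input tokens
PRICE_PER_1M_OUTPUT = 0.27  # USD per 1M output tokens

prompt_tokens = 150       # example: a short system + user prompt
completion_tokens = 700   # example: roughly the size of the garage-renovation answer

cost = (prompt_tokens * PRICE_PER_1M_INPUT
        + completion_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000
print(f"Estimated cost: ${cost:.6f}")  # ~$0.00023

Even a fairly long response costs a fraction of a cent at these rates.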
Conclusion: A New Era for AI Inference
Groq’s entry into the market signifies a leap in addressing a long-overlooked aspect of AI performance – inference speed. Its results in my tests are nothing short of groundbreaking, marking a significant stride towards making real-time AI a tangible reality.
Feel free to tinker with the code and try your own prompts. Groq is not just knocking on the competition’s door; it’s bulldozing its way into the future of AI with a turbocharged engine and a battering ram. I can only hope that you already have some ideas for how you’ll deploy this new technology. Reach out, I’d love to hear them.