
Image by Author
llama.cpp is the original, high-performance framework that powers many popular local AI tools, including Ollama, local chatbots, and other on-device LLM solutions. By working directly with llama.cpp, you can reduce overhead, gain fine-grained control, and optimize performance for your specific hardware, making your local AI agents and applications faster and more configurable.
In this tutorial, I will guide you through building AI applications using llama.cpp, a powerful C/C++ library for running large language models (LLMs) efficiently. We will cover setting up a llama.cpp server, integrating it with LangChain, and building a ReAct agent capable of using tools like web search and a Python REPL.
1. Setting Up the llama.cpp Server
This section covers installing llama.cpp and its dependencies, configuring it for CUDA support, building the necessary binaries, and running the server.
Note: we are using an NVIDIA RTX 4090 graphics card running on a Linux operating system with the CUDA toolkit pre-configured. If you don't have access to similar local hardware, you can rent GPU instances from Vast.ai at a lower price.

Screenshot from Vast.ai | Console
- Update your system's package list and install essential tools like build-essential, cmake, curl, and git. pciutils is included for hardware information, and libcurl4-openssl-dev is required for llama.cpp to download models from Hugging Face.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git -y
- Clone the official llama.cpp repository from GitHub and use cmake to configure the build with CUDA support.
# Clone the llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp
# Configure the build with CUDA support
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON
- Compile llama.cpp and all its tools, including the server. For convenience, copy all the compiled binaries from the llama.cpp/build/bin/ directory to the main llama.cpp/ directory.
# Build all necessary binaries, including the server
cmake --build llama.cpp/build --config Release -j --clean-first
# Copy all binaries to the main directory
cp llama.cpp/build/bin/* llama.cpp/
- Start the llama.cpp server with the unsloth/gemma-3-4b-it-GGUF model.
./llama.cpp/llama-server \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
    --host 0.0.0.0 \
    --port 8000 \
    --n-gpu-layers 999 \
    --ctx-size 8192 \
    --threads $(nproc) \
    --temp 0.6 \
    --cache-type-k q4_0 \
    --jinja
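The model weights are downloaded and loaded into VRAM at startup, so the server may take a moment before it can accept requests. A small Python helper like the one below can poll the server until it is ready; it assumes a recent llama.cpp build, which exposes a GET /health endpoint on the same port, and that the requests package is installed.
# Poll the llama.cpp server until the model has finished loading.
# Assumes a recent build that serves GET /health (returns 200 when ready).
import time
import requests

while True:
    try:
        if requests.get("http://localhost:8000/health", timeout=2).status_code == 200:
            print("llama.cpp server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)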
- You can test whether the server is running correctly by sending a POST request using curl.
(main) root@C.20841134:/workspace$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 150,
        "temperature": 0.7
    }'
Output:
{"decisions":({"finish_reason":"size","index":0,"message":{"function":"assistant","content material":"nOkay, person greeted me with a easy "Good day! How are you right this moment?" nnHmm, this looks as if an off-the-cuff opening. The person may be testing the waters to see if I reply naturally, or possibly they genuinely need to understand how an AI assistant conceptualizes "being" however in a pleasant means. nnI discover they used an exclamation mark, which feels heat and presumably playful. Possibly they're in a superb temper or simply attempting to make dialog really feel much less robotic. nnSince I haven't got feelings, I ought to make clear that lightly however nonetheless maintain it heat. The response ought to acknowledge their greeting whereas explaining my nature as an AI. nnI marvel in the event that they're asking as a result of they're interested in AI consciousness, or simply being well mannered"}}),"created":1749319250,"mannequin":"gpt-3.5-turbo","system_fingerprint":"b5605-5787b5da","object":"chat.completion","utilization":{"completion_tokens":150,"prompt_tokens":9,"total_tokens":159},"id":"chatcmpl-jNfif9mcYydO2c6nK0BYkrtpNXSnseV1","timings":{"prompt_n":9,"prompt_ms":65.502,"prompt_per_token_ms":7.278,"prompt_per_second":137.40038472107722,"predicted_n":150,"predicted_ms":1207.908,"predicted_per_token_ms":8.052719999999999,"predicted_per_second":124.1816429728092}}
2. Building an AI Agent with LangGraph and llama.cpp
Now, let's use LangGraph and LangChain to interact with the llama.cpp server and build a multi-tool AI agent.
- Set your Tavily API key for search capabilities.
- For LangChain to work with the local llama.cpp server (which emulates the OpenAI API), you can set OPENAI_API_KEY to local or any non-empty string, since the base_url will direct requests locally.
export TAVILY_API_KEY="your_api_key_here"
export OPENAI_API_KEY=local
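If you are working in a Jupyter notebook rather than a shell, the same variables can be set from Python (a small sketch; replace the placeholder with your own Tavily key):
# Set the environment variables from Python instead of the shell.
import os

os.environ["TAVILY_API_KEY"] = "your_api_key_here"
os.environ["OPENAI_API_KEY"] = "local"  # any non-empty string works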
- Install the necessary Python libraries: langgraph for creating agents, tavily-python for the Tavily search tool, and various langchain packages for LLM interactions and tools.
%%capture
!pip install -U \
    langgraph tavily-python langchain langchain-community langchain-experimental langchain-openai
- Configure ChatOpenAI from LangChain to communicate with your local llama.cpp server.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="unsloth/gemma-3-4b-it-GGUF:Q4_K_XL",
    temperature=0.6,
    base_url="http://localhost:8000/v1",
)
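As an optional sanity check, you can invoke the model once before wiring up the agent; this assumes the server from section 1 is still running on port 8000:
# Verify that LangChain can reach the llama.cpp server.
print(llm.invoke("Say hello in one short sentence.").content)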
- Set up the tools that your agent will be able to use.
- TavilySearchResults: allows the agent to search the web.
- PythonREPLTool: provides the agent with a Python Read-Eval-Print Loop to execute code.
from langchain_community.tools import TavilySearchResults
from langchain_experimental.tools.python.tool import PythonREPLTool

search_tool = TavilySearchResults(max_results=5, include_answer=True)
code_tool = PythonREPLTool()
tools = [search_tool, code_tool]
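You can also exercise each tool on its own before handing it to the agent; this is a small sketch assuming the TAVILY_API_KEY set earlier is valid:
# Try each tool in isolation to confirm it works.
print(search_tool.invoke("latest llama.cpp release"))   # list of search result dicts
print(code_tool.invoke("print(sum(range(10)))"))        # prints 45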
- Use LangGraph's prebuilt create_react_agent function to create an agent that can reason and act (ReAct framework) using the LLM and the defined tools.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=tools,
)
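If you want to watch the agent's intermediate tool calls as they happen, LangGraph agents can also be streamed step by step (a minimal sketch; the question is just an example):
# Stream the agent's intermediate steps instead of waiting for the final answer.
inputs = {"messages": [{"role": "user", "content": "What is 2 ** 16? Use Python."}]}
for step in agent.stream(inputs, stream_mode="values"):
    step["messages"][-1].pretty_print()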
3. Test the AI Agent with Example Queries
Now, we will test the AI agent and also show which tools the agent uses.
- This helper function extracts the names of the tools used by the agent from the conversation history. This is useful for understanding the agent's decision-making process.
def extract_tool_names(conversation: dict) -> list[str]:
    tool_names = set()
    for msg in conversation.get('messages', []):
        calls = []
        if hasattr(msg, 'tool_calls'):
            # LangChain message objects expose tool calls directly
            calls = msg.tool_calls or []
        elif isinstance(msg, dict):
            calls = msg.get('tool_calls') or []
            if not calls and isinstance(msg.get('additional_kwargs'), dict):
                calls = msg['additional_kwargs'].get('tool_calls', [])
        else:
            ak = getattr(msg, 'additional_kwargs', None)
            if isinstance(ak, dict):
                calls = ak.get('tool_calls', [])
        for call in calls:
            if isinstance(call, dict):
                if 'name' in call:
                    tool_names.add(call['name'])
                elif 'function' in call and isinstance(call['function'], dict):
                    fn = call['function']
                    if 'name' in fn:
                        tool_names.add(fn['name'])
    return sorted(tool_names)
- Define a function that runs the agent with a given question and returns the tools used along with the final answer.
def run_agent(question: str):
    result = agent.invoke({"messages": [{"role": "user", "content": question}]})
    raw_answer = result["messages"][-1].content
    tools_used = extract_tool_names(result)
    return tools_used, raw_answer
- Let's ask the agent for the top 5 breaking news stories. It should use the tavily_search_results_json tool.
tools, answer = run_agent("What are the top 5 breaking news stories?")
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['tavily_search_results_json']
Here are the top 5 breaking news stories based on the provided sources:
1. **Gaza Humanitarian Crisis:** Ongoing conflict and challenges in Gaza, including the Eid al-Adha holiday, and the retrieval of a Thai hostage's body.
2. **Russian Drone Attacks on Kharkiv:** Russia continues to target Ukrainian cities with drone and missile strikes.
3. **Wagner Group Departure from Mali:** The Wagner Group is leaving Mali after heavy losses, but Russia's Africa Corps remains.
4. **Trump-Musk Feud:** A dispute between former President Trump and Elon Musk could have implications for Tesla stock and the U.S. space program.
5. **Education Department Staffing Cuts:** The Biden administration is seeking Supreme Court intervention to block planned staffing cuts at the Education Department.
- Let's ask the agent to write and execute Python code for the Fibonacci sequence. It should use the Python_REPL tool.
tools, answer = run_agent(
    "Write a code for the Fibonacci sequence and execute it using Python REPL."
)
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['Python_REPL']
The Fibonacci sequence up to 10 terms is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
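The agent's final answer summarizes the result, but the code it generated isn't echoed above. Something like the following is representative of what gets sent to the Python_REPL tool (an illustration, not the agent's exact output):
# Representative of the code the agent might execute in the Python REPL.
def fibonacci(n: int) -> list[int]:
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]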
Final Thoughts
In this guide, I've used a small quantized LLM, which sometimes struggles with accuracy, especially when it comes to selecting tools. If your goal is to build production-ready AI agents, I highly recommend running the latest, full-sized models with llama.cpp. Larger and more recent models generally provide better results and more reliable outputs.
It's important to note that setting up llama.cpp can be more challenging compared to user-friendly tools like Ollama. However, if you are willing to invest the time to debug, optimize, and tailor llama.cpp to your specific hardware, the performance gains and flexibility are well worth it.
One of the biggest advantages of llama.cpp is its efficiency: you don't need high-end hardware to get started. It runs well on regular CPUs and laptops without dedicated GPUs, making local AI accessible to almost everyone. And if you ever need more power, you can always rent an affordable GPU instance from a cloud provider.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.