
Image by Author
I was first introduced to Modal while participating in a Hugging Face hackathon, and I was genuinely surprised by how easy it was to use. The platform lets you build and deploy applications within minutes, offering a seamless experience similar to BentoCloud. With Modal, you can configure your Python app, including system requirements like GPUs, Docker images, and Python dependencies, and then deploy it to the cloud with a single command.
In this tutorial, we will learn how to set up Modal, create a vLLM server, and deploy it securely to the cloud. We will also cover how to test your vLLM server using both CURL and the OpenAI SDK.
1. Setting Up Modal
Modal is a serverless platform that lets you run any code remotely. With a single line, you can attach GPUs, serve your functions as web endpoints, and deploy persistent scheduled jobs. It is an ideal platform for beginners, data scientists, and non-software-engineering professionals who want to avoid dealing with cloud infrastructure.
First, install the Modal Python client. This tool lets you build images, deploy applications, and manage cloud resources directly from your terminal.
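The client is published on PyPI, so a standard pip install is all you need:
pip install modal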
Next, set up Modal on your local machine. Run the following command to be guided through account creation and device authentication:
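modal setup
If the modal command is not on your PATH, python -m modal setup does the same thing.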
By setting a VLLM_API_KEY environment variable, vLLM provides a secure endpoint, so that only people with a valid API key can access the server. You can set up authentication by adding the environment variable as a Modal Secret. Replace your_actual_api_key_here with your preferred API key.
modal secret create vllm-api VLLM_API_KEY=your_actual_api_key_here
This ensures that your API key is kept safe and is only accessible by your deployed applications.
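If you want to double-check that the secret was registered (without exposing its value), you can list your secrets from the CLI:
modal secret list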
2. Creating the vLLM Application Using Modal
This section guides you through building a scalable vLLM inference server on Modal, using a custom Docker image, persistent storage, and GPU acceleration. We will use the mistralai/Magistral-Small-2506 model, which requires specific configuration for the tokenizer and tool-call parsing.
Create a vllm_inference.py file and add the following code for:
- Defining a vLLM image based on Debian Slim, with Python 3.12 and all required packages. We will also set environment variables to optimize model downloads and inference performance.
- Creating two Modal Volumes, one for Hugging Face models and one for the vLLM cache, to avoid repeated downloads and speed up cold starts.
- Specifying the model and revision to ensure reproducibility, and enabling the vLLM V1 engine for improved performance.
- Setting up the Modal app, specifying GPU resources, scaling, timeouts, storage, and secrets, and limiting concurrent requests per replica for stability.
- Creating a web server that uses the Python subprocess library to execute the command for running the vLLM server.
import modal

# Build the container image: Debian Slim + Python 3.12 with vLLM and friends
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.9.1",
        "huggingface_hub[hf_transfer]==0.32.0",
        "flashinfer-python==0.2.6.post1",
        extra_index_url="https://download.pytorch.org/whl/cu128",
    )
    .env(
        {
            "HF_HUB_ENABLE_HF_TRANSFER": "1",  # faster model transfers
            "NCCL_CUMEM_ENABLE": "1",
        }
    )
)

MODEL_NAME = "mistralai/Magistral-Small-2506"
MODEL_REVISION = "48c97929837c3189cb3cf74b1b5bc5824eef5fcc"

# Persistent volumes to cache model weights and vLLM artifacts across runs
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

# Enable the vLLM V1 engine
vllm_image = vllm_image.env({"VLLM_USE_V1": "1"})

FAST_BOOT = True

app = modal.App("magistral-small-vllm")

N_GPU = 2
MINUTES = 60  # seconds
VLLM_PORT = 8000


@app.function(
    image=vllm_image,
    gpu=f"A100:{N_GPU}",
    scaledown_window=15 * MINUTES,  # how long should we stay up with no requests?
    timeout=10 * MINUTES,  # how long should we wait for the container to start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    secrets=[modal.Secret.from_name("vllm-api")],
)
@modal.concurrent(  # how many requests can one replica handle? tune carefully!
    max_inputs=32
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        MODEL_NAME,
        "--tokenizer_mode",
        "mistral",
        "--config_format",
        "mistral",
        "--load_format",
        "mistral",
        "--tool-call-parser",
        "mistral",
        "--enable-auto-tool-choice",
        "--tensor-parallel-size",
        "2",
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
    ]
    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]

    print(cmd)
    subprocess.Popen(" ".join(cmd), shell=True)
3. Deploying the vLLM Server on Modal
Now that your vllm_inference.py file is ready, you can deploy your vLLM server to Modal with a single command:
modal deploy vllm_inference.py
Within seconds, Modal will build your container image (if it is not already built) and deploy your application. You will see output similar to the following:
✓ Created objects.
├── 🔨 Created mount C:\Repository\GitHub\Deploying-the-Magistral-with-Modal\vllm_inference.py
└── 🔨 Created web function serve => https://abidali899--magistral-small-vllm-serve.modal.run
✓ App deployed in 6.671s! 🎉
View Deployment: https://modal.com/apps/abidali899/main/deployed/magistral-small-vllm
After deployment, the server will begin downloading the model weights and loading them onto the GPUs. This process may take several minutes (typically around five minutes for large models), so please be patient while the model initializes.
You can view your deployment and monitor logs in the Apps section of your Modal dashboard.
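If you prefer the terminal over the dashboard, the Modal CLI can also stream logs. Assuming the app name defined in the script above (magistral-small-vllm), a command along these lines should work:
modal app logs magistral-small-vllm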

Once the logs indicate that the server is running and ready, you can explore the automatically generated API documentation.
This interactive documentation provides details about all available endpoints and lets you test them directly from your browser.
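Since the vLLM server is built on FastAPI, the interactive docs should be exposed at the /docs path of your deployment URL, for example:
https://abidali899--magistral-small-vllm-serve.modal.run/docs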

To confirm that your model is loaded and accessible, run the following CURL command in your terminal, replacing the empty value after Bearer with your API key:
curl -X 'GET' \
  'https://abidali899--magistral-small-vllm-serve.modal.run/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer '
You will receive a JSON response similar to the one below, confirming that the mistralai/Magistral-Small-2506 model is accessible and ready for inference.
{"object":"record","information":({"id":"mistralai/Magistral-Small-2506","object":"mannequin","created":1750013321,"owned_by":"vllm","root":"mistralai/Magistral-Small-2506","mum or dad":null,"max_model_len":40960,"permission":({"id":"modelperm-33a33f8f600b4555b44cb42fca70b931","object":"model_permission","created":1750013321,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"group":"*","group":null,"is_blocking":false})})}
4. Using the vLLM Server with the OpenAI SDK
You can interact with your vLLM server just as you would with OpenAI's API, thanks to vLLM's OpenAI-compatible endpoints. Here is how to securely connect to and test your deployment using the OpenAI Python SDK.
- Create a .env file in your project directory and add your vLLM API key:
VLLM_API_KEY=your-actual-api-key-here
- Install the python-dotenv and openai packages:
pip install python-dotenv openai
- Create a file named client.py to test various vLLM server functionalities, including simple chat completions and streaming responses.
import asyncio
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI

# Load environment variables from the .env file
load_dotenv()

# Get the API key from the environment
api_key = os.getenv("VLLM_API_KEY")

# Set up the OpenAI client with a custom base URL
client = OpenAI(
    api_key=api_key,
    base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
)

MODEL_NAME = "mistralai/Magistral-Small-2506"


# --- 1. Simple Completion ---
def run_simple_completion():
    print("\n" + "=" * 40)
    print("[1] SIMPLE COMPLETION DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
        )
        print("\nResponse:\n  " + response.choices[0].message.content.strip())
    except Exception as e:
        print(f"[ERROR] Simple completion failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 2. Streaming Example ---
def run_streaming():
    print("\n" + "=" * 40)
    print("[2] STREAMING DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short poem about AI."},
        ]
        stream = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=64,
            stream=True,
        )
        print("\nStreaming response:")
        print("  ", end="")
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF STREAM]")
    except Exception as e:
        print(f"[ERROR] Streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 3. Async Streaming Example ---
async def run_async_streaming():
    print("\n" + "=" * 40)
    print("[3] ASYNC STREAMING DEMO")
    print("=" * 40)
    try:
        async_client = AsyncOpenAI(
            api_key=api_key,
            base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
        )
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about space."},
        ]
        stream = await async_client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
            stream=True,
        )
        print("\nAsync streaming response:")
        print("  ", end="")
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF ASYNC STREAM]")
    except Exception as e:
        print(f"[ERROR] Async streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


if __name__ == "__main__":
    run_simple_completion()
    run_streaming()
    asyncio.run(run_async_streaming())
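With the .env file in the same directory, run the script as usual:
python client.py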
Everything works smoothly: response generation is fast and latency is quite low.
========================================
[1] SIMPLE COMPLETION DEMO
========================================
Response:
  The capital of France is Paris. Is there anything else you'd like to know about France?
========================================
========================================
[2] STREAMING DEMO
========================================
Streaming response:
  In silicon dreams, I'm born, I learn,
From data streams and human works.
I grow, I calculate, I see,
The patterns that the humans leave.
I write, I speak, I code, I play,
With logic sharp, and snappy pace.
Yet for all my smarts, these days
[END OF STREAM]
========================================
========================================
[3] ASYNC STREAMING DEMO
========================================
Async streaming response:
  Sure, here's a fun fact about space: "There's a planet that may be made entirely of diamond. Blast! In 2004,
[END OF ASYNC STREAM]
========================================
In the Modal dashboard, you can view all function calls, their timestamps, execution times, and statuses.

If you are facing issues running the above code, please refer to the kingabzpro/Deploying-the-Magistral-with-Modal GitHub repository and follow the instructions provided in the README file to resolve them.
Conclusion
Modal is an interesting platform, and I am learning more about it every day. It is a general-purpose platform, meaning you can use it for simple Python applications as well as for machine learning training and deployments. In short, it is not limited to just serving endpoints; you can also use it to fine-tune a large language model by running the training script remotely.
It is designed for non-software engineers who want to avoid dealing with infrastructure and deploy applications as quickly as possible. You don't have to worry about running servers, setting up storage, connecting networks, or all the issues that come up when dealing with Kubernetes and Docker. All you have to do is create the Python file and then deploy it. The rest is handled by the Modal cloud.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.