Introduction
Vision-Language Models (VLMs) are quickly becoming the core of many generative AI applications, from multimodal chatbots and agentic systems to automated content analysis tools. As open-source models mature, they offer promising alternatives to proprietary systems, enabling developers and enterprises to build cost-effective, scalable, and customizable AI solutions.
However, the growing number of VLMs presents a common dilemma: how do you choose the right model for your use case? It is usually a balancing act between output quality, latency, throughput, context length, and infrastructure cost.
This blog aims to simplify the decision-making process by providing detailed benchmarks and model descriptions for three leading open-source VLMs: Gemma-3-4B, MiniCPM-o 2.6, and Qwen2.5-VL-7B-Instruct. All benchmarks were run using Clarifai’s Compute Orchestration, our own inference engine, to ensure consistent conditions and reliable comparisons across models.
Before diving into the results, here’s a quick breakdown of the key metrics used in the benchmarks, with a minimal measurement sketch after the list. All results were generated using Clarifai’s Compute Orchestration on NVIDIA L40S GPUs, with input tokens set to 500 and output tokens set to 150.
- Latency per Token: The time it takes to generate each output token. Lower latency means faster responses, which is especially important for chat-like experiences.
- Time to First Token (TTFT): How quickly the model generates the first token after receiving the input. It drives perceived responsiveness in streaming generation tasks.
- End-to-End Throughput: The number of tokens the model can generate per second for a single request, accounting for the full request processing time. Higher end-to-end throughput means the model generates output efficiently while keeping latency low.
- Overall Throughput: The total number of tokens generated per second across all concurrent requests. This reflects the model’s ability to scale and maintain performance under load.
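To make these definitions concrete, here is a minimal sketch (not Clarifai’s benchmarking code) of how the per-request metrics can be derived from a streaming generation call; `generate_stream` is a hypothetical client function assumed to yield output tokens one at a time.

```python
import time

def benchmark_request(generate_stream, prompt):
    """Time one streaming request and derive the per-request metrics above.

    `generate_stream` is a hypothetical client call assumed to yield
    output tokens one at a time.
    """
    start = time.perf_counter()
    token_times = []
    for _token in generate_stream(prompt):
        token_times.append(time.perf_counter())

    end = token_times[-1]
    n_tokens = len(token_times)

    ttft = token_times[0] - start                 # Time to First Token
    latency_per_token = (end - start) / n_tokens  # seconds per output token
    e2e_throughput = n_tokens / (end - start)     # tokens/sec for this request
    return ttft, latency_per_token, e2e_throughput

# Overall throughput is an aggregate: total output tokens across all
# concurrent requests divided by the wall-clock time of the whole run.
```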
Now, let’s dive into the details of each model, starting with Gemma-3-4B.
Gemma-3-4B
Gemma-3-4B, part of Google’s latest Gemma 3 family of open multimodal models, is designed to handle both text and image inputs, producing coherent and contextually rich text responses. With support for up to 128K context tokens, 140+ languages, and tasks like text generation, image understanding, reasoning, and summarization, it’s built for production-grade applications across diverse use cases.
Benchmark Summary: Performance on L40S GPU
Gemma-3-4B shows strong performance across both text and image tasks, with consistent behavior under varying concurrency levels. All benchmarks were run using Clarifai’s Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. Gemma-3-4B is optimized for low-latency text processing and handles image inputs up to 512px with stable throughput across concurrency levels.
Text-Only Performance Highlights:
- Latency per token: 0.022 sec (1 concurrent request)
- Time to First Token (TTFT): 0.135 sec
- End-to-end throughput: 202.25 tokens/sec
- Requests per minute (RPM): Up to 329.90 at 32 concurrent requests
- Overall throughput: 942.57 tokens/sec at 32 concurrency
Multimodal (Image + Text) Performance (Overall Throughput):
- 256px images: 718.63 tokens/sec, 252.16 RPM at 32 concurrency
- 512px images: 688.21 tokens/sec, 242.04 RPM
Scales with Concurrency (End-to-End Throughput):
End-to-end throughput was also measured at 2, 8, 16, and 32 concurrent requests; the 32-request figures are included in the summary table at the end of this post.
Overall Insight:
Gemma-3-4B provides fast and reliable performance for text-heavy and structured vision-language tasks. For large image inputs (512px), performance remains stable, but you may need to scale compute resources to maintain low latency and high throughput.
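For reference, a typical image + text request of the kind measured above might look like the sketch below, using any OpenAI-compatible client. The endpoint URL, API key, model id, and image URL are placeholders, not Clarifai-specific values.

```python
from openai import OpenAI

# Placeholder endpoint, key, and model id; substitute your own deployment details.
client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="gemma-3-4b-it",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is shown in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-512px.png"}},
        ],
    }],
    max_tokens=150,  # matches the 150-token output size used in these benchmarks
)
print(response.choices[0].message.content)
```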
If you’re evaluating GPU performance for serving this model, we’ve published a separate comparison of A10 vs. L40S to help you choose the best hardware for your needs.
MiniCPM-o 2.6
MiniCPM-o 2.6 represents a major leap in end-side multimodal LLMs. It expands input modalities to images, video, audio, and text, offering real-time speech conversation and multimodal streaming support.
With an architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model has a total of 8 billion parameters. MiniCPM-o 2.6 demonstrates significant improvements over its predecessor, MiniCPM-V 2.6, and introduces real-time speech conversation, multimodal live streaming, and improved efficiency in token processing.
Benchmark Summary: Performance on L40S GPU
All benchmarks were run using Clarifai’s Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. MiniCPM-o 2.6 performs exceptionally well across both text and image workloads, scaling smoothly across concurrency levels. Shared vLLM serving provides significant gains in overall throughput while maintaining low latency.
Text-Only Performance Highlights:
- Latency per token: 0.022 sec (1 concurrent request)
- Time to First Token (TTFT): 0.087 sec
- End-to-end throughput: 213.23 tokens/sec
- Requests per minute (RPM): Up to 362.83 at 32 concurrent requests
- Overall throughput: 1075.28 tokens/sec at 32 concurrency
Multimodal (Image + Text) Performance (Overall Throughput):
- 256px images: 1039.60 tokens/sec, 353.19 RPM at 32 concurrency
- 512px images: 957.37 tokens/sec, 324.66 RPM
Scales with Concurrency (End-to-End Throughput):
End-to-end throughput was also measured at 2, 8, 16, and 32 concurrent requests; the 32-request figures are included in the summary table at the end of this post.
Overall Insight:
MiniCPM-o 2.6 performs reliably across a range of tasks and input sizes. It maintains low latency, scales nearly linearly with concurrency, and stays performant even with 512px image inputs. This makes it a solid choice for real-time applications running on modern GPUs like the L40S. These results reflect performance on that specific hardware configuration and may vary depending on the environment or GPU tier.
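If you want to sanity-check concurrency scaling on your own deployment, the sketch below approximates how overall throughput and RPM are computed, using an async OpenAI-compatible client. The endpoint, key, and model id are placeholders, and token counts are read from the usage field that most OpenAI-compatible servers report.

```python
import asyncio
import time

from openai import AsyncOpenAI

# Placeholder endpoint, key, and model id; adjust for your own deployment.
client = AsyncOpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_KEY")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="minicpm-o-2_6",  # assumed model identifier
        messages=[{"role": "user", "content": "Summarize the benefits of multimodal models."}],
        max_tokens=150,
    )
    return resp.usage.completion_tokens  # output tokens for this request

async def run(concurrency: int = 32) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"Overall throughput: {sum(tokens) / elapsed:.1f} tokens/sec")
    print(f"Requests per minute: {concurrency / elapsed * 60:.1f}")

asyncio.run(run())
```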
Qwen2.5-VL-7B-Instruct
Qwen2.5-VL is a vision-language model designed for visual recognition, reasoning, long video analysis, object localization, and structured data extraction.
Its architecture integrates window attention into the Vision Transformer (ViT), significantly improving both training and inference efficiency. Additional optimizations such as SwiGLU activation and RMSNorm further align the ViT with the Qwen2.5 LLM, improving overall performance and consistency.
Benchmark Summary: Performance on L40S GPU
Qwen2.5-VL-7B-Instruct delivers consistent performance across both text and image-based tasks. Benchmarks from Clarifai’s Compute Orchestration highlight its ability to handle multimodal inputs at scale, with strong throughput and responsiveness under varying concurrency levels.
Text-Only Performance Highlights:
- Latency per token: 0.022 sec (1 concurrent request)
- Time to First Token (TTFT): 0.089 sec
- End-to-end throughput: 205.67 tokens/sec
- Requests per minute (RPM): Up to 353.78 at 32 concurrent requests
- Overall throughput: 1017.16 tokens/sec at 32 concurrency
Multimodal (Image + Text) Performance (Overall Throughput):
- 256px images: 854.53 tokens/sec, 318.64 RPM at 32 concurrency
- 512px images: 832.28 tokens/sec, 345.98 RPM
Scales with Concurrency (End-to-End Throughput):
End-to-end throughput was also measured at 2, 8, 16, and 32 concurrent requests; the 32-request figures are included in the summary table at the end of this post.
Overall Insight:
Qwen2.5-VL-7B-Instruct is well suited for both text and multimodal tasks. While larger images introduce latency and throughput trade-offs, the model performs reliably with small to medium-sized inputs even at high concurrency. It’s a strong choice for scalable vision-language pipelines that prioritize throughput with moderate latency.
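Because Qwen2.5-VL is often used for document understanding, here is a sketch of a structured-extraction call through an OpenAI-compatible client; the endpoint, key, model id, and image URL are illustrative placeholders, and the JSON parsing guards against output that is not strict JSON.

```python
import json

from openai import OpenAI

# Placeholder endpoint, key, and model id; substitute your own deployment details.
client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="qwen2.5-vl-7b-instruct",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the invoice number, date, and total amount as a JSON object."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/invoice-scan.png"}},
        ],
    }],
    max_tokens=150,
)

# The model returns free text; fall back to the raw string if it is not valid JSON.
raw = response.choices[0].message.content
try:
    fields = json.loads(raw)
except json.JSONDecodeError:
    fields = {"raw": raw}
print(fields)
```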
Which VLM is Right for You?
Choosing the right Vision-Language Model (VLM) depends on your workload type, input modality, and concurrency requirements. All benchmarks in this report were generated on NVIDIA L40S GPUs via Clarifai’s Compute Orchestration.
These results reflect performance on enterprise-grade infrastructure. If you’re using lower-end hardware, targeting larger batch sizes, or chasing ultra-low latency, actual performance may differ, so evaluate against your specific deployment setup.
MiniCPM-o 2.6
MiniCPM-o 2.6 offers consistent performance across both text and image tasks, especially when deployed with shared vLLM serving. It scales well up to 32 concurrent requests, maintaining high throughput and low latency even with 1024px image inputs.
If your application requires stable performance under load and flexibility across modalities, MiniCPM-o 2.6 is the most well-rounded choice in this group.
Gemma-3-4B
Gemma-3-4B performs best on text-heavy workloads with occasional image input. It handles concurrency well up to 16 requests but starts to dip at 32, particularly with large images such as 2048px.
If your use case is primarily focused on fast, high-quality text generation with small to medium image inputs, Gemma-3-4B delivers strong performance without requiring high-end scaling.
Qwen2.5-VL-7B-Instruct
Qwen2.5-VL is optimized for structured vision-language tasks such as document parsing, OCR, and multimodal reasoning, making it a strong choice for applications that require precise visual and textual understanding.
If your priority is accurate visual reasoning and multimodal understanding, Qwen2.5-VL is a strong fit, especially when output quality matters more than peak throughput.
To help you compare at a glance, here’s a summary of the key performance metrics for all three models at 32 concurrent requests across text and image inputs.
Vision-Language Model Benchmark Summary (32 Concurrent Requests, L40S GPU)
| Metric | Model | Text Only | 256px Image | 512px Image |
|---|---|---|---|---|
| Latency per Token (sec) | Gemma-3-4B | 0.027 | 0.036 | 0.037 |
| | MiniCPM-o 2.6 | 0.024 | 0.026 | 0.028 |
| | Qwen2.5-VL-7B-Instruct | 0.025 | 0.032 | 0.032 |
| Time to First Token (sec) | Gemma-3-4B | 0.236 | 1.034 | 1.164 |
| | MiniCPM-o 2.6 | 0.120 | 0.347 | 0.786 |
| | Qwen2.5-VL-7B-Instruct | 0.121 | 0.364 | 0.341 |
| End-to-End Throughput (tokens/sec) | Gemma-3-4B | 168.45 | 124.56 | 120.01 |
| | MiniCPM-o 2.6 | 188.86 | 176.29 | 160.14 |
| | Qwen2.5-VL-7B-Instruct | 186.91 | 179.69 | 191.94 |
| Overall Throughput (tokens/sec) | Gemma-3-4B | 942.58 | 718.63 | 688.21 |
| | MiniCPM-o 2.6 | 1075.28 | 1039.60 | 957.37 |
| | Qwen2.5-VL-7B-Instruct | 1017.16 | 854.53 | 832.28 |
| Requests per Minute (RPM) | Gemma-3-4B | 329.90 | 252.16 | 242.04 |
| | MiniCPM-o 2.6 | 362.84 | 353.19 | 324.66 |
| | Qwen2.5-VL-7B-Instruct | 353.78 | 318.64 | 345.98 |
Note: These benchmarks were run on L40S GPUs. Results may vary depending on GPU class (such as A100 or H100), CPU limitations, or runtime configuration, including batching, quantization, or model variants.
Conclusion
We have walked through benchmarks for MiniCPM-o 2.6, Gemma-3-4B, and Qwen2.5-VL-7B-Instruct, covering latency, throughput, and scalability under different concurrency levels and image sizes. Each model performs differently depending on the task and workload requirements.
If you want to try these models, we have launched a new AI Playground where you can explore them directly. We’ll keep adding the latest models to the platform, so keep an eye on our updates and join our Discord community for the latest announcements.
If you are looking to deploy these open-source VLMs on your own dedicated compute, our platform supports production-grade inference and scalable deployments. You can quickly get started by setting up your own node pool and running inference efficiently. Check out the tutorial below to get started.