Qwen/Qwen3.6-35B-A3B-FP8

vllm

vLLM serving Qwen3.6-35B-A3B-FP8 with MTP speculative decoding on a single GB10

Quick Info

MODEL: Qwen/Qwen3.6-35B-A3B-FP8
RUNTIME: vllm
TENSOR PARALLEL: 1
NODES: 1
AUTHOR: Seth Hobson

Recipe YAML

description: vLLM serving Qwen3.6-35B-A3B-FP8 with MTP speculative decoding on a single GB10
model: Qwen/Qwen3.6-35B-A3B-FP8
container: vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7069
  max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 20
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: '1'
  CUDA_MANAGED_FORCE_DEVICE_ALLOC: '1'
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
  OMP_NUM_THREADS: '4'
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name Qwen/Qwen3.6-35B-A3B-FP8 qwen3.6-35b-a3b \
    --tensor-parallel-size {tensor_parallel} \
    --port {port} --host {host} \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --attention-backend flashinfer \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{{"method":"mtp","num_speculative_tokens":1}}' \
    --trust-remote-code
recipe_version: '1'
name: Qwen3.6-35B-A3B-FP8-MTP
cluster_only: false
solo_only: true
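
Once the server is up, it exposes vLLM's OpenAI-compatible API. Below is a minimal smoke test with curl, assuming the defaults above (port 8000 on localhost) and the short served-model alias qwen3.6-35b-a3b; the prompt and sampling parameters are just placeholders:

# Chat completion against the OpenAI-compatible endpoint
# (port and model alias come from the recipe defaults above)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b",
    "messages": [
      {"role": "user", "content": "In one sentence, what does speculative decoding do?"}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'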

Benchmarks

Prompt Processing (PP) — Tokens/sec vs Concurrency

Text Generation (TG) — Tokens/sec vs Concurrency

Each line represents a different depth/variant. The X-axis shows how many requests are running at the same time (concurrency), and higher on the Y-axis means faster.

More concurrent requests usually mean higher total throughput but slower individual responses.

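To reproduce a throughput-vs-concurrency sweep like the charts above, one option is vLLM's bundled serving benchmark client. This is a sketch only: the script path, flag names, and the input/output lengths are assumptions and may differ between vLLM releases, so verify them against the benchmarks directory shipped in the container.

# Sweep concurrency levels against the running server and record tokens/sec
# (flag names assumed; check your vLLM version's benchmark_serving.py --help)
for c in 1 2 4 8 16 32; do
  python benchmarks/benchmark_serving.py \
    --backend vllm \
    --base-url http://localhost:8000 \
    --model Qwen/Qwen3.6-35B-A3B-FP8 \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 256 \
    --num-prompts $((c * 8)) \
    --max-concurrency "$c"
done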