Qwen/Qwen3.6-35B-A3B-FP8
vLLM serving Qwen3.6-35B-A3B-FP8 with MTP speculative decoding on a single GB10
Quick Info
MODEL: Qwen/Qwen3.6-35B-A3B-FP8
RUNTIME: vllm
TENSOR PARALLEL: 1
NODES: 1
AUTHOR: Seth Hobson
Recipe YAML
description: vLLM serving Qwen3.6-35B-A3B-FP8 with MTP speculative decoding on a single GB10
model: Qwen/Qwen3.6-35B-A3B-FP8
container: vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7069
  max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 20
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: '1'
  CUDA_MANAGED_FORCE_DEVICE_ALLOC: '1'
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
  OMP_NUM_THREADS: '4'
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --served-model-name Qwen/Qwen3.6-35B-A3B-FP8 qwen3.6-35b-a3b \
    --tensor-parallel-size {tensor_parallel} \
    --port {port} --host {host} \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --attention-backend flashinfer \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{{"method":"mtp","num_speculative_tokens":1}}' \
    --trust-remote-code
recipe_version: '1'
name: Qwen3.6-35B-A3B-FP8-MTP
cluster_only: false
solo_only: true
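For reference, substituting the defaults into the command template yields a concrete invocation. Below is a minimal sketch of launching it by hand with Docker, assuming the NVIDIA Container Toolkit is installed; the Hugging Face cache mount is an optional addition to avoid re-downloading weights. The recipe runner may wire up the container differently, so treat this as illustrative rather than the canonical launch path.

# Manual launch sketch (assumes NVIDIA Container Toolkit; cache mount is optional)
docker run --gpus all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e OMP_NUM_THREADS=4 \
  --entrypoint vllm \
  vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404 \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name Qwen/Qwen3.6-35B-A3B-FP8 qwen3.6-35b-a3b \
  --tensor-parallel-size 1 --port 8000 --host 0.0.0.0 \
  --max-model-len 262144 --max-num-seqs 20 \
  --max-num-batched-tokens 32768 --gpu-memory-utilization 0.7069 \
  --attention-backend flashinfer --kv-cache-dtype fp8 \
  --enable-prefix-caching --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --trust-remote-code

Once up, the server speaks vLLM's OpenAI-compatible API on port 8000. A quick smoke test using the short qwen3.6-35b-a3b alias registered via --served-model-name (the prompt text is illustrative):

# List registered model names, then send a small chat request
curl -s http://localhost:8000/v1/models

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-35b-a3b",
        "messages": [{"role": "user", "content": "In one sentence, what does MTP speculative decoding do?"}],
        "max_tokens": 128
      }'

With the speculative config above, the model drafts one extra token per step via its multi-token-prediction head and verifies it in the same forward pass, so accepted drafts reduce per-token latency while preserving the target model's sampling distribution.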