NVIDIA’s technical blog on June 24 published benchmark results for DFlash, a block-diffusion speculative decoding technique developed at UC San Diego, reporting throughput gains of up to 15x over standard autoregressive decoding on Blackwell hardware (NVIDIA Technical Blog). Where conventional speculative decoding generates one candidate token per draft-model forward pass, DFlash uses a lightweight diffusion drafter to propose a full block of tokens in a single pass, conditioned on the target model’s hidden states through KV injection (MarkTechPost). The original UC San Diego paper reports up to 6.08x lossless speedup on Qwen3-8B in SGLang on a single B200 GPU; NVIDIA’s deployment benchmarks on multi-GPU Blackwell configurations extend reported gains to 15x at fixed interactivity constraints (NVIDIA Technical Blog). The team has released 20 DFlash model checkpoints on Hugging Face spanning Qwen, Llama, and Gemma families, with integration shipping in SGLang, vLLM, and TensorRT-LLM (NVIDIA Technical Blog).
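The draft-then-verify loop behind block speculative decoding can be illustrated with a toy sketch. This is not DFlash's implementation: it shows generic greedy verification, in which the target model keeps the longest matching prefix of a drafted block and supplies one token of its own. The names `toy_target_next`, `toy_draft_block`, `verify_block`, and `speculative_decode` are hypothetical, the "models" are simple integer functions, and the diffusion drafter and KV injection are not modeled. A real system would score the whole block in one target forward pass; the sketch calls the target per position for readability.

```python
from typing import Callable, List

Token = int


def toy_target_next(ctx: List[Token]) -> Token:
    """Hypothetical target model: greedy next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10


def toy_draft_block(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: proposes k tokens at once, with a deliberate error."""
    block, last = [], ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 5:  # deliberate mistake so that verification has work to do
            nxt = 0
        block.append(nxt)
        last = nxt
    return block


def verify_block(
    prefix: List[Token],
    draft_block: List[Token],
    target_next: Callable[[List[Token]], Token],
) -> List[Token]:
    """Keep drafted tokens until the first mismatch with the target's greedy
    choice, then append the target's own token (a correction or a bonus)."""
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in draft_block:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # correction replaces the bad draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # whole block accepted: one bonus token
    return accepted


def speculative_decode(
    prompt: List[Token],
    draft_block_fn: Callable[[List[Token], int], List[Token]],
    target_next: Callable[[List[Token]], Token],
    max_new_tokens: int,
    block_size: int = 4,
) -> List[Token]:
    """Repeat draft-then-verify until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        block = draft_block_fn(out, block_size)
        out.extend(verify_block(out, block, target_next))
    return out[: len(prompt) + max_new_tokens]


def autoregressive_decode(
    prompt: List[Token],
    target_next: Callable[[List[Token]], Token],
    max_new_tokens: int,
) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out
```

Because every emitted token is either confirmed or produced by the target, the output matches plain greedy decoding, which is the sense in which such schemes are called lossless. Any speedup comes from how many drafted tokens are accepted per verification step.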
Mistral AI on June 23 released OCR 4, a document extraction model that extends its predecessor’s plain-text output with a structured block-level representation (Mistral AI). Each processed page segment returns a bounding box, a typed classification (covering titles, tables, equations, signatures, and other block categories), and per-word confidence scores, making outputs directly consumable by retrieval-augmented generation pipelines without a post-processing step (MarkTechPost). The model supports 170 languages, runs in a single self-hosted container, and is priced at $4 per 1,000 pages, dropping to $2 with the batch API (VentureBeat, Mistral AI).
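To make the block-level output concrete, here is a minimal sketch of how a retrieval pipeline might consume it. The schema is an assumption for illustration only: the `Word` and `Block` classes, their field names, the block-type strings, and the `blocks_to_chunks` helper are hypothetical and do not reflect Mistral's actual response format.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical per-word score in [0, 1]


@dataclass
class Block:
    bbox: Tuple[float, float, float, float]  # hypothetical (x0, y0, x1, y1)
    block_type: str  # e.g. "title", "table", "equation", "signature"
    words: List[Word]


def blocks_to_chunks(
    blocks: Sequence[Block],
    keep_types: Sequence[str] = ("title", "table", "equation"),
    min_confidence: float = 0.8,
) -> List[Dict]:
    """Turn typed blocks into retrieval chunks, skipping unwanted block types
    and blocks whose mean word confidence falls below the threshold."""
    chunks: List[Dict] = []
    for block in blocks:
        if block.block_type not in keep_types or not block.words:
            continue
        mean_conf = sum(w.confidence for w in block.words) / len(block.words)
        if mean_conf < min_confidence:
            continue
        chunks.append(
            {
                "text": " ".join(w.text for w in block.words),
                "type": block.block_type,
                "bbox": block.bbox,
                "confidence": mean_conf,
            }
        )
    return chunks


# Hypothetical page: a clean title, a signature, and a poorly recognized table.
sample_page = [
    Block((0, 0, 200, 30), "title", [Word("Quarterly", 0.99), Word("Report", 0.97)]),
    Block((0, 700, 120, 740), "signature", [Word("J.", 0.60), Word("Doe", 0.55)]),
    Block((0, 100, 400, 300), "table", [Word("Rev3nue", 0.40), Word("12O", 0.50)]),
]
chunks = blocks_to_chunks(sample_page)
```

With these sample values only the title block survives: the signature is excluded by type and the table by its low mean confidence. Keeping the bounding box and type on each chunk lets a downstream retriever cite the region of the page a passage came from.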
Prime Intellect on June 23 released prime-rl 0.6.0, a reinforcement learning framework targeting post-training of trillion-parameter Mixture-of-Experts models on long-horizon agentic tasks (MarkTechPost). The team reports training GLM-5 on software engineering tasks at up to 131,000-token sequence lengths, with step times under five minutes and a batch size of 256 rollouts using 28 H200 nodes (Prime Intellect). The framework disaggregates trainer and inference processes: the inference side applies FP8 with wide expert parallelism and KV offloading, while training uses three-dimensional parallelism with block-scaled FP8. The architecture is designed to lower the per-step cost of agentic post-training on open-source MoE models at scale (MarkTechPost).