GO
| HSI1 | 24,842.67 | +124.57 | 283.35B |
| HSCEI1 | 8,375.74 | +1.31 | 84.06B |
| Back Zoom + Zoom - Block Traded | |
|
TENCENT Hunyuan AI Infra Open-Sources Upgraded HPC-Ops Inference Core Operators
2026-06-11 16:46:41 TENCENT Hunyuan announced that its HPC-Ops inference operator library has undergone a system-level upgrade, evolving from standalone operators into a comprehensive optimization suite covering the entire inference pipeline, including five key operators. This upgrade effectively addresses real-world engineering bottlenecks on mainstream inference platforms, such as long-tail latency in Attention, GPU memory transfer overhead, and cross-card communication. Multiple performance metrics significantly outperform existing open-source baselines. HPC-Ops is an industrial-grade, high-performance large model inference operator library open-sourced and long maintained by the TENCENT Hunyuan AI Infra team. Key highlights of this upgrade include: Attention: To tackle computation imbalance and long-tail inference issues caused by mixed short and long requests under real workloads, a runtime dynamic load scheduling solution is adopted. Tests show up to 2.95x acceleration for long-text scenarios and up to 17% improvement in end-to-end QPM. Router GEMM: To achieve FP32-level high-precision computation through a dual BF16 GEMM combination, balancing inference accuracy and GPU utilization. Precision is significantly superior to conventional BF16/TF32 solutions, with up to 3.22x speedup compared with CuBLAS FP32. FusedMoE: To establish a full-module MoE pipeline, integrating multi-stage processes while eliminating GPU memory transfer and kernel launch overhead. Compared with mainstream frameworks such as vLLM and SGLang, performance improves by 1.2-1.6x. Fused AllReduce+Norm: To deeply integrate cross-GPU communication, residual addition, and normalization computation. Compared with mainstream solutions including NCCL and FlashInfer, performance achieves 1.04-1.68x acceleration. Sampler: To consolidate sampling computation in the decoding stage, originally requiring more than ten operator steps, into two CUDA kernels, significantly reducing scheduling, read-write, and synchronization overhead. Compared with vLLM, speed increases by 4.0-7.5x, and by 1.9-4.7x versus FlashInfer, addressing inference-end bottlenecks. ~ AASTOCKS Financial News Website: www.aastocks.com | |