HSI1 24,842.67 +124.57 283.35B
HSCEI1 8,375.74 +1.31 84.06B
Back    Zoom +    Zoom - Block Traded
TENCENT Hunyuan AI Infra Open-Sources Upgraded HPC-Ops Inference Core Operators
2026-06-11 16:46:41
TENCENT Hunyuan announced that its HPC-Ops inference operator library has undergone a system-level upgrade, evolving from standalone operators into a comprehensive optimization suite covering the entire inference pipeline, including five key operators.

This upgrade effectively addresses real-world engineering bottlenecks on mainstream inference platforms, such as long-tail latency in Attention, GPU memory transfer overhead, and cross-card communication. Multiple performance metrics significantly outperform existing open-source baselines.

HPC-Ops is an industrial-grade, high-performance large model inference operator library open-sourced and long maintained by the TENCENT Hunyuan AI Infra team. Key highlights of this upgrade include:

Attention: To tackle computation imbalance and long-tail inference issues caused by mixed short and long requests under real workloads, a runtime dynamic load scheduling solution is adopted. Tests show up to 2.95x acceleration for long-text scenarios and up to 17% improvement in end-to-end QPM.

Router GEMM: To achieve FP32-level high-precision computation through a dual BF16 GEMM combination, balancing inference accuracy and GPU utilization. Precision is significantly superior to conventional BF16/TF32 solutions, with up to 3.22x speedup compared with CuBLAS FP32.

FusedMoE: To establish a full-module MoE pipeline, integrating multi-stage processes while eliminating GPU memory transfer and kernel launch overhead. Compared with mainstream frameworks such as vLLM and SGLang, performance improves by 1.2-1.6x.

Fused AllReduce+Norm: To deeply integrate cross-GPU communication, residual addition, and normalization computation. Compared with mainstream solutions including NCCL and FlashInfer, performance achieves 1.04-1.68x acceleration.

Sampler: To consolidate sampling computation in the decoding stage, originally requiring more than ten operator steps, into two CUDA kernels, significantly reducing scheduling, read-write, and synchronization overhead. Compared with vLLM, speed increases by 4.0-7.5x, and by 1.9-4.7x versus FlashInfer, addressing inference-end bottlenecks.
~

AASTOCKS Financial News
Website: www.aastocks.com