AASTOCKS Financial News

TENCENT Hunyuan AI Infra Open-Sources Upgraded HPC-Ops Inference Core Operators
2026-06-11 16:46:41
TENCENT Hunyuan announced a system-level upgrade to its HPC-Ops inference operator library, which has evolved from a set of standalone operators into an optimization suite that covers the entire inference pipeline and comprises five key operators.

The upgrade targets real-world engineering bottlenecks on mainstream inference platforms, such as long-tail latency in Attention, GPU memory transfer overhead, and cross-GPU communication. According to the announcement, it significantly outperforms existing open-source baselines on multiple performance metrics.

HPC-Ops is an industrial-grade, high-performance operator library for large-model inference, open-sourced and maintained over the long term by the TENCENT Hunyuan AI Infra team. Key highlights of this upgrade include:

Attention: A runtime dynamic load scheduling scheme addresses the computation imbalance and long-tail latency caused by mixing short and long requests under real workloads. Tests show a speedup of up to 2.95x in long-text scenarios and an improvement of up to 17% in end-to-end QPM (queries per minute).
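
The imbalance described in the Attention item can be illustrated with a toy scheduler. The sketch below is not HPC-Ops code: it assumes attention cost is proportional to sequence length, treats GPU compute units as abstract `workers`, and compares a fixed round-robin assignment with a greedy assignment to the least-loaded worker. The function names and token counts are hypothetical.

```python
import heapq

def static_makespan(lengths, workers):
    """Assign request i to worker i % workers, regardless of its length."""
    load = [0] * workers
    for i, n in enumerate(lengths):
        load[i % workers] += n
    return max(load)  # the batch finishes when the busiest worker finishes

def dynamic_makespan(lengths, workers):
    """Greedy: hand the longest remaining request to the least-loaded worker."""
    heap = [(0, w) for w in range(workers)]  # (current load, worker id)
    heapq.heapify(heap)
    for n in sorted(lengths, reverse=True):
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + n, w))
    return max(load for load, _ in heap)

# Mixed batch: two long requests among short ones (made-up token counts).
lengths = [8000, 100, 100, 100, 8000, 100, 100, 100]
static = static_makespan(lengths, workers=4)    # both long requests land on worker 0
dynamic = dynamic_makespan(lengths, workers=4)  # long requests go to different workers
print(static, dynamic)
```

In this toy batch the fixed assignment stacks both long requests on one worker, so the whole batch waits on it. A production kernel would likely also split a single long sequence across workers, which this sketch does not model.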

Router GEMM: A combination of two BF16 GEMMs delivers FP32-level precision, balancing inference accuracy and GPU utilization. Precision is significantly better than that of conventional BF16/TF32 approaches, with a speedup of up to 3.22x over cuBLAS FP32.
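
The article does not say how the two BF16 GEMMs in the Router GEMM item are combined. One common way to get such an effect, shown here purely as an assumption, is to split an FP32 operand into a BF16 "high" part and a BF16 "low" part holding the rounding residual, then add the two products. The sketch emulates BF16 in NumPy by rounding the float32 bit pattern; `to_bf16` and `dual_bf16_gemm` are hypothetical names, not HPC-Ops APIs.

```python
import numpy as np

def to_bf16(x):
    """Round to bfloat16 precision (nearest-even), stored as float32. Finite inputs only."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    round_bias = ((bits >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    return ((bits + round_bias) & np.uint32(0xFFFF0000)).view(np.float32)

def dual_bf16_gemm(x_bf16, w_fp32):
    """Approximate x @ w with two GEMMs whose inputs all fit in BF16."""
    w_hi = to_bf16(w_fp32)          # leading significand bits of w
    w_lo = to_bf16(w_fp32 - w_hi)   # residual left over after the first rounding
    return x_bf16 @ w_hi + x_bf16 @ w_lo  # float32 accumulation, as assumed here

rng = np.random.default_rng(0)
x = to_bf16(rng.standard_normal((4, 256)))             # activations, already BF16
w = rng.standard_normal((256, 8)).astype(np.float32)   # router weights in FP32

ref = x.astype(np.float64) @ w.astype(np.float64)
err_single = np.abs(x @ to_bf16(w) - ref).max()        # one BF16 GEMM
err_dual = np.abs(dual_bf16_gemm(x, w) - ref).max()    # two BF16 GEMMs
print(err_single, err_dual)
```

Under these assumptions the high and low parts together carry roughly twice the significand bits of a single BF16 value, which recovers most of the precision lost to one BF16 rounding while keeping both multiplications in a low-precision input format.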

FusedMoE: A full-module MoE pipeline integrates the multi-stage process while eliminating GPU memory transfer and kernel launch overhead. Compared with mainstream frameworks such as vLLM and SGLang, performance improves by 1.2-1.6x.

Fused AllReduce+Norm: Cross-GPU communication, residual addition, and normalization computation are deeply fused. The result is a 1.04-1.68x speedup over mainstream solutions including NCCL and FlashInfer.
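
As a reference for what the Fused AllReduce+Norm item computes, the NumPy sketch below writes out the three steps that such a kernel would fold together. It is a plain reference calculation, not the fused implementation: the leading array axis stands in for GPUs, and RMSNorm is an assumption, since the article only says "normalization". `allreduce_add_norm` is a hypothetical name.

```python
import numpy as np

def allreduce_add_norm(partials, residual, weight, eps=1e-6):
    """Reference result of: all-reduce (sum) -> residual add -> RMSNorm."""
    hidden = np.sum(partials, axis=0) + residual   # sum across GPUs, then residual add
    rms = np.sqrt(np.mean(hidden * hidden, axis=-1, keepdims=True) + eps)
    return hidden, hidden / rms * weight           # (new residual, normalized output)

rng = np.random.default_rng(0)
partials = rng.standard_normal((4, 2, 16))   # 4 GPUs, 2 tokens, hidden size 16
residual = rng.standard_normal((2, 16))
new_residual, out = allreduce_add_norm(partials, residual, np.ones(16))
print(out.shape)
```

Run as separate operators, each step reads and writes the hidden state in GPU memory; fusing them is meant to avoid those extra round trips.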

Sampler: Sampling computation in the decoding stage, which originally required more than ten operator steps, is consolidated into two CUDA kernels, significantly reducing scheduling, read-write, and synchronization overhead. The speedup is 4.0-7.5x over vLLM and 1.9-4.7x over FlashInfer, addressing a bottleneck at the final stage of inference.
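
To show why decode-stage sampling tends to break into many small operators, here is a generic temperature / top-k / top-p sampler written step by step in NumPy. The article does not list the steps that HPC-Ops fuses, so the sequence below is an assumption based on common practice, and `sample_token` is a hypothetical name.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """Pick one token id; each numbered line is typically its own operator."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature  # 1. temperature
    order = np.argsort(-logits)                                  # 2. sort descending
    keep = order[:top_k]                                         # 3. top-k cut
    shifted = logits[keep] - logits[keep].max()                  # 4. stabilize
    probs = np.exp(shifted)                                      # 5. exponentiate
    probs /= probs.sum()                                         # 6. normalize
    cum = np.cumsum(probs)                                       # 7. cumulative sum
    cutoff = int(np.searchsorted(cum, top_p)) + 1                # 8. top-p cut
    keep, probs = keep[:cutoff], probs[:cutoff]
    probs /= probs.sum()                                         # 9. renormalize
    return int(rng.choice(keep, p=probs))                        # 10. draw

greedy_like = sample_token([0.0, 0.0, 10.0, 0.0], top_k=1)
nucleus = sample_token([5.0, 1.0, 0.0], top_p=0.5)
print(greedy_like, nucleus)
```

Each intermediate array here would be a separate read and write of GPU memory in an unfused implementation, which is the kind of overhead the item describes.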

Website: www.aastocks.com
