The conference will cover the following directions:
AI inference: In practical production environments, multiple large models must be served across multiple devices. Explore the key challenges in inference serving systems for achieving the best latency and throughput. Discuss the latest trends in new large models, including Sora and Jamba; by examining their architectures, identify the key challenges for the accuracy and inference performance of future large models. Introduce the key challenges of fusing heavy kernels (GEMM and Conv with other elementwise kernels) through automatic code generation for heterogeneous devices such as NPUs.
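To make the fusion challenge concrete, here is a minimal NumPy sketch (function names and the tile size are illustrative, not from any specific serving system) contrasting an unfused GEMM + bias + ReLU pipeline, which writes and re-reads the full result for each elementwise kernel, with a tiled version that applies the elementwise epilogue while each output tile is still hot:

```python
# Sketch only: the memory-traffic saving an automatic code generator
# tries to achieve by fusing elementwise epilogues into a GEMM.
import numpy as np

def gemm_unfused(a, b, bias):
    c = a @ b                   # kernel 1: GEMM writes the full result
    c = c + bias                # kernel 2: elementwise add re-reads/re-writes it
    return np.maximum(c, 0.0)   # kernel 3: ReLU re-reads/re-writes it again

def gemm_fused(a, b, bias, tile=64):
    m, n = a.shape[0], b.shape[1]
    c = np.empty((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        # Compute one output tile and apply the epilogue immediately,
        # so each tile is written to memory only once.
        t = a[i:i + tile] @ b
        c[i:i + tile] = np.maximum(t + bias, 0.0)
    return c

a = np.random.rand(256, 128).astype(np.float32)
b = np.random.rand(128, 512).astype(np.float32)
bias = np.random.rand(512).astype(np.float32)
assert np.allclose(gemm_unfused(a, b, bias), gemm_fused(a, b, bias), atol=1e-4)
```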
AI algorithm: Diffusion models have shown remarkable success in image synthesis and related generative tasks. Reduce the inference cost of diffusion models and improve throughput with quantization and inference-acceleration algorithms. Large-model tuning is a key step in enabling customers to build large-model applications. Propose a novel methodology that continues advances in this field by introducing a backpropagation-free method for updating re-parametrized weights.
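As one concrete flavor of such cost reduction, the sketch below shows symmetric int8 post-training quantization of a weight tensor. The per-tensor absmax scale is an assumption for illustration; production schemes for diffusion models are typically per-channel and calibration-based:

```python
# Sketch only: symmetric int8 quantize/dequantize round trip.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # map the absmax onto the int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by ~scale/2
```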
Residue Number System (RNS): This talk covers the basics of RNS and its applicability to neural network inference and training. It explores the problem of applying RNS to floating-point arithmetic, as well as the latest approaches to division in RNS. Novel math approaches (e.g., the Residue Number System) for computing hardware architectures will be explored, targeting reductions in hardware cost and power consumption. Highlight hardware optimization techniques related to implementing math arithmetic in computing hardware.
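A minimal sketch of the RNS basics, assuming an illustrative moduli set: integers become tuples of residues, addition and multiplication are carry-free and digit-parallel, and the Chinese Remainder Theorem recovers the result:

```python
# Sketch only: RNS arithmetic over a small pairwise-coprime moduli set.
MODULI = (7, 11, 13, 15)          # pairwise coprime; dynamic range M = 15015
M = 1
for m in MODULI:
    M *= m

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):                # per-digit, no carries between moduli
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def rns_mul(a, b):                # likewise digit-parallel
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r):
    # Chinese Remainder Theorem reconstruction.
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)   # pow(..., -1, m): modular inverse
    return x % M

a, b = 1234, 567
assert from_rns(rns_add(to_rns(a), to_rns(b))) == (a + b) % M
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % M
```

Note that while addition and multiplication decompose per digit, comparison, overflow detection, and division do not, which is exactly why division in RNS, and its extension to floating point, remains an actively studied problem.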
Performance tools: Introduce the important features of performance analysis tools, and discuss how to extend their metrics to support scenario-specific performance analysis, such as virtualization.
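As an illustration of metric extension, the sketch below combines raw counter samples into derived metrics (CPI and effective DRAM bandwidth). The counter names and values are hypothetical; a virtualization scenario would plug in per-VM counters the same way:

```python
# Sketch only: turning raw hardware-counter samples into derived metrics.
raw = {
    "cycles": 4.2e9,
    "instructions": 2.8e9,
    "dram_bytes_read": 6.0e9,
    "dram_bytes_written": 2.0e9,
    "elapsed_seconds": 1.5,
}

DERIVED = {
    "CPI": lambda r: r["cycles"] / r["instructions"],
    "DRAM GB/s": lambda r: (r["dram_bytes_read"] + r["dram_bytes_written"])
                           / r["elapsed_seconds"] / 1e9,
}

for name, fn in DERIVED.items():
    print(f"{name}: {fn(raw):.2f}")
```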
DBI: Roofline analysis is a powerful instrument that compares application performance against hardware capabilities, identifies which level of the system limits application performance, and suggests profitable optimization techniques.
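The model behind roofline analysis fits in a few lines: attainable performance is the minimum of peak compute and the product of arithmetic intensity and memory bandwidth. The peak and bandwidth figures below are made up for illustration:

```python
# Sketch only: the roofline bound for a kernel of given arithmetic intensity.
PEAK_GFLOPS = 1500.0      # hypothetical peak compute, GFLOP/s
PEAK_BW_GBS = 100.0       # hypothetical peak DRAM bandwidth, GB/s

def roofline(ai_flops_per_byte):
    return min(PEAK_GFLOPS, ai_flops_per_byte * PEAK_BW_GBS)

# Below the ridge point (15 flop/byte here) a kernel is memory-bound;
# above it, compute-bound.
for ai in (0.25, 2.0, 20.0):
    bound = "memory" if ai * PEAK_BW_GBS < PEAK_GFLOPS else "compute"
    print(f"AI={ai:5.2f} flop/byte -> {roofline(ai):7.1f} GFLOP/s ({bound}-bound)")
```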
Parallel runtime: Introduce a family of memory allocation libraries, including serial and parallel allocators optimized for different size classes, page sizes, and microarchitectures, and compliant (or not) with the C99 and C11 standards.
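The size-class idea these allocators share can be sketched in a few lines. The class list and the pure-Python free lists are purely illustrative; real allocators manage raw pages in C:

```python
# Sketch only: round a request up to a size class and serve it from that
# class's free list, trading internal fragmentation for O(1) allocation.
import bisect

SIZE_CLASSES = [16, 32, 48, 64, 96, 128, 192, 256]   # bytes, illustrative
free_lists = {c: [] for c in SIZE_CLASSES}           # per-class free blocks

def size_class(n):
    i = bisect.bisect_left(SIZE_CLASSES, n)
    if i == len(SIZE_CLASSES):
        raise ValueError("large allocation: falls through to the page allocator")
    return SIZE_CLASSES[i]

def alloc(n):
    c = size_class(n)
    return free_lists[c].pop() if free_lists[c] else bytearray(c)

def free(block):
    free_lists[len(block)].append(block)   # the block's size identifies its class

b = alloc(50)          # rounds up to the 64-byte class
print(len(b))          # 64
free(b)
assert alloc(60) is b  # reused from the 64-byte free list
```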
GCC: Discuss optimization opportunities for GCC, cover the methodology for performance findings and related problems, and present a vision of modern compiler optimizations.
Math library: Modern CPU architectures have tremendously higher core counts and a deeper compute/latency/bandwidth imbalance. Providing scalable linear algebra solvers requires effort across frameworks, algorithms, tools, and more. In large clusters, communication latency/bandwidth characteristics are highly heterogeneous; exploiting the topology is key to efficient distributed sparse solvers.
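The core kernel such solvers spend their time in is a sparse matrix-vector product. The CSR sketch below is illustrative; in a topology-aware distributed solver, each rank would hold a block of rows and overlap this local compute with halo exchanges:

```python
# Sketch only: sparse matrix-vector product over CSR storage.
import numpy as np

# CSR storage of [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
data    = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
indices = np.array([0, 2, 1, 0, 2])   # column index of each nonzero
indptr  = np.array([0, 2, 3, 5])      # row i's nonzeros: data[indptr[i]:indptr[i+1]]

def spmv(data, indices, indptr, x):
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = np.dot(data[lo:hi], x[indices[lo:hi]])
    return y

x = np.array([1.0, 2.0, 3.0])
print(spmv(data, indices, indptr, x))   # [ 7.  6. 17.]
```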