ACAI 2023

10 - 14 JULY 2023

ISTANBUL

The Nearest Neighbor Search (NNS) problem involves finding the closest data point or set of points to a given query point based on a specified distance metric. NNS enables applications such as location-based services, social networks and friend recommendations. Finding the exact solution becomes impractical as datasets grow larger, and the curse of dimensionality limits the effectiveness of all NNS algorithms, so approximate methods are used in practice.
In this talk, we will introduce different variations of the nearest neighbor search problem and discuss related technical challenges.
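As a point of reference for the problem statement above, exact NNS can be sketched as a brute-force scan. This is only an illustrative baseline, not a method from the talk; the function names and the choice of Euclidean distance are assumptions. Its cost grows linearly with both the number of points and the dimension, which is why approximate methods are preferred for large datasets.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(points, query, k=1, distance=euclidean):
    """Return the k points closest to `query`, nearest first.

    Exact brute-force search: every point is compared with the query,
    so the cost is O(n * d) distance work per query.
    """
    return heapq.nsmallest(k, points, key=lambda p: distance(p, query))


# Hypothetical 2-D data set, used only for illustration.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 0.5)]
print(nearest_neighbors(points, (1.2, 0.9), k=2))
```

Any other distance function can be passed through the `distance` parameter without changing the search itself.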

The current Ascend inference solution supports only static shapes or lists of static shapes. We are researching and developing a highly optimized graph compiler for the Ascend architecture that effectively supports dynamic-shape operators and control flow with efficient memory management.

Analyzing the performance of MPI applications usually requires non-trivial approaches. Traditional hotspot-based analysis is often misleading because optimizing hot functions might not produce any speedup and may instead increase the time ranks spend waiting for each other.
One solution is to represent the MPI program as a graph (known as a Program Activity Graph) and analyze only the activities on the Critical Path of this graph. Since real-life HPC applications running at large scale often produce huge Program Activity Graphs, the performance of classical graph algorithms is quite poor. Moreover, performance tools based on the Critical Path that use timing information alone provide limited capabilities. In this talk we describe an algorithm for building the Program Activity Graph and calculating the Critical Path that naturally scales to the same number of CPU cores as the profiled MPI application uses. We also show how to combine Critical Path analysis with the hardware-level performance data available on all modern CPUs to enable efficient root-causing of MPI imbalance issues even at very large scale.
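To make the Critical Path idea concrete, here is a minimal sequential sketch that treats a Program Activity Graph as a weighted DAG and finds its longest path. This is the classical algorithm, not the scalable one described in the talk; the node names, edge durations and the `critical_path` function are hypothetical.

```python
from collections import defaultdict, deque


def critical_path(edges):
    """Longest path in a DAG given as (src, dst, duration) edges.

    Returns (total_duration, [nodes on the path]).
    Nodes are processed in topological order (Kahn's algorithm).
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, duration in edges:
        successors[src].append((dst, duration))
        indegree[dst] += 1
        nodes.update((src, dst))

    best = {n: 0.0 for n in nodes}      # longest distance reaching each node
    previous = {n: None for n in nodes}  # predecessor on that longest path
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:
        u = queue.popleft()
        for v, duration in successors[u]:
            if best[u] + duration > best[v]:
                best[v] = best[u] + duration
                previous[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    end = max(nodes, key=lambda n: best[n])
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = previous[node]
    return best[end], path[::-1]


# Hypothetical two-rank trace: rank 1 waits for a message from rank 0.
edges = [
    ("r0:start", "r0:send", 5.0),  # compute on rank 0
    ("r1:start", "r1:recv", 2.0),  # compute on rank 1, then wait
    ("r0:send", "r1:recv", 0.5),   # message transfer
    ("r1:recv", "r1:end", 3.0),    # compute on rank 1 after the receive
    ("r0:send", "r0:end", 1.0),    # remaining work on rank 0
]
total, path = critical_path(edges)
print(total, path)
```

In this toy trace the 2-second compute phase on rank 1 is off the critical path, so optimizing it would only lengthen the wait in the receive, which is the effect the abstract warns about.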

In this report, we will present software solutions developed at Moscow State University that are aimed at ensuring highly efficient operation of supercomputers. For example, we will talk about the flexible monitoring system DiMMon, the TASC software suite for smart HPC workload and job performance analysis, the Octoshell supercomputer support system, and ML-based methods for detecting similar applications.

Deadlocks in MPI programs are usually hard to debug. Traditional debugging approaches are very limited for MPI applications, especially at large scale, and knowing the ranks and MPI calls involved is often not enough to root-cause a deadlock; more context is needed. In this talk we describe how our new approach, which combines traditional MPI tracing with debugging technologies, helps to automatically detect a deadlock and immediately provide full debugging functionality with flexible control of all ranks.

In this talk I will review two use cases in my group for which we are looking forward to leveraging CXL memory expansion as a replacement for persistent memories. The first is ecoHMEM, a recently released framework for automatic data placement in heterogeneous memory systems that has proved successful on KNL- and Optane-based systems. The second use case is HomE: homomorphically encrypted deep learning inference, a recently started ERC grant in which we are looking for the best hardware and software technology to execute production-sized neural networks on remote servers at reasonable speeds with the security guarantees of memory-hungry homomorphic encryption.

Novel hardware design research requires special simulation and synthesis tools that help with area and power estimation for a new chip. In addition, the developed solutions are expected to be compared with existing ones. In this talk, a full chip design flow based on open-source tools will be presented. A fully open solution allows independent researchers to use the final result numbers from competitor papers "as is" and compare them with their own proposal without having to implement the state of the art.

Possibility of synthesis (backend design) of chips using only free software. Differences from commercial products. Possibilities of use in research and publications. Possibilities for further use of the results in commercial products.

Recent SOTA solutions in CV and NLP tasks and their intersections are mainly represented by transformers. However, transformers usually suffer from overparametrization and require excessive computational power, which prevents their use on edge devices and leaves a significant carbon footprint. Compression efficiently reduces model size and computational resources and, as a result, increases inference speed. In this report we discuss recent ideas in several areas of neural network compression that optimize the architecture and find smarter ways of doing inference, including knowledge distillation, pruning, decomposition, quantization and others. We’ll cover both transformers and CNNs, revealing how these architectures can be optimized.

Many existing neural networks operate by the rules of the algebra of real numbers. Various tasks arise in which the original data naturally has a complex-valued format. Multiple recent works have aimed to develop new architectures and building blocks for complex-valued neural networks. We generalize these models by considering other types of hypercomplex numbers of the second order: dual numbers and double numbers. We have developed basic operators for these algebras, such as convolution, activation functions, and batch normalization, and rebuilt a few real-valued networks to use them in the new algebras. We will show the results of the proposed hypercomplex models.
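The three second-order algebras mentioned above differ only in the square of the non-real unit u: u*u = -1 for complex numbers, 0 for dual numbers and +1 for double numbers. The following sketch shows how a single multiplication rule covers all three. The class name `Hypercomplex2` and its fields are assumptions for illustration, not the operators developed by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypercomplex2:
    """Number a + b*u, where u*u equals `unit_sq` (-1, 0 or +1)."""
    re: float
    im: float
    unit_sq: int  # -1: complex, 0: dual, +1: double (split-complex)

    def _check(self, other):
        if self.unit_sq != other.unit_sq:
            raise ValueError("operands belong to different algebras")

    def __add__(self, other):
        self._check(other)
        return Hypercomplex2(self.re + other.re, self.im + other.im, self.unit_sq)

    def __mul__(self, other):
        # (a + b*u)(c + d*u) = (a*c + b*d*u^2) + (a*d + b*c)*u
        self._check(other)
        return Hypercomplex2(
            self.re * other.re + self.unit_sq * self.im * other.im,
            self.re * other.im + self.im * other.re,
            self.unit_sq,
        )


def linear(weight, x, bias):
    """One-weight 'layer' y = w*x + b, evaluated in the operands' algebra."""
    return weight * x + bias


# The same weight and input give different products in each algebra.
for name, sq in (("complex", -1), ("dual", 0), ("double", 1)):
    w = Hypercomplex2(1, 2, sq)
    x = Hypercomplex2(3, 4, sq)
    print(name, w * x)
```

Building blocks such as convolution reduce to sums of products of this kind, so swapping `unit_sq` is enough to move the same arithmetic between algebras in this sketch.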

The Roofline methodology provides a simple way to analyze and visualize application characteristics and hardware capabilities on the same chart. It can detect and track down compute and cache/memory-throughput performance bottlenecks. We will walk through an example of optimizing a stencil algorithm guided by the Roofline methodology.
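The basic Roofline bound can be written in a few lines: attainable performance is the minimum of the compute peak and the product of arithmetic intensity and memory bandwidth. The sketch below uses made-up machine numbers (1000 GFLOP/s peak, 100 GB/s bandwidth) and hypothetical function names purely for illustration.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute roof, memory roof).

    arithmetic_intensity is in FLOP per byte moved to or from memory.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Assumed machine parameters, not measurements of any real system.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

for name, intensity in (("low-intensity stencil", 0.5), ("blocked kernel", 20.0)):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    kind = "memory-bound" if intensity < ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS) else "compute-bound"
    print(f"{name}: at most {bound} GFLOP/s ({kind})")
```

A kernel to the left of the ridge point can only be sped up by raising its arithmetic intensity or by using a faster memory level, which is what Roofline-guided optimization looks for.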

This talk covers the Cache-aware Roofline Model (CARM) and how we recently applied it to optimise epistasis detection (bioinformatics) on both CPUs and GPUs.
If time permits, I will also show some of our recent efforts in applying CARM to sparse computing, including hardware scaling.

The Roofline model can be used to develop efficient HPC applications. While it is visual and has a simple mathematical expression, it also captures a deep understanding of modern issues in HPC and helps to solve them. In this talk, the memory wall problem is addressed, which affects stencil computing, numerical simulations and sparse linear algebra problems. In terms of the Roofline model, such problems have low arithmetic intensity and are memory-bound in their performance. The solution to the memory wall problem involves increasing the locality of calculations. Therefore, it is valid to say that the memory wall problem has transformed into a locality wall problem. For the solution, hard and soft algorithmic methods are used.
As one of the algorithmic solutions, we are developing the method of Locally Recursive non-Locally Asynchronous (LRnLA) algorithm construction.

The simulation of complex physical processes (like filtration in an oil field) requires a large computational grid with billions of cells. Due to the significant computational cost of a single simulation on a fine scale, many optimization problems requiring lots of such simulations become infeasible (e.g., history matching in oil engineering). Thus, the challenge is to develop physics-informed machine learning models that capture the complex features of the problem at the finer scales while working on a coarse scale with reduced computational cost. In this talk I will present the current state of the art and discuss possible new directions.
In particular, a promising new direction is to use graph neural networks to model the interaction among spatial elements participating in the physical process, enabling seamless upscaling with tunable graph operators.

Over the last decade, many methods for improving model accuracy have been proposed. At first, most research focused on architectural inductive biases. Then the focus shifted to model size. Now the most successful papers exploit the trade-off between model size and the number of dataset samples, while in parallel other papers combine relatively small amounts of data with special pretraining tasks and achieve significant accuracy with small models.

k-means clustering is a method that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), which serves as a prototype of the cluster. k-means clustering minimizes within-cluster variances. Any distance measure can be used, even one that does not satisfy the metric axioms. The exact problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in 1957.
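The standard (Lloyd) iteration described above alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster. A minimal single-machine sketch follows; it assumes Euclidean distance and caller-supplied initial centroids, and the function name `lloyd_kmeans` is hypothetical.

```python
import math


def lloyd_kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iteration with Euclidean distance.

    points: list of coordinate tuples; centroids: initial centers.
    Returns (final_centroids, clusters), with clusters[i] holding the
    points assigned to centroid i. Converges to a local optimum only.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data with two well-separated groups.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers, groups = lloyd_kmeans(points, centroids=[(0, 0), (10, 10)])
print(centers)
print(groups)
```

The result depends on the initial centroids, which is the practical consequence of the heuristic converging only to a local optimum.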

We present a distributed implementation of an optimized exact Lloyd algorithm for Apache Spark that is significantly faster than the existing Spark ML implementation in Apache Spark 3. Our approach contains effective local and global pruning strategies based on some novel ideas for Euclidean and cosine distances.

The Leiden community detection algorithm (Traag et al.) significantly improved the partitioning of graphs as measured by modularity. We will discuss how to re-implement it for Apache Spark and talk about some optimizations and approximations for distributed frameworks.
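For readers unfamiliar with the modularity measure mentioned above, the sketch below computes it for an unweighted, undirected graph without self-loops, using the per-community form Q = sum over communities c of (L_c / m - (d_c / 2m)^2), where L_c is the number of edges inside c, d_c is the total degree of its nodes and m is the number of edges. It only evaluates a given partition and is not the Leiden algorithm; the function name and the toy graph are assumptions.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    internal = defaultdict(int)  # L_c: edges with both ends in community c
    degree = defaultdict(int)    # d_c: sum of node degrees in community c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}
print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the graph along the bridge scores higher than keeping everything in one community, which is the kind of improvement a community detection algorithm searches for.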

The XGBoost library is one of the best-known machine learning libraries implementing the Gradient Boosting Decision Trees algorithm. In this talk we will discuss some XGBoost optimization ideas that make model training faster and improve the convergence of the algorithm.

In the classic high-performance computing (HPC) domain, the vast majority of computations are conducted in double-precision 64-bit floating-point numbers (fp64). It has also long been known that single-precision 32-bit (fp32) calculations can be 2 to 14 times faster. The rise of Deep Neural Networks (DNNs) resulted in hardware (HW) capable of processing 16-bit floating-point numbers (fp16, half precision) up to 16x faster in terms of floating-point operations per second (flops) and up to 4x faster in terms of memory bandwidth (bytes/sec). At the same time, the shorter mantissa and exponent of lower-precision numbers may lead to a very quick loss of accuracy, producing wrong computational results without any option to recover them. In this talk, I present the initial steps completed by Kunpeng Math Library for dense linear algebra computations and the future challenges associated with this direction.
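The accuracy loss described above is easy to reproduce with a naive running sum. The sketch below emulates fp16 rounding with the half-precision format of Python's standard `struct` module and compares it with an ordinary fp64 sum. It is only an illustration of the effect, unrelated to the Kunpeng Math Library implementation, and the helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float (fp64) to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate_fp16(value, count):
    """Add `value` to a running total `count` times, rounding to fp16 after each step."""
    step = to_fp16(value)
    total = 0.0
    for _ in range(count):
        total = to_fp16(total + step)
    return total


def accumulate_fp64(value, count):
    """The same running sum in ordinary double precision."""
    total = 0.0
    for _ in range(count):
        total += value
    return total


# Summing 0.01 ten thousand times should give about 100.
print("fp64:", accumulate_fp64(0.01, 10_000))
# In fp16 the total stops growing once the addend is smaller than half the
# spacing between adjacent fp16 values near the total, so it stalls far below 100.
print("fp16:", accumulate_fp16(0.01, 10_000))
```

Techniques such as accumulating in higher precision are one common way to limit this kind of error; which of them a library uses is outside the scope of this sketch.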

In the classic high-performance computing (HPC) domain, one of the key challenges is solving large systems of linear algebraic equations with sparse matrices. Multiple direct and iterative methods have been developed to address this challenge over the past centuries. Modern processors such as Kunpeng pose additional questions about how to implement these methods efficiently to get the result faster. Additional complexity comes from sparse matrix storage formats, which may suit a particular algorithm better or worse. In this talk, I present sparse solvers implemented by Kunpeng Math Library that address some of the challenges mentioned above, as well as future challenges associated with this direction.

The application of lower-level parallel programming models raises concerns about complexity, correctness, portability and maintainability. To address all of these concerns, parallel programming approaches must raise the level of abstraction above that of lower-level APIs. We present the experience we gained using the DVMH model while developing scientific applications in various fields of research. Discussing the typical steps of parallel program development, we outline the opportunities the DVMH model provides to express parallelism in programs with irregular grids.

The report presents an analysis of the results of a series of experimental studies aimed at developing a cooling system for heat-stressed elements in microelectronics using spray irrigation.
The first part discusses the current state of research on cooling heat-stressed surfaces using spray irrigation, along with the achievements and open problems in this area. It also formulates the goals and objectives of the study: developing a modern cooling system for chip models of different sizes under spray irrigation with dielectric liquids, intended for electronic equipment in data centers.
The second part describes the experimental stand, the methodology for conducting experiments, and the ranges of the investigated operating parameters.

We designed a memory allocator for the Kunpeng-920 processor as part of the Kunpeng General Runtime Extension & Acceleration Technology (GREAT) project. It is tuned for the SPEC CPU 2017 Integer Rate benchmarks and brings a 5% geomean performance improvement over the state-of-the-art jemalloc memory allocator. We continue to develop our parallel runtime features.

This talk consists of two parts. First, we explain how it is possible to improve the performance of a concurrent index data structure almost out of thin air using lock-based buckets. Second, we show how it is possible to make a hand-over-hand locking procedure lock-free or even wait-free on hierarchical (tree) data structures.

We have developed a methodology and toolkit for architects who lead promising hardware development and we want to present it.

An elementary topos is a category that satisfies specific properties which make it similar to the category of sets. Toposes have a specific internal logic, and this important property can be used to apply topos theory to AI. I will show some examples of toposes and some of their general properties. After that, we will discuss some ideas for applying toposes to image recognition.

In his article, Douglas Heaven (MIT Technology Review, Dec 2021) declared 2021 "the year of monster AI models". Training such models requires very large-scale infrastructure such as a dedicated cloud. In addition, training such models in a reasonable time requires the use of dedicated hardware.
In my talk, I will emphasize why specialized hardware for training is needed, present some of the currently proposed solutions, and extend the discussion to the challenges future systems still need to address.

We present the key challenges, lessons learned and opportunities in efficient generative inference for Transformer-based models (like ChatGPT) in the most challenging settings: huge models that do not fit in the memory of a single device, tight latency limits and long sequence lengths. The unique combination of technical solutions that we discuss in this talk, coupled with multi-dimensional partitioning techniques optimized for Ascend-powered inference, unlocks performance for LLMs, the most rapidly growing area of AI.

The new inference engine provides an efficient and convenient API for deep learning model inference on Ascend accelerators. A pre- and post-processing library is part of the project. Its functionality includes video and image decoding and encoding and various image processing operations, including color conversion, geometric image transformations, filtering, etc. Pre- and post-processing operations can be run individually or attached to the model graph and executed as part of the inference. The key objectives are a convenient API that simplifies the inference process and decent performance that does not slow inference down. For each operation, several alternative implementations are provided, for the host CPU and for Ascend. While in theory Ascend accelerators provide much higher peak performance (tens or even hundreds of teraflops) than a CPU, they are mostly targeted at efficient deep learning inference, i.e. at speeding up matrix multiplication, convolution, activation operations, etc. It is often quite difficult to squeeze great performance out of the accelerators on image processing operations. We will give an overview of our project and the challenges we face in delivering the most efficient pre- and post-processing algorithms to the inference engine users.
