Floating-Point Arithmetic

Lecturers

Claude-Pierre Jeannerod, Guillaume Melquiond, and Jean-Michel Muller.

Description

Calculations performed on a computer rely on the arithmetic of the underlying processor, which is most often floating-point (FP) arithmetic. In recent years, FP arithmetic has been undergoing a major evolution: the classical IEEE-754 specification is being complemented with new formats and instructions, and also with alternative number systems. This evolution, driven mostly by new computational needs in artificial intelligence, is pushing towards the systematic use of very low-precision FP formats. While low-precision FP arithmetics have obvious performance advantages, they raise many questions when it comes to certifying the quality of the numerical computations that rely on them. Such correctness guarantees are increasingly necessary in many emerging application domains (such as autonomous vehicles), which in turn requires new approaches to the design and analysis of FP algorithms. This course will offer a timely and comprehensive treatment of these recent developments, covering in particular the following topics:

Foundations of computer arithmetic: number systems, arithmetic algorithms, hardware designs
The latest (2019) revision of the IEEE-754 standard for FP arithmetic, and the revision currently in preparation
Beyond IEEE-754: low-precision arithmetics and tensor-processing units
New error-analysis techniques: mixed-precision analysis, probabilistic models, statistical error analysis
Fast and accurate function evaluation: algorithms, certification tools, and libm implementation
Tools for computer-assisted analysis of numerical programs

Some references

The Mathematical-Function Computation Handbook, N. Beebe, Springer, 2017.
Computer Arithmetic and Formal Proofs: Verifying Floating-point Algorithms with the Coq System, S. Boldo and G. Melquiond, ISTE Press, 2017.
Handbook of Floating-Point Arithmetic, J.-M. Muller et al., Birkhäuser, 2018.
FP8 Formats for Deep Learning, NVIDIA, ARM, Intel, 2022.
Mixed precision algorithms in numerical linear algebra, N. J. Higham and T. Mary, Acta Numerica, 2022.
Floating-Point Arithmetic, S. Boldo, C.-P. Jeannerod, G. Melquiond, J.-M. Muller, Acta Numerica, 2023.
Arithmetic Formats for Machine Learning, IEEE SA Working Group P3109, 2023-2024.
Hardware Trends Impacting Floating-Point Computations In Scientific Applications, J. Dongarra et al., arXiv, 2024.