Blickinsbuch.de - Handbook of Floating-Point Arithmetic - Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond

    Handbook of Floating-Point Arithmetic

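    Authors: Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond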

    Publisher: Springer Basel AG

    Published: December 2009
    Pages: 572
    Language: English
    Price: 117.69 €
    Illustrations: 100 black-and-white illustrations, 100 black-and-white line drawings
    Dimensions: 264 x 187 x 43
    Binding: Hardcover
    ISBN: 9780817647049

    Table of Contents

    Preface ... xv
    List of Figures ... xvii
    List of Tables ... xxi
    I Introduction, Basic Definitions, and Standards ... 1
    1 Introduction ... 3
    1.1 Some History ... 3
    1.2 Desirable Properties ... 6
    1.3 Some Strange Behaviors ... 7
    1.3.1 Some famous bugs ... 7
    1.3.2 Difficult problems ... 8
    2 Definitions and Basic Notions ... 13
    2.1 Floating-Point Numbers ... 13
    2.2 Rounding ... 20
    2.2.1 Rounding modes ... 20
    2.2.2 Useful properties ... 22
    2.2.3 Relative error due to rounding ... 23
    2.3 Exceptions ... 25
    2.4 Lost or Preserved Properties of the Arithmetic on the Real Numbers ... 27
    2.5 Note on the Choice of the Radix ... 29
    2.5.1 Representation errors ... 29
    2.5.2 A case for radix 10 ... 30
    2.6 Tools for Manipulating Floating-Point Errors ... 32
    2.6.1 The ulp function ... 32
    2.6.2 Errors in ulps and relative errors ... 37
    2.6.3 An example: iterated products ... 37
    2.6.4 Unit roundoff ... 39
    2.7 Note on Radix Conversion ... 40
    2.7.1 Conditions on the formats ... 40
    2.7.2 Conversion algorithms ... 43
    2.8 The Fused Multiply-Add (FMA) Instruction ... 51
    2.9 Interval Arithmetic ... 51
    2.9.1 Intervals with floating-point bounds ... 52
    2.9.2 Optimized rounding ... 52
    3 Floating-Point Formats and Environment ... 55
    3.1 The IEEE 754-1985 Standard ... 56
    3.1.1 Formats specified by IEEE 754-1985 ... 56
    3.1.2 Little-endian, big-endian ... 60
    3.1.3 Rounding modes specified by IEEE 754-1985 ... 61
    3.1.4 Operations specified by IEEE 754-1985 ... 62
    3.1.5 Exceptions specified by IEEE 754-1985 ... 66
    3.1.6 Special values ... 69
    3.2 The IEEE 854-1987 Standard ... 70
    3.2.1 Constraints internal to a format ... 70
    3.2.2 Various formats and the constraints between them ... 71
    3.2.3 Conversions between floating-point numbers and decimal strings ... 72
    3.2.4 Rounding ... 73
    3.2.5 Operations ... 73
    3.2.6 Comparisons ... 74
    3.2.7 Exceptions ... 74
    3.3 The Need for a Revision ... 74
    3.3.1 A typical problem: "double rounding" ... 75
    3.3.2 Various ambiguities ... 77
    3.4 The New IEEE 754-2008 Standard ... 79
    3.4.1 Formats specified by the revised standard ... 80
    3.4.2 Binary interchange format encodings ... 81
    3.4.3 Decimal interchange format encodings ... 82
    3.4.4 Larger formats ... 92
    3.4.5 Extended and extendable precisions ... 92
    3.4.6 Attributes ... 93
    3.4.7 Operations specified by the standard ... 97
    3.4.8 Comparisons ... 99
    3.4.9 Conversions ... 99
    3.4.10 Default exception handling ... 100
    3.4.11 Recommended transcendental functions ... 103
    3.5 Floating-Point Hardware in Current Processors ... 104
    3.5.1 The common hardware denominator ... 104
    3.5.2 Fused multiply-add ... 104
    3.5.3 Extended precision ... 104
    3.5.4 Rounding and precision control ... 105
    3.5.5 SIMD instructions ... 106
    3.5.6 Floating-point on x86 processors: SSE2 versus x87 ... 106
    3.5.7 Decimal arithmetic ... 107
    3.6 Floating-Point Hardware in Recent Graphics Processing Units ... 108
    3.7 Relations with Programming Languages ... 109
    3.7.1 The Language Independent Arithmetic (LIA) standard ... 109
    3.7.2 Programming languages ... 110
    3.8 Checking the Environment ... 110
    3.8.1 MACHAR ... 111
    3.8.2 Paranoia ... 111
    3.8.3 UCBTest ... 115
    3.8.4 TestFloat ... 116
    3.8.5 IeeeCC754 ... 116
    3.8.6 Miscellaneous ... 116
    II Cleverly Using Floating-Point Arithmetic ... 117
    4 Basic Properties and Algorithms ... 119
    4.1 Testing the Computational Environment ... 119
    4.1.1 Computing the radix ... 119
    4.1.2 Computing the precision ... 121
    4.2 Exact Operations ... 122
    4.2.1 Exact addition ... 122
    4.2.2 Exact multiplications and divisions ... 124
    4.3 Accurate Computations of Sums of Two Numbers ... 125
    4.3.1 The Fast2Sum algorithm ... 126
    4.3.2 The 2Sum algorithm ... 129
    4.3.3 If we do not use rounding to nearest ... 131
    4.4 Computation of Products ... 132
    4.4.1 Veltkamp splitting ... 132
    4.4.2 Dekker's multiplication algorithm ... 135
    4.5 Complex numbers ... 139
    4.5.1 Various error bounds ... 140
    4.5.2 Error bound for complex multiplication ... 141
    4.5.3 Complex division ... 144
    4.5.4 Complex square root ... 149
    5 The Fused Multiply-Add Instruction ... 151
    5.1 The 2MultFMA Algorithm ... 152
    5.2 Computation of Residuals of Division and Square Root ... 153
    5.3 Newton-Raphson-Based Division with an FMA ... 155
    5.3.1 Variants of the Newton-Raphson iteration ... 155
    5.3.2 Using the Newton-Raphson iteration for correctly rounded division ... 160
    5.4 Newton-Raphson-Based Square Root with an FMA ... 167
    5.4.1 The basic iterations ... 167
    5.4.2 Using the Newton-Raphson iteration for correctly rounded square roots ... 168
    5.5 Multiplication by an Arbitrary-Precision Constant ... 171
    5.5.1 Checking for a given constant C if Algorithm 5.2 will always work ... 172
    5.6 Evaluation of the Error of an FMA ... 175
    5.7 Evaluation of Integer Powers ... 177
    6 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values ... 181
    6.1 Preliminaries ... 182
    6.1.1 Floating-point arithmetic models ... 183
    6.1.2 Notation for error analysis and classical error estimates ... 184
    6.1.3 Properties for deriving running error bounds ... 187
    6.2 Computing Validated Running Error Bounds ... 188
    6.3 Computing Sums More Accurately ... 190
    6.3.1 Reordering the operands, and a bit more ... 190
    6.3.2 Compensated sums ... 192
    6.3.3 Implementing a "long accumulator" ... 199
    6.3.4 On the sum of three floating-point numbers ... 199
    6.4 Compensated Dot Products ... 201
    6.5 Compensated Polynomial Evaluation ... 203
    7 Languages and Compilers ... 205
    7.1 A Play with Many Actors ... 205
    7.1.1 Floating-point evaluation in programming languages ... 206
    7.1.2 Processors, compilers, and operating systems ... 208
    7.1.3 In the hands of the programmer ... 209
    7.2 Floating Point in the C Language ... 209
    7.2.1 Standard C99 headers and IEEE 754-1985 support ... 209
    7.2.2 Types ... 210
    7.2.3 Expression evaluation ... 213
    7.2.4 Code transformations ... 216
    7.2.5 Enabling unsafe optimizations ... 217
    7.2.6 Summary: a few horror stories ... 218
    7.3 Floating-Point Arithmetic in the C++ Language ... 220
    7.3.1 Semantics ... 220
    7.3.2 Numeric limits ... 221
    7.3.3 Overloaded functions ... 222
    7.4 FORTRAN Floating Point in a Nutshell ... 223
    7.4.1 Philosophy ... 223
    7.4.2 IEEE 754 support in FORTRAN ... 226
    7.5 Java Floating Point in a Nutshell ... 227
    7.5.1 Philosophy ... 227
    7.5.2 Types and classes ... 228
    7.5.3 Infinities, NaNs, and signed zeros ... 230
    7.5.4 Missing features ... 231
    7.5.5 Reproducibility ... 232
    7.5.6 The BigDecimal package ... 233
    7.6 Conclusion ... 234
    III Implementing Floating-Point Operators ... 237
    8 Algorithms for the Five Basic Operations ... 239
    8.1 Overview of Basic Operation Implementation ... 239
    8.2 Implementing IEEE 754-2008 Rounding ... 241
    8.2.1 Rounding a nonzero finite value with unbounded exponent range ... 241
    8.2.2 Overflow ... 243
    8.2.3 Underflow and subnormal results ... 244
    8.2.4 The inexact exception ... 245
    8.2.5 Rounding for actual operations ... 245
    8.3 Floating-Point Addition and Subtraction ... 246
    8.3.1 Decimal addition ... 249
    8.3.2 Decimal addition using binary encoding ... 250
    8.3.3 Subnormal inputs and outputs in binary addition ... 251
    8.4 Floating-Point Multiplication ... 251
    8.4.1 Normal case ... 252
    8.4.2 Handling subnormal numbers in binary multiplication ... 252
    8.4.3 Decimal specifics ... 253
    8.5 Floating-Point Fused Multiply-Add ... 254
    8.5.1 Case analysis for normal inputs ... 254
    8.5.2 Handling subnormal inputs ... 258
    8.5.3 Handling decimal cohorts ... 259
    8.5.4 Overview of a binary FMA implementation ... 259
    8.6 Floating-Point Division ... 262
    8.6.1 Overview and special cases ... 262
    8.6.2 Computing the significand quotient ... 263
    8.6.3 Managing subnormal numbers ... 264
    8.6.4 The inexact exception ... 265
    8.6.5 Decimal specifics ... 265
    8.7 Floating-Point Square Root ... 265
    8.7.1 Overview and special cases ... 265
    8.7.2 Computing the significand square root ... 266
    8.7.3 Managing subnormal numbers ... 267
    8.7.4 The inexact exception ... 267
    8.7.5 Decimal specifics ... 267
    9 Hardware Implementation of Floating-Point Arithmetic ... 269
    9.1 Introduction and Context ... 269
    9.1.1 Processor internal formats ... 269
    9.1.2 Hardware handling of subnormal numbers ... 270
    9.1.3 Full-custom VLSI versus reconfigurable circuits ... 271
    9.1.4 Hardware decimal arithmetic ... 272
    9.1.5 Pipelining ... 273
    9.2 The Primitives and Their Cost ... 274
    9.2.1 Integer adders ... 274
    9.2.2 Digit-by-integer multiplication in hardware ... 280
    9.2.3 Using nonstandard representations of numbers ... 280
    9.2.4 Binary integer multiplication ... 281
    9.2.5 Decimal integer multiplication ... 283
    9.2.6 Shifters ... 284
    9.2.7 Leading-zero counters ... 284
    9.2.8 Tables and table-based methods for fixed-point function approximation ... 286
    9.3 Binary Floating-Point Addition ... 288
    9.3.1 Overview ... 288
    9.3.2 A first dual-path architecture ... 289
    9.3.3 Leading-zero anticipation ... 291
    9.3.4 Probing further on floating-point adders ... 295
    9.4 Binary Floating-Point Multiplication ... 296
    9.4.1 Basic architecture ... 296
    9.4.2 FPGA implementation ... 296
    9.4.3 VLSI implementation optimized for delay ... 298
    9.4.4 Managing subnormals ... 301
    9.5 Binary Fused Multiply-Add ... 302
    9.5.1 Classic architecture ... 303
    9.5.2 To probe further ... 305
    9.6 Division ... 305
    9.6.1 Digit-recurrence division ... 306
    9.6.2 Decimal division ... 309
    9.7 Conclusion: Beyond the FPU ... 309
    9.7.1 Optimization in context of standard operators ... 310
    9.7.2 Operation with a constant operand ... 311
    9.7.3 Block floating point ... 313
    9.7.4 Specific architectures for accumulation ... 313
    9.7.5 Coarser-grain operators ... 317
    9.8 Probing Further ... 320
    10 Software Implementation of Floating-Point Arithmetic ... 321
    10.1 Implementation Context ... 322
    10.1.1 Standard encoding of binary floating-point data ... 322
    10.1.2 Available integer operators ... 323
    10.1.3 First examples ... 326
    10.1.4 Design choices and optimizations ... 328
    10.2 Binary Floating-Point Addition ... 329
    10.2.1 Handling special values ... 330
    10.2.2 Computing the sign of the result ... 332
    10.2.3 Swapping the operands and computing the alignment shift ... 333
    10.2.4 Getting the correctly rounded result ... 335
    10.3 Binary Floating-Point Multiplication ... 341
    10.3.1 Handling special values ... 341
    10.3.2 Sign and exponent computation ... 343
    10.3.3 Overflow detection ... 345
    10.3.4 Getting the correctly rounded result ... 346
    10.4 Binary Floating-Point Division ... 349
    10.4.1 Handling special values ... 350
    10.4.2 Sign and exponent computation ... 351
    10.4.3 Overflow detection ... 354
    10.4.4 Getting the correctly rounded result ... 355
    10.5 Binary Floating-Point Square Root ... 361
    10.5.1 Handling special values ... 362
    10.5.2 Exponent computation ... 364
    10.5.3 Getting the correctly rounded result ... 365
    IV Elementary Functions ... 373
    11 Evaluating Floating-Point Elementary Functions ... 375
    11.1 Basic Range Reduction Algorithms ... 379
    11.1.1 Cody and Waite's reduction algorithm ... 379
    11.1.2 Payne and Hanek's algorithm ... 381
    11.2 Bounding the Relative Error of Range Reduction ... 382
    11.3 More Sophisticated Range Reduction Algorithms ... 384
    11.3.1 An example of range reduction for the exponential function ... 386
    11.3.2 An example of range reduction for the logarithm ... 387
    11.4 Polynomial or Rational Approximations ... 388
    11.4.1 L² case ... 389
    11.4.2 L∞, or minimax case ... 390
    11.4.3 "Truncated" approximations ... 392
    11.5 Evaluating Polynomials ... 393
    11.6 Correct Rounding of Elementary Functions to binary64 ... 394
    11.6.1 The Table Maker's Dilemma and Ziv's onion peeling strategy ... 394
    11.6.2 When the TMD is solved ... 395
    11.6.3 Rounding test ... 396
    11.6.4 Accurate second step ... 400
    11.6.5 Error analysis and the accuracy/performance tradeoff ... 401
    11.7 Computing Error Bounds ... 402
    11.7.1 The point with efficient code ... 402
    11.7.2 Example: a "double-double" polynomial evaluation ... 403
    12 Solving the Table Maker's Dilemma ... 405
    12.1 Introduction ... 405
    12.1.1 The Table Maker's Dilemma ... 406
    12.1.2 Brief history of the TMD ... 410
    12.1.3 Organization of the chapter ... 411
    12.2 Preliminary Remarks on the Table Maker's Dilemma ... 412
    12.2.1 Statistical arguments: what can be expected in practice ... 412
    12.2.2 In some domains, there is no need to find worst cases ... 416
    12.2.3 Deducing the worst cases from other functions or domains ... 419
    12.3 The Table Maker's Dilemma for Algebraic Functions ... 420
    12.3.1 Algebraic and transcendental numbers and functions ... 420
    12.3.2 The elementary case of quotients ... 422
    12.3.3 Around Liouville's theorem ... 424
    12.3.4 Generating bad rounding cases for the square root using Hensel 2-adic lifting ... 425
    12.4 Solving the Table Maker's Dilemma for Arbitrary Functions ... 429
    12.4.1 Lindemann's theorem: application to some transcendental functions ... 429
    12.4.2 A theorem of Nesterenko and Waldschmidt ... 430
    12.4.3 A first method: tabulated differences ... 432
    12.4.4 From the TMD to the distance between a grid and a segment ... 434
    12.4.5 Linear approximation: Lefèvre's algorithm ... 436
    12.4.6 The SLZ algorithm ... 443
    12.4.7 Periodic functions on large arguments ... 448
    12.5 Some Results ... 449
    12.5.1 Worst cases for the exponential, logarithmic, trigonometric, and hyperbolic functions ... 449
    12.5.2 A special case: integer powers ... 458
    12.6 Current Limits and Perspectives ... 458
    V Extensions ... 461
    13 Formalisms for Certifying Floating-Point Algorithms ... 463
    13.1 Formalizing Floating-Point Arithmetic ... 463
    13.1.1 Defining floating-point numbers ... 464
    13.1.2 Simplifying the definition ... 466
    13.1.3 Defining rounding operators ... 467
    13.1.4 Extending the set of numbers ... 470
    13.2 Formalisms for Certifying Algorithms by Hand ... 471
    13.2.1 Hardware units ... 471
    13.2.2 Low-level algorithms ... 472
    13.2.3 Advanced algorithms ... 473
    13.3 Automating Proofs ... 474
    13.3.1 Computing on bounds ... 475
    13.3.2 Counting digits ... 477
    13.3.3 Manipulating expressions ... 479
    13.3.4 Handling the relative error ... 483
    13.4 Using Gappa ... 484
    13.4.1 Toy implementation of sine ... 484
    13.4.2 Integer division on Itanium ... 488
    14 Extending the Precision ... 493
    14.1 Double-Words, Triple-Words ... 494
    14.1.1 Double-word arithmetic ... 495
    14.1.2 Static triple-word arithmetic ... 498
    14.1.3 Quad-word arithmetic ... 500
    14.2 Floating-Point Expansions ... 503
    14.3 Floating-Point Numbers with Batched Additional Exponent ... 509
    14.4 Large Precision Relying on Processor Integers ... 510
    14.4.1 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic ... 512
    14.4.2 A brief introduction to arbitrary-precision integer arithmetic ... 513
    VI Perspectives and Appendix ... 517
    15 Conclusion and Perspectives ... 519
    16 Appendix: Number Theory Tools for Floating-Point Arithmetic ... 521
    16.1 Continued Fractions ... 521
    16.2 The LLL Algorithm ... 524
    Bibliography ... 529
    Index ... 567

    Index

    2Mul, 135, 318
    2MultFMA, 152
    2Sum, 129, 318
    accumulator, 314
    accurate step, 396
    ACL2, 471
    addition, 246
        of binary floating-point in hardware, 288
        of binary floating-point in software, 329
        of integers, 274
        of integers in decimal, 275
        of signed zeros, 247
        subnormal handling, 251, 294
    additive range reduction, 378
    Al-Khwarizmi, 167
    algebraic function, 421, 424
    algebraic number, 420
    α (smallest normal number), 17, 153
    alternate exception-handling attributes, 93, 95
    argument reduction, 378
    arithmetic formats, 80
    ARRE (average relative representation error), 30
    attributes, 93
    Babai's algorithm, 528
    Babylonians, 4
    backward error, 186
    bad cases for the TMD, 409
    base, 13
    basic formats, 56, 80
    BCD (binary coded decimal), 83
    Benford's law, 29
    bias, 57, 82, 84, 85
    biased exponent, 58-60, 84-86, 245
    big-endian, 61
    binade, 415
    binary128, 82
    binary16, 82
    binary32, 16, 82
    binary64, 82
    binding, 109
    bipartite table method, 287
    block floating-point, 313
    Booth recoding, 281
    breakpoint, 21, 406
    Briggs, 375
    Burger and Dybvig conversion algorithm, 44
    C programming language, 209
    C++ programming language, 220
    C99, 210
    cancellation, 124, 193
    canonical encoding, 83, 85
    carry-ripple adder, 274
    carry-select adder, 279
    carry-skip adder, 276, 277
    catastrophic cancellation, 124, 378
    CENA, 182
    Chebyshev polynomials, 389
        theorem, 391
    Clinger conversion algorithms, 46
    close path, 249
    closest vector problem, 525
    Cody, 29
    Cody and Waite reduction algorithm, 379
    cohort, 14, 83, 97, 240
    combination field, 83, 84
    comparisons, 65, 99
    comparison predicates, 65, 99
    CompensatedDotProduct algorithm, 202
    compensated algorithms, 182
    compensated polynomial evaluation, 203
    compensated summation, 192
    component of an expansion, 503
    compound adder, 278, 299
    compression of expansions, 508
    condition numbers, 187
    continued fractions, 382, 521, 522
    contracted expressions, 214
    convergent (continued fractions), 522
    conversion algorithms, 43
    Coq, 472
    CORDIC algorithm, 375
    correctly rounded function, 21, 22
    CRlibm, 381
    CVP, see closest vector problem
    data dependency, 273
    DblMult, 178
    decimal addition, 275
    decimal arithmetic in hardware, 272
    decimal division, 309
    decimal multiplication, 283
    decimal encoding, 85
    declet, 83
    degree of an algebraic number, 421
    Dekker, 126, 135
    Dekker product, 125, 135, 473
    delay, 273
    denormal number, 15
    directed rounding modes, 22
    division, 262
        SRT algorithms, 308
        by a constant, 312
        by digit recurrence, 263, 306
        in decimal, 309
        in hardware, 305
    division by zero, 25, 67, 101
    double-double numbers, 403
    double-word addition, 497
    double-word multiplication, 498
    double precision, 56, 57, 61, 64, 65, 71, 82
    double rounding, 75, 77, 114
    DSP (digital signal processing), 297, 316
    dynamic range, 30
    elementary function, 421
    e_max, 14
    e_min, 14
    end-around carry adder, 279
    endianness, 61
    Ercegovac, 263
    ErrFma, 176
    Estrin's method, 394
    Euclidean lattice, 446, 524
    exactly rounded function, 21
    exceptions, 25, 66, 74, 95, 100, 475
    exclusion lemma, 162, 422, 423
    exclusion zone, 162
    expansion, 503
    Expansion-Sum algorithm, 505
    exponent bias, 85
    extendable precision, 80, 92
    extended formats, 56
    extended precision, 71, 72, 80, 92, 94
    extremal exponents, 14
    faithful arithmetic, 22, 131
    faithful result, 22, 179, 311
    faithful rounding, 22
    far path, 249
    Fast-Expansion-Sum algorithm, 506
    Fast2Sum, 126, 127
    <fenv.h>, 210
    field-programmable gate array, 271, 279, 287
    fixed point, 313, 314, 317, 477
    FLIP, 321
    <float.h>, 210
    FMA, 51, 104, 254, 472
        binary implementation, 259
        decimal, 259
        hardware implementation, 302
        subnormal handling, 258, 305
    FORTRAN, 223
    FPGA, see field-programmable gate array
    FPGA specific operators, 309
    fraction, 16, 56
    full adder, 275
    fused multiply-add, see FMA
    gamma function, 409
    γ notation, 184
    Gappa, 474
    Gay conversion algorithms, 44, 46
    Goldschmidt iteration, 160
    GPGPU (general-purpose graphics processing units), 108
    GPU (graphical processing unit), 108, 271
    graceful underflow, 17
    gradual underflow, 17, 53
    Grow-Expansion algorithm, 505
    Haar condition, 392
    hardness to round, 409
    Harrison, 119
    Heron iteration, 167
    hidden bit convention, 16
    Higham, 190
    HOL Light, 472
    Horner algorithm, 185, 394, 473
        running error bound, 189
    IeeeCC754, 116
    IEEE 754-1985 standard, 5, 55, 56
    IEEE 754-2008 standard, 79
    IEEE 854-1987 standard, 6, 70
    ILP (instruction-level parallelism), 328
    implicit bit convention, 16, 29, 30
    inclusion property, 51
    inexact exception, 25, 69, 102, 103, 245, 265, 267
    infinitary, 25
    infinitely precise significand, 15, 21, 407, 422, 423
    insertion summation, 191
    integer multiplication, 281
    integer powers, 177
    integral significand, 14, 16, 83, 86, 422, 423
    interchange formats, 80
    interval arithmetic, 51, 475
    INTLAB, 510
    invalid operation exception, 20, 25, 67, 69, 82, 100
    is normal bit, 245, 295, 301, 303, 322
    IteratedProductPower, 178
    iterated products, 37
    Java, 227
    K-fold summation algorithm, 196
    Kahan, 5, 8, 17, 32, 126, 405
    Karatsuba's complex multiplication, 144
    Knuth, 129
    Kulisch, 316
    L² polynomial approximations, 376
    L∞ polynomial approximations, 376
    Lang, 263
    language, 205
    large accumulator, 314
    largest finite number, 16, 17, 67, 102
    latency, 273
    leading bit convention, 16
    leading-zero anticipation, 286, 291, 295, 303
    leading-zero counter, 284, 289-291, 293, 319
        by monotonic string conversion, 285
        combined with shifter, 286
        tree-based, 285
    least squares polynomial approximations, 376
    left-associativity, 207
    LIA-2, 25
    Lindemann's theorem, 429
    Liouville's theorem, 424
    little-endian, 61
    LLL algorithm, 442, 446, 524
    logarithmic distribution, 29
    logB, 98, 101
    look-up table, 279, 287, 311
    LOP (leading one predictor), see leading-zero anticipation
    LSB (least significant bit), 290
    LUT, see look-up table
    LZA, see leading-zero anticipation
    LZC, see leading-zero counter
    MACHAR, 111
    machine epsilon, 39
    Malcolm, 119
    mantissa, 14
    Markstein, 169
    Mars Climate Orbiter, 8
    <math.h>, 210
    Matula, 40, 41
    minimal polynomial, 421
    minimax polynomial approximations, 376
    minimax rational approximations, 376
    modified Booth recoding, 281
    Møller, 129
    monotonic conversion, 64
    MPCHECK, 116
    MRRE (maximum relative representation error), 29
    MSB (most significant bit), 285
    multipartite table method, 287
    multiplication, 251
    of binary floating-point in hardware, 296
    by a constant, 311
    by a floating-point constant, 312
    by an arbitrary precision constant, 171, 312
    digit by integer, 280
    in decimal, 283
    of integers, 281
    subnormal handling, 252, 301
    multiplicative range reduction, 378
    NaN (Not a Number), 20, 25, 58, 65, 67, 69, 70, 74, 82, 85, 98, 100, 212, 221, 230, 464
    Nesterenko, 431
    Newton-Raphson iteration, 155, 160, 167, 264, 513
    nonadjacent expansion, 504
    noncanonical encodings, 84
    nonoverlapping expansions, 504
    P-nonoverlapping, 504
    S-nonoverlapping, 504
    nonoverlapping floating-point numbers, 504
    P-nonoverlapping, 504
    S-nonoverlapping, 504
    normalized representation, 15
    normal number, 15, 16
    normal range, 23
    norm (computation of), 26, 310
    Ω (largest finite FP number), 16
    orthogonal polynomials, 389
    Oughtred, 4
    output radix conversion, 43
    overflow, 25, 67, 101
    in addition, 248
    parallel adders, 277
    Paranoia, 112, 122
    partial carry save, 277
    payload, 98, 100
    Payne and Hanek reduction algorithm, 381, 382
    pipeline, 273
    pole, 25
    pow function, 216
    precision, 13
    preferred exponent, 83, 240, 249, 253, 259, 265, 267
    preferred width attributes, 93, 95
    prefix tree adders, 278
    Priest, 503
    programming language, 205
    PVS, 472
    quad-word addition, 502
    quad-word renormalization, 500
    quadratic convergence, 156
    quantum, 14, 16, 33
    quantum exponent, 14
    quick step, 395
    quiet NaN, 58, 67, 69, 70, 82, 98, 100, 101, 212, 221
    radix, 13
    radix conversion, 40, 43, 246
    range reduction, 151, 378, 379
    RD (round down), see round toward −∞
    read-only memory, 287
    reconfigurable circuits, see field-programmable gate array
    RecursiveDotProduct algorithm, 185
    RecursiveSum algorithm, 184
    relative backward error, 186
    relative error, 23, 37
    remainder, 63
    Remez algorithm, 376, 391
    reproducibility attributes, 93, 97
    RN, see round to nearest
    ROM, see read-only memory
    round bit, 21, 243
    round digit, 243
    rounding, 241
    a value with unbounded exponent range, 241
    in decimal with binary encoding, 246
    by injection, 298
    division, 241
    in binary, 243
    in decimal, 243
    square root, 241, 472
    rounding breakpoint, 21, 406
    rounding direction attributes, 20, 93, 94
    rounding modes, 20
    roundTiesToAway, 95
    roundTiesToEven, 94, 95
    roundTowardNegative, 94
    roundTowardPositive, 94
    roundTowardZero, 94
    round toward +∞, 20, 52
    round toward −∞, 20, 52
    round toward zero, 21
    round to nearest, 21
    round to nearest even, 21, 62
    RU (round up), see round toward +∞
    Rump, 12
    running error bounds, 181
    RZ, see round toward zero
    Scale-Expansion algorithm, 507
    scaleB, 98
    SETUN computer, 5
    Shewchuk, 126, 129, 503
    shift-and-add algorithms, 375
    shortest vector problem, 525
    signaling NaN, 58, 67, 69, 70, 82, 100, 212, 217, 221
    signed infinities, 20
    signed zeros, 20
    significand, 3, 4, 13, 14, 16
    significand alignment, 248
    single precision, 16, 56-58, 61, 64, 65, 71, 82
    slide rule, 4
    SLZ algorithm, 443
    smallest normal number, 16, 17
    smallest subnormal number, 17, 153
    SoftFloat, 116, 321
    square root, 265
    SRT division, 263, 308
    SRTEST, 116
    SSE2, see SSE
    SSE (Streaming SIMD Extension), 53, 76, 106
    standard model of floating-point arithmetic, 183
    status flag, 66
    Steele and White conversion algorithm, 41, 44
    Sterbenz's lemma, 122
    sticky bit, 21, 243
    strongly nonoverlapping expansion, 505
    subnormal number, 15-17, 58, 122-124, 128, 133, 135
    subnormal range, 23
    subtraction, 246
    SVP, see shortest vector problem
    θ notation, 184
    table-based methods, 287
    Table Maker's Dilemma, 179, 405-407
    tabulated differences, 432
    TestFloat, 116
    three-distance theorem, 437, 438
    tie-breaking rule, 21
    Torres y Quevedo, Leonardo, 4
    trailing significand, 16, 56, 59, 60, 81, 82, 84, 85
    transcendental function, 421
    transcendental number, 421, 429
    trap, 19, 66-69, 74
    trap handler, 66, 68, 69, 74
    Tuckerman test, 169
    two-length configurations, 438
    TwoMul, 135, 318
    TwoMultFMA, 152
    TwoSum, 129, 318
    UCBTest, 116
    ulp (unit in the last place), 14, 32, 37, 43, 169, 382
    Goldberg definition, 33
    Harrison definition, 32
    underflow, 18, 25, 68, 102
    in addition, 248
    unit roundoff, 25, 39, 183
    unordered, 65
    unsigned infinity, 20
    value-changing optimization attributes, 93, 96
    Veltkamp splitting, 132, 133
    VLIW, 328
    VLSI (very large-scale integration), 270, 271, 287
    Waldschmidt, 431
    weight function, 376
    Weil height, 430
    worst cases for the TMD, 409
    write-read cycle, 41
    YBC 7289, 4
    Z3 computer, 4, 20
    zero
    in the binary interchange formats, 82
    in the decimal interchange formats, 85
    Ziv, 407
    Zuse, 4, 20