Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241 Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 in /web/Sites/BlickinsBuch.de/functions.php on line 241
Autoren:
Verlag:
Springer Basel AG Weitere Titel dieses Verlages anzeigen
Preface | xv | |
List of Figures | xvii | |
List of Tables | xxi | |
I Introduction, Basic Definitions, and Standards | 1 | |
1 | Introduction | 3 |
1.1 | Some History | 3 |
1.2 | Desirable Properties | 6 |
1.3 | Some Strange Behaviors | 7 |
1.3.1 | Some famous bugs | 7 |
1.3.2 | Difficult problems | 8 |
2 | Definitions and Basic Notions | 13 |
2.1 | Floating-Point Numbers | 13 |
2.2 | Rounding | 20 |
2.2.1 | Rounding modes | 20 |
2.2.2 | Useful properties | 22 |
2.2.3 | Relative error due to rounding | 23 |
2.3 | Exceptions | 25 |
2.4 Lost or Preserved Properties of the Arithmetic on the Real | ||
Numbers | 27 | |
2.5 | Note on the Choice of the Radix | 29 |
2.5.1 | Representation errors | 29 |
2.5.2 | A case for radix 10 | 30 |
2.6 | Tools for Manipulating Floating-Point Errors | 32 |
2.6.1 | The ulp function | 32 |
2.6.2 | Errors in ulps and relative errors | 37 |
2.6.3 | An example: iterated products | 37 |
2.6.4 | Unit roundoff | 39 |
2.7 | Note on Radix Conversion | 40 |
2.7.1 | Conditions on the formats | 40 |
2.7.2 | Conversion algorithms | 43 |
2.8 | The Fused Multiply-Add (FMA) Instruction | 51 |
2.9 | Interval Arithmetic | 51 |
2.9.1 | Intervals with floating-point bounds | 52 |
2.9.2 | Optimized rounding | 52 |
3 | Floating-Point Formats and Environment | 55 |
3.1 | The IEEE 754-1985 Standard | 56 |
3.1.1 | Formats specified by IEEE 754-1985 | 56 |
3.1.2 | Little-endian, big-endian | 60 |
3.1.3 | Rounding modes specified by IEEE 754-1985 | 61 |
3.1.4 | Operations specified by IEEE 754-1985 | 62 |
3.1.5 | Exceptions specified by IEEE 754-1985 | 66 |
3.1.6 | Special values | 69 |
3.2 | The IEEE 854-1987 Standard | 70 |
3.2.1 | Constraints internal to a format | 70 |
3.2.2 | Various formats and the constraints between them | 71 |
3.2.3 Conversions between floating-point numbers and | ||
decimal strings | 72 | |
3.2.4 | Rounding | 73 |
3.2.5 | Operations | 73 |
3.2.6 | Comparisons | 74 |
3.2.7 | Exceptions | 74 |
3.3 | The Need for a Revision | 74 |
3.3.1 | A typical problem: "double rounding" | 75 |
3.3.2 | Various ambiguities | 77 |
3.4 | The New IEEE 754-2008 Standard | 79 |
3.4.1 | Formats specified by the revised standard | 80 |
3.4.2 | Binary interchange format encodings | 81 |
3.4.3 | Decimal interchange format encodings | 82 |
3.4.4 | Larger formats | 92 |
3.4.5 | Extended and extendable precisions | 92 |
3.4.6 | Attributes | 93 |
3.4.7 | Operations specified by the standard | 97 |
3.4.8 | Comparisons | 99 |
3.4.9 | Conversions | 99 |
3.4.10 | Default exception handling | 100 |
3.4.11 | Recommended transcendental functions | 103 |
3.5 | Floating-Point Hardware in Current Processors | 104 |
3.5.1 | The common hardware denominator | 104 |
3.5.2 | Fused multiply-add | 104 |
3.5.3 | Extended precision | 104 |
3.5.4 | Rounding and precision control | 105 |
3.5.5 | SIMD instructions | 106 |
3.5.6 | Floating-point on x86 processors: SSE2 versus x87 | 106 |
3.5.7 | Decimal arithmetic | 107 |
3.6 | Floating-Point Hardware in Recent Graphics Processing Units | 108 |
3.7 | Relations with Programming Languages | 109 |
3.7.1 | The Language Independent Arithmetic (LIA) standard | 109 |
3.7.2 | Programming languages | 110 |
3.8 | Checking the Environment | 110 |
3.8.1 MACHAR....................................................Ill | ||
3.8.2 Paranoia......................................................Ill | ||
3.8.3 | UCBTest | 115 |
3.8.4 | TestFloat | 116 |
3.8.5 | IeeeCC754 | 116 |
3.8.6 | Miscellaneous | 116 |
II Cleverly Using Floating-Point Arithmetic | 117 | |
4 | Basic Properties and Algorithms | 119 |
4.1 | Testing the Computational Environment | 119 |
4.1.1 | Computing the radix | 119 |
4.1.2 | Computing the precision | 121 |
4.2 | Exact Operations | 122 |
4.2.1 | Exact addition | 122 |
4.2.2 | Exact multiplications and divisions | 124 |
4.3 | Accurate Computations of Sums of Two Numbers | 125 |
4.3.1 | The Fast2Sum algorithm | 126 |
4.3.2 | The 2Sum algorithm | 129 |
4.3.3 | If we do not use rounding to nearest | 131 |
4.4 | Computation of Products | 132 |
4.4.1 | Veltkamp splitting | 132 |
4.4.2 | Dekker's multiplication algorithm | 135 |
4.5 | Complex numbers | 139 |
4.5.1 | Various error bounds | 140 |
4.5.2 | Error bound for complex multiplication | 141 |
4.5.3 | Complex division | 144 |
4.5.4 | Complex square root | 149 |
5 | The Fused Multiply-Add Instruction | 151 |
5.1 | The 2MultFMA Algorithm | 152 |
5.2 | Computation of Residuals of Division and Square Root | 153 |
5.3 | Newton-Raphson-Based Division with an FMA | 155 |
5.3.1 | Variants of the Newton-Raphson iteration | 155 |
5.3.2 Using the Newton-Raphson iteration for correctly | ||
rounded division | 160 | |
5.4 | Newton-Raphson-Based Square Root with an FMA | 167 |
5.4.1 | The basic iterations | 167 |
5.4.2 Using the Newton-Raphson iteration for correctly | ||
rounded square roots | 168 | |
5.5 | Multiplication by an Arbitrary-Precision Constant | 171 |
5.5.1 Checking for a given constant C if Algorithm 5.2 will | ||
always work | 172 | |
5.6 | Evaluation of the Error of an FMA | 175 |
5.7 | Evaluation of Integer Powers | 177 |
6 Enhanced Floating-Point Sums, Dot Products, and Polynomial | ||
Values | 181 | |
6.1 | Preliminaries | 182 |
6.1.1 | Floating-point arithmetic models | 183 |
6.1.2 | Notation for error analysis and classical error estimates | 184 |
6.1.3 | Properties for deriving running error bounds | 187 |
6.2 | Computing Validated Running Error Bounds | 188 |
6.3 | Computing Sums More Accurately | 190 |
6.3.1 | Reordering the operands, and a bit more | 190 |
6.3.2 | Compensated sums | 192 |
6.3.3 | Implementing a "long accumulator" | 199 |
6.3.4 | On the sum of three floating-point numbers | 199 |
6.4 | Compensated Dot Products | 201 |
6.5 | Compensated Polynomial Evaluation | 203 |
7 | Languages and Compilers | 205 |
7.1 | A Play with Many Actors | 205 |
7.1.1 | Floating-point evaluation in programming languages | 206 |
7.1.2 | Processors, compilers, and operating systems | 208 |
7.1.3 | In the hands of the programmer | 209 |
7.2 | Floating Point in the C Language | 209 |
7.2.1 | Standard C99 headers and IEEE 754-1985 support | 209 |
7.2.2 | Types | 210 |
7.2.3 | Expression evaluation | 213 |
7.2.4 | Code transformations | 216 |
7.2.5 | Enabling unsafe optimizations | 217 |
7.2.6 | Summary: a few horror stories | 218 |
7.3 | Floating-Point Arithmetic in the C++ Language | 220 |
7.3.1 | Semantics | 220 |
7.3.2 | Numeric limits | 221 |
7.3.3 | Overloaded functions | 222 |
7.4 | FORTRAN Floating Point in a Nutshell | 223 |
7.4.1 | Philosophy | 223 |
7.4.2 | IEEE 754 support in FORTRAN | 226 |
7.5 | Java Floating Point in a Nutshell | 227 |
7.5.1 | Philosophy | 227 |
7.5.2 | Types and classes | 228 |
7.5.3 | Infinities, NaNs, and signed zeros | 230 |
7.5.4 | Missing features | 231 |
7.5.5 | Reproducibility | 232 |
7.5.6 | The BigDecimal package | 233 |
7.6 | Conclusion | 234 |
III Implementing Floating-Point Operators | 237 | |
8 | Algorithms for the Five Basic Operations | 239 |
8.1 | Overview of Basic Operation Implementation | 239 |
8.2 | Implementing IEEE 754-2008 Rounding | 241 |
8.2.1 Rounding a nonzero finite value with unbounded | ||
exponent range | 241 | |
8.2.2 | Overflow | 243 |
8.2.3 | Underflow and subnormal results | 244 |
8.2.4 | The inexact exception | 245 |
8.2.5 | Rounding for actual operations | 245 |
8.3 | Floating-Point Addition and Subtraction | 246 |
8.3.1 | Decimal addition | 249 |
8.3.2 | Decimal addition using binary encoding | 250 |
8.3.3 | Subnormal inputs and outputs in binary addition | 251 |
8.4 | Floating-Point Multiplication | 251 |
8.4.1 | Normal case | 252 |
8.4.2 | Handling subnormal numbers in binary multiplication | 252 |
8.4.3 | Decimal specifics | 253 |
8.5 | Floating-Point Fused Multiply-Add | 254 |
8.5.1 | Case analysis for normal inputs | 254 |
8.5.2 | Handling subnormal inputs | 258 |
8.5.3 | Handling decimal cohorts | 259 |
8.5.4 | Overview of a binary FMA implementation | 259 |
8.6 | Floating-Point Division | 262 |
8.6.1 | Overview and special cases | 262 |
8.6.2 | Computing the significand quotient | 263 |
8.6.3 | Managing subnormal numbers | 264 |
8.6.4 | The inexact exception | 265 |
8.6.5 | Decimal specifics | 265 |
8.7 | Floating-Point Square Root | 265 |
8.7.1 | Overview and special cases | 265 |
8.7.2 | Computing the significand square root | 266 |
8.7.3 | Managing subnormal numbers | 267 |
8.7.4 | The inexact exception | 267 |
8.7.5 | Decimal specifics | 267 |
9 | Hardware Implementation of Floating-Point Arithmetic | 269 |
9.1 | Introduction and Context | 269 |
9.1.1 | Processor internal formats | 269 |
9.1.2 | Hardware handling of subnormal numbers | 270 |
9.1.3 | Full-custom VLSI versus reconfigurable circuits | 271 |
9.1.4 | Hardware decimal arithmetic | 272 |
9.1.5 | Pipelining | 273 |
9.2 | The Primitives and Their Cost | 274 |
9.2.1 | Integer adders | 274 |
9.2.2 | Digit-by-integer multiplication in hardware | 280 |
9.2.3 | Using nonstandard representations of numbers | 280 |
9.2.4 | Binary integer multiplication | 281 |
9.2.5 | Decimal integer multiplication | 283 |
9.2.6 | Shifters | 284 |
9.2.7 | Leading-zero counters | 284 |
9.2.8 Tables and table-based methods for fixed-point | ||
function approximation | 286 | |
9.3 | Binary Floating-Point Addition | 288 |
9.3.1 | Overview | 288 |
9.3.2 | A first dual-path architecture | 289 |
9.3.3 | Leading-zero anticipation | 291 |
9.3.4 | Probing further on floating-point adders | 295 |
9.4 | Binary Floating-Point Multiplication | 296 |
9.4.1 | Basic architecture | 296 |
9.4.2 | FPGA implementation | 296 |
9.4.3 | VLSI implementation optimized for delay | 298 |
9.4.4 | Managing subnormals | 301 |
9.5 | Binary Fused Multiply-Add | 302 |
9.5.1 | Classic architecture | 303 |
9.5.2 | To probe further | 305 |
9.6 | Division | 305 |
9.6.1 | Digit-recurrence division | 306 |
9.6.2 | Decimal division | 309 |
9.7 | Conclusion: Beyond the FPU | 309 |
9.7.1 | Optimization in context of standard operators | 310 |
9.7.2 | Operation with a constant operand | 311 |
9.7.3 | Block floating point | 313 |
9.7.4 | Specific architectures for accumulation | 313 |
9.7.5 | Coarser-grain operators | 317 |
9.8 | Probing Further | 320 |
10 | Software Implementation of Floating-Point Arithmetic | 321 |
10.1 | Implementation Context | 322 |
10.1.1 | Standard encoding of binary floating-point data | 322 |
10.1.2 | Available integer operators | 323 |
10.1.3 | First examples | 326 |
10.1.4 | Design choices and optimizations | 328 |
10.2 | Binary Floating-Point Addition | 329 |
10.2.1 | Flandling special values | 330 |
10.2.2 | Computing the sign of the result | 332 |
10.2.3 Swapping the operands and computing the alignment | ||
shift | 333 | |
10.2.4 | Getting the correctly rounded result | 335 |
10.3 | Binary Floating-Point Multiplication | 341 |
10.3.1 | Handling special values | 341 |
10.3.2 | Sign and exponent computation | 343 |
10.3.3 | Overflow detection | 345 |
10.3.4 | Getting the correctly rounded result | 346 |
10.4 | Binary Floating-Point Division | 349 |
10.4.1 | Handling special values | 350 |
10.4.2 | Sign and exponent computation | 351 |
10.4.3 | Overflow detection | 354 |
10.4.4 | Getting the correctly rounded result | 355 |
10.5 | Binary Floating-Point Square Root | 361 |
10.5.1 | Handling special values | 362 |
10.5.2 | Exponent computation | 364 |
10.5.3 | Getting the correctly rounded result | 365 |
IV Elementary Functions | 373 | |
11 | Evaluating Floating-Point Elementary Functions | 375 |
11.1 | Basic Range Reduction Algorithms | 379 |
11.1.1 | Cody and Waite's reduction algorithm | 379 |
11.1.2 | Payne and Hanek's algorithm | 381 |
11.2 | Bounding the Relative Error of Range Reduction | 382 |
11.3 | More Sophisticated Range Reduction Algorithms | 384 |
11.3.1 An example of range reduction for the exponential | ||
function | 386 | |
11.3.2 | An example of range reduction for the logarithm | 387 |
11.4 | Polynomial or Rational Approximations | 388 |
11.4.1 | I? case | 389 |
11.4.2 | L00, or minimax case | 390 |
11.4.3 | "Truncated" approximations | 392 |
11.5 | Evaluating Polynomials | 393 |
11.6 | Correct Rounding of Elementary Functions to binary64 | 394 |
11.6.1 The Table Maker's Dilemma and Ziv's onion peeling | ||
strategy | 394 | |
11.6.2 | When the TMD is solved | 395 |
11.6.3 | Rounding test | 396 |
11.6.4 | Accurate second step | 400 |
11.6.5 | Error analysis and the accuracy/performance tradeoff | 401 |
11.7 | Computing Error Bounds | 402 |
11.7.1 | The point with efficient code | 402 |
11.7.2 | Example: a "double-double" polynomial evaluation | 403 |
12 | Solving the Table Maker's Dilemma | 405 |
12.1 | Introduction | 405 |
12.1.1 | The Table Maker's Dilemma | 406 |
12.1.2 | Brief history of the TMD | 410 |
12.1.3 | Organization of the chapter | 411 |
12.2 | Preliminary Remarks on the Table Maker's Dilemma | 412 |
12.2.1 | Statistical arguments: what can be expected in practice | 412 |
12.2.2 | In some domains, there is no need to find worst cases | 416 |
12.2.3 Deducing the worst cases from other functions or | ||
domains | 419 | |
12.3 | The Table Maker's Dilemma for Algebraic Functions | 420 |
12.3.1 | Algebraic and transcendental numbers and functions | 420 |
12.3.2 | The elementary case of quotients | 422 |
12.3.3 | Around Liouville's theorem | 424 |
12.3.4 Generating bad rounding cases for the square root | ||
using Hensel 2-adic lifting | 425 | |
12.4 | Solving the Table Maker's Dilemma for Arbitrary Functions | 429 |
12.4.1 Lindemann's theorem: application to some | ||
transcendental functions | 429 | |
12.4.2 | A theorem of Nesterenko and Waldschmidt | 430 |
12.4.3 | A first method: tabulated differences | 432 |
12.4.4 From the TMD to the distance between a grid and a | ||
segment | 434 | |
12.4.5 | Linear approximation: Lefevre's algorithm | 436 |
12.4.6 | The SLZ algorithm | 443 |
12.4.7 | Periodic functions on large arguments | 448 |
12.5 | Some Results | 449 |
12.5.1 Worst cases for the exponential, logarithmic, | ||
trigonometric, and hyperbolic functions | 449 | |
12.5.2 | A special case: integer powers | 458 |
12.6 | Current Limits and Perspectives | 458 |
V Extensions | 461 | |
13 | Formalisms for Certifying Floating-Point Algorithms | 463 |
13.1 | Formalizing Floating-Point Arithmetic | 463 |
13.1.1 | Defining floating-point numbers | 464 |
13.1.2 | Simplifying the definition | 466 |
13.1.3 | Defining rounding operators | 467 |
13.1.4 | Extending the set of numbers | 470 |
13.2 | Formalisms for Certifying Algorithms by Hand | 471 |
13.2.1 | Hardware units | 471 |
13.2.2 | Low-level algorithms | 472 |
13.2.3 | Advanced algorithms | 473 |
13.3 | Automating Proofs | 474 |
13.3.1 | Computing on bounds | 475 |
13.3.2 | Counting digits | 477 |
13.3.3 | Manipulating expressions | 479 |
13.3.4 | Handling the relative error | 483 |
13.4 | Using Gappa | 484 |
13.4.1 | Toy implementation of sine | 484 |
13.4.2 | Integer division on Itanium | 488 |
14 | Extending the Precision | 493 |
14.1 | Double-Words, Triple-Words | 494 |
14.1.1 | Double-word arithmetic | 495 |
14.1.2 | Static triple-word arithmetic | 498 |
14.1.3 | Quad-word arithmetic | 500 |
14.2 | Floating-Point Expansions | 503 |
14.3 | Floating-Point Numbers with Batched Additional Exponent | 509 |
14.4 | Large Precision Relying on Processor Integers | 510 |
14.4.1 Using arbitrary-precision integer arithmetic for | ||
arbitrary-precision floating-point arithmetic | 512 | |
14.4.2 A brief introduction to arbitrary-precision integer | ||
arithmetic | 513 | |
VI Perspectives and Appendix | 517 | |
15 | Conclusion and Perspectives | 519 |
16 | Appendix: Number Theory Tools for Floating-Point Arithmetic | 521 |
16.1 | Continued Fractions | 521 |
16.2 | The LLL Algorithm | 524 |
Bibliography | 529 | |
Index | 567 | |
2Mul, 135,318 2MultFMA, 152 2Sum, 129, 318
accumulator, 314
accurate step, 396
ACL2,471
addition, 246
of binary floating-point in hardware, 288
of binary floating-point in software, 329
of integers, 274
of integers in decimal, 275
of signed zeros, 247
subnormal handling, 251, 294
additive range reduction, 378
Al-Khwarizmi, 167
algebraic function, 421,424
algebraic number, 420
a (smallest normal number), 17,153
alternate exception-handling attributes, 93,95
argument reduction, 378
arithmetic formats, 80
ARRE (average relative representation
error), 30
attributes, 93
Babai's algorithm, 528
Babylonians, 4
backward error, 186
bad cases for the TMD, 409
base, 13
basic formats, 56, 80
BCD (binary coded decimal), 83
Benford's law, 29
bias, 57, 82, 84,85
biased exponent, 58-60, 84-86, 245
big-endian, 61
binade, 415
binary 128, 82
binaryl6, 82
binary32,16, 82
binary64, 82
binding, 109
bipartite table method, 287
block floating-point, 313
Booth recoding, 281
breakpoint, 21, 406
Briggs, 375
Burger and Dybvig conversion algo- rithm, 44
C programming language, 209
C++ programming language, 220
C99, 210
cancellation, 124,193
canonical encoding, 83, 85
carry-ripple adder, 274
carry-select adder, 279
carry-skip adder, 276, 277
catastrophic cancellation, 124, 378
CENA, 182
Chebyshev
polynomials, 389
theorem, 391
dinger conversion algorithms, 46
close path, 249
closest vector problem, 525
Cody, 29
Cody and Waite reduction algorithm, 379
cohort, 14, 83, 97, 240
combination field, 83, 84
comparisons, 65, 99
comparison predicates, 65, 99
CompensatedDotProduct algorithm, 202
compensated algorithms, 182