Watch the whole lecture (recommended).

Integrals

Prefer 32-bit ints to other sizes

  • 32-bits is the sweet spot as 64-bit ALU can handle two calculations
  • 8, 16-bit computations convert to 32-bits so don’t use smaller
  • Use small ints in arrays

Prefer unsigned to signed

  • Unsigned is faster
  • Except when converting to floating point

Most numbers are small

  • If you find optimisations that work with small numbers, use them

Floating point

  • Double and single precision equivalent speed
  • 80-bit extended only slightly slower
  • But don’t mix them (because conversions)
  • Ints to float point cheap
  • Floating point to any integral is expensive

Strength reduction

Use minimum strength operations when optimising as the stronger ones are more costly.

Speed hierarchy

operation cycles
Comparisons <1
(u)int add, subtract, bitops, shift (2 x 32-bits) 1
Floating point add, subtract TBC
(u)int32 multiply, floating point multiply TBC
Floating point division, remainder TBC
(u)int division, remainder TBC

Why is it cheaper to do floating point division than integer?

Format of floating point: mantissa and exponent, already friendly to division.

Division in integers is unnatural. Division is a search!

Minimise indirect writes

  • Writes through a pointer: indirection
  • Hard to speculate, hard to enregister (put in a register)
  • Whenever you write it’s really a read and a write
  • Because you transfer a 64-byte cache line: CPU has to read the cache line, update the word and write it back; if you can keep all values for a calculation in one CL all the better (even inject fake writes!)
  • The more dirty cache you have, the more writes you need to do

Improvements

  • Fewer indirect writes
  • Regular access patterns

Checkpoint

  • You can’t improve what you can’t measure (you can’t measure what you don’t measure :))
  • Always choose good baselines
  • Optimise code judiciously for today’s dynamics

The forgotten parallelism

  • Reservation station
  • Instruction-level parallelism on a single core

How do us ALUs? – ILP

  • Pipelining
  • Superscaler execution
  • Out-of-order execution
  • Register renaming
  • Speculative execution
  • … and many more

better instruction level parallelism = fewer data dependencies


Part 2

https://www.youtube.com/watch?v=9tvbz8CSI8M

  • Measure
  • Use good baselines
  • Reduce strength
  • Reduce dependencies
  • Knowing math is knowing optimisation

Unsigned integer overflow has well-defined behaviour; signed integer overflow is undefined.

Part 3

https://www.youtube.com/watch?v=FJJTYQYB1JQ

  • Incorporate conditionals into expressions rather than using ifs: + (size & 1)
  • Always use infinite loops (except most times)