Notes from Andrei at code::dive 2015

Watch the whole lecture (recommended).

Integers
- Prefer 32-bit ints to other sizes
- 32 bits is the sweet spot, as a 64-bit ALU can handle two calculations
- 8- and 16-bit computations are converted to 32 bits, so don’t use smaller sizes for arithmetic
- Use small ints in arrays
- Prefer unsigned to signed: unsigned is faster, except when converting to floating point
- Most numbers are small: if you find optimisations that work with small numbers, use them

Floating point
- Double and single precision run at equivalent speed
- 80-bit extended precision is only slightly slower
- But don’t mix them (because of the conversions)
- Converting ints to floating point is cheap
- Converting floating point to any integral type is expensive

Strength reduction
- Use minimum-strength operations when optimising, as the stronger ones are more costly.
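The strength-reduction and "most numbers are small" notes above can be illustrated with a digit-counting routine. This is a minimal sketch, not code taken from the lecture: the function names `digits10_naive` and `digits10_fast` are hypothetical, and whether the second version is actually faster depends on the compiler and the hardware, so measure before relying on it.

```cpp
#include <cassert>
#include <cstdint>

// Baseline: one integer division (a "strong" operation) per decimal digit.
inline std::uint32_t digits10_naive(std::uint64_t v) {
    std::uint32_t result = 1;
    while (v >= 10) {
        v /= 10;
        ++result;
    }
    return result;
}

// Reduced strength: cheap comparisons resolve up to four digits per
// iteration, so only one division is needed per four digits.
// Small values return after a few comparisons and no division at all.
inline std::uint32_t digits10_fast(std::uint64_t v) {
    std::uint32_t result = 1;
    for (;;) {
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
    }
}
```

Both functions return the same count for every input; the second one trades divisions for comparisons and exits early for small values.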


Caches are small amounts of unusually fast memory: the data cache (D$), the instruction cache (I$) and the translation lookaside buffer (TLB). Things to keep in mind: cache misses, speculative prefetching, whether the data fits in cache (small is fast), no time/space trade-off at the hardware level, locality counts (stay in the cache), and predictability helps. A modern multi-core machine will have a multi-level cache hierarchy, where the faster and smaller caches belong to individual processors. When one processor modifies a value in its cache, other processors cannot use the old value any more.
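The "locality counts" point above can be sketched with two traversals of the same matrix. This is an illustrative example only: the `Matrix` type and both function names are hypothetical. Both functions compute the same sum, but the row-major walk touches memory sequentially while the column-major walk jumps `cols` elements at a time, which tends to cause more cache misses on large matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// rows x cols matrix stored contiguously in row-major order,
// with every element initialised to 1.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 1) {}

    int at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Cache-friendly: the inner loop walks adjacent elements.
inline long long sum_row_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            total += m.at(r, c);
    return total;
}

// Cache-unfriendly: the inner loop strides by a whole row each step.
inline long long sum_col_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            total += m.at(r, c);
    return total;
}
```

Timing the two functions on a matrix much larger than the last-level cache is a simple way to see the effect; the size of the gap varies by machine.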


Multi-threading concepts are important: e.g., atomics, locks, issues with different designs, and how to make things thread safe. Cache locality is another huge thing these days. Asynchronous architectures and callbacks are what you will be dealing with every day. What is cache locality? How do multicore systems ensure their caches are in sync? How do you get around this problem? Why are signals slow and why is context switching bad? What exactly happens during a context switch?
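As a small illustration of the atomics and locks mentioned above, here is a sketch of two ways to make a shared counter thread safe. The function names are hypothetical, and the example assumes a C++11 or later toolchain with thread support (some platforms need `-pthread` when linking).

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Each thread increments a shared atomic counter; no lock is needed.
inline long count_with_atomic(int threads, int per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();  // join() makes all increments visible
    return counter.load();
}

// Each thread takes a mutex around the increment of a plain long.
inline long count_with_mutex(int threads, int per_thread) {
    long counter = 0;
    std::mutex guard;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, &guard, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(guard);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Without the atomic or the mutex, the increments would race and the final count would be unpredictable. Both versions still make every thread contend for the same cache line, which is one reason shared counters scale poorly.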


Kinds of parallelism
- bit level
- instruction level (ILP)
- data (DLP/SIMD)
- task parallelism (TLP/MIMD)

See YouTube/MIT - parallel processing.

Examples
- Distributed processing over networks
- Multiple CPUs
- Multiple cores
- Pipelines (deeper and wider pipeline = more control hazards)
- ILP: instruction-level parallelism (at best x2 speed up)
- MLP: memory-level parallelism is a term in computer architecture referring to the ability to have multiple memory operations pending at the same time, in particular cache misses or translation lookaside buffer (TLB) misses
- Loop unrolling
- Out-of-order (OoO) execution of multiple instructions simultaneously
- Single Instruction, Multiple Data (SIMD) operations in vector registers
- Multiple CPU cores on the same chip
- Speculative execution
- Branch prediction versus branch target prediction
- SSE and AVX
- Moore’s law hits the roof
- OpenMP
- C++ AMP: Accelerated Massive Parallelism
- Pluralsight: High-performance Computing in C++
- SMOP (small matter of programming): multiple cores are the way we’re heading, working out how to use them is the difficult part
- Vector processing: think about it like explicitly managing giant cache lines
- GPGPU
- Advanced Vector Extensions (AVX): xmm, ymm, zmm

Amdahl’s law
Amdahl’s law shows that the maximum speed-up achievable by parallelising a pipeline is limited by the proportion of the work that can be done in parallel.
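Amdahl’s law can be written as speed-up = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The sketch below puts that formula into code; the function names are hypothetical and the model ignores communication and synchronisation overhead.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: speed-up on n workers when a fraction p (0 <= p <= 1)
// of the work can run in parallel and the rest stays serial.
inline double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Upper bound as n grows without limit: 1 / (1 - p). Requires p < 1.
inline double amdahl_limit(double p) {
    return 1.0 / (1.0 - p);
}
```

For example, with 90% of the work parallelisable, 10 workers give a speed-up of about 5.26, and no number of workers can push it past 10.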