Small amounts of unusually fast memory (Data D$, Instruction I$, Translation Lookaside Buffer TLB), cache misses, speculatively prefetch, does it fit in cache (small is fast), no time/space tradeoff at hardware level, locality counts (stay in the cache), predictability helps.
A modern multi-core machine will have a multi-level cache hierarchy, where the faster and smaller caches belong to individual processors. When one processor modifies a value in its cache, other processors cannot use the old value any more. That memory location will be invalidated in all of the caches. Furthermore, since caches operate on the granularity of cache lines and not individual bytes, the entire cache line will be invalidated in all caches.
Another definition is: “a multiprocessor is cache consistent if all writes to the same memory location are performed in some sequential order”.
- Content-addressable memory
- MSI: modified shared invalid
- Cache interventions
- Exclusive state: issue read, if nobody else has it get write
- Owner state (AMD)
- Atomic transaction bus
- Split transaction bus: everybody observes all the requests at the same time, responses come back some time later
- Non atomicity > transient states
- Scaling cache coherence > directory based coherence
- Premature optimisation
- Assume any pointer indirection is a cache miss
- Smaller working data set is better
- Use as much of a cache entry as you can: don’t read a whole cache line to check one bit
- Doing more work can be faster than doing less
- Sequential memory can be very fast due to prefetching
- Fewer evicted cache lines means more data in hot cache for the rest of the program
- Mispredicted branches can evict cache entries (see speculative execution)
- Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor.
- Tracy profiler
In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.
MESI cache coherence protocol
Each cache line has state, tag, data.
- Modified - write back the data before anybody else is allowed to read main memory
- Exclusive - mark if nobody is sharing, can change with impunity
- Invalid - unused
- Write-invalidate (common)
- Write-update (causes greater bus activity)
Directory based coherence
Used for large CPUs.
write-back vs. write-through
All Intel-compatible CPUs during the last one/two decades used a write-back strategy for caches (which presumes fetching a cache line first to allow partial writes). Of course that’s the theory, reality is slightly more complex than that.
There are three kinds of cache misses: 1. Instruction read miss 1. Data read miss 1. Data write miss
Cache read misses from an instruction cache generally cause the largest delay, because the processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from main memory. Cache read misses from a data cache usually cause a smaller delay, because instructions not dependent on the cache read can be issued and continue execution until the data is returned from main memory, and the dependent instructions can resume execution. Cache write misses to a data cache generally cause the shortest delay, because the write can be queued and there are few limitations on the execution of subsequent instructions: the processor can continue until the queue is full. There is a more detailed introduction to the types of misses here.
- Straight line code (no branches)
- Linear array traversal
- Avoid iteration over heterogeneous sequences with virtual calls (sort first)
- Make “fast paths” - branch-free sequences
- Inline cautiously
- Take advantage of PGO and WPO
Cache related issues
- Memory banks
- Inclusive versus exclusive content
- Dirty cache not a problem for read-only
- False cache line sharing - cache ping pong
Cache performance evaluate
- Why it’s critical
- Why it’s hard
- Direct mapped cache
- N-way set associative cache
- Fully associative cache
Cache oblivious algorithm design
Typically a cache-oblivious algorithm works by a recursive divide and conquer algorithm, where the problem is divided into smaller and smaller sub-problems. Eventually, one reaches a sub-problem size that fits into cache, regardless of the cache size.
- PGO: profile guided optimisation
- WPO: whole program optimisation
Design to get the most value out of every single cacheline that is read. Split apart the functions and data.
Performance of a program is dependent on
time/program = instructions/program _ cycles/instruction _ time/cycle
- Compiler, optimiser
- Cycles per instruction: pipelining
- Time per clock cycle
CPI = CPI_ideal + CPI_stall
- Data hazards
- Control hazards: branches, exceptions
- Memory latency: cache misses
- Speculative execution
- MIT pipelines
- Pipeline stall
- Vector processor
- What every programmer should know about memory
- Be nice to your cache
- False sharing is not fun – Herb Setter
- Gallery of processor cache effects – Igor Ostrovsky
- Writing faster managed code Jan Gray (section: Of Cache misses, page faults and computer architecture)
- Optimising for instruction caches – Amir Kleen
- Memory is not free (more on Vista performance) – Sergey Solynik