I have often wondered why modern instruction set architectures do not offer more efficient parallel synchronization mechanisms. Mainstream microprocessor designs currently support two types of parallelism:
- Very fine grain
  - Hardware-based, implicit, instruction-level parallelism.
  - Implemented via advanced pipelines with register renaming.
  - Synchronization delays on the order of a single cycle.
- Very coarse grain
  - Software-based, explicit thread synchronization primitives.
  - Implemented via atomic memory instructions.
  - Synchronization delays on the order of thousands of cycles or more.
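To get a feel for what those two delay figures mean, here is a rough back-of-the-envelope sketch. The model is my own simplification, not a measurement: it assumes each chunk of useful work is followed by exactly one synchronization, and the function name `sync_efficiency` is made up for illustration.

```python
def sync_efficiency(work_cycles, sync_cycles):
    """Fraction of cycles spent on useful work, assuming every chunk of
    work_cycles is followed by one synchronization costing sync_cycles."""
    return work_cycles / (work_cycles + sync_cycles)


if __name__ == "__main__":
    # Same 100-cycle chunk of work, two very different synchronization costs.
    for sync in (1, 1000):
        print(f"sync={sync:5d} cycles -> {sync_efficiency(100, sync):.1%} useful work")
```

Under this simple model, a 100-cycle chunk of work keeps about 99% of its cycles useful with a single-cycle synchronization, but only about 9% with a 1000-cycle one. That gap is why thread-level primitives only pay off for large chunks of work.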
With CPU clock frequencies beginning to plateau, it may be time to revisit architectural synchronization models as a way to keep improving overall program performance. If any bright PhD candidates reading this are fishing for a dissertation topic, please consider this one.
I would like to see an efficient, software-visible, instruction-level synchronization mechanism. For example, something like a 'Queue Register'. Some existing I/O registers already track read and write state. I'm thinking some general purpose registers could similarly be architected to manage data flow synchronization at the register data level. A read of such a register would stall the execution pipeline until a write to that register has occurred, so the register effectively acts as a 'data queue'. This would put fine-grained parallelism under software control, potentially opening up more real parallelism than relying on hardware to extract it from an inherently sequential programming model.
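To make the read-stalls-until-written behavior concrete, here is a minimal software model of such a queue register. It is only a sketch under my own assumptions: the register holds a single entry, a write to a register that is still full also stalls (the idea above only specifies the read side), and OS threads stand in for hardware threads. The names `QueueRegister` and `run_pipeline` are hypothetical.

```python
import threading


class QueueRegister:
    """Software model of a hypothetical single-entry queue register."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False  # models a full/empty status bit
        self._value = 0

    def write(self, value):
        # Assumption: writing to a register that is still full stalls
        # until the previous value has been consumed.
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read(self):
        # A read stalls (blocks) until a write has occurred, then
        # marks the register empty again.
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value


def run_pipeline(inputs):
    """Producer squares each input; consumer collects what arrives via q0."""
    q0 = QueueRegister()
    results = []

    def producer():
        for x in inputs:
            q0.write(x * x)

    def consumer():
        for _ in inputs:
            results.append(q0.read())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```

Of course, the condition variable here costs far more than the single-cycle stall the hardware version is meant to provide. The model only shows the ordering semantics, not the performance.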
Since all compute state needs to be visible in order to stop, save, and later restart a process, status bits will also be needed to track the read/write data state of each queue register. CPU pipelines could be redesigned to key off of these explicit register data states, instead of implicit internal hardware states. Just as current hardware threads swap in whichever thread has data ready, these new threads could work the same way. The primary difference is that the data ready state is now architecturally visible to software.
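Here is a rough sketch of what that visible state might look like and how a pipeline could key off it. Everything in it is an assumption for illustration: four queue registers per thread, one full/empty bit per register, a `waiting_on` field naming the register the next instruction reads, and all of the class and function names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueueRegState:
    value: int = 0
    full: bool = False  # architecturally visible status bit


@dataclass
class ThreadContext:
    name: str
    # Assumption: four queue registers per hardware thread.
    qregs: List[QueueRegState] = field(
        default_factory=lambda: [QueueRegState() for _ in range(4)])
    # Index of the queue register the next instruction reads, if any.
    waiting_on: Optional[int] = None


def save_context(ctx):
    """Snapshot everything needed to restart: values AND status bits."""
    return {
        "name": ctx.name,
        "waiting_on": ctx.waiting_on,
        "qregs": [(r.value, r.full) for r in ctx.qregs],
    }


def restore_context(snapshot):
    """Rebuild a thread context from a snapshot made by save_context."""
    return ThreadContext(
        name=snapshot["name"],
        qregs=[QueueRegState(v, f) for v, f in snapshot["qregs"]],
        waiting_on=snapshot["waiting_on"],
    )


def is_ready(ctx):
    """Ready if not waiting on a register, or the awaited register is full."""
    return ctx.waiting_on is None or ctx.qregs[ctx.waiting_on].full


def pick_ready(contexts):
    """Return the first thread that would not stall, or None."""
    for ctx in contexts:
        if is_ready(ctx):
            return ctx
    return None
```

The point of the sketch is that the full/empty bits travel with the saved context, so a restored thread resumes with exactly the same ready or stalled status it had when it was stopped.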
Note further that these hardware queue registers are effectively thread-ready state registers, analogous to the ready state flags in operating system thread schedulers. Since these ready flags are intended for micro-level data parallelism, they should be closely aligned with the real register thread state supported by the hardware, as opposed to some arbitrary virtual state that relies on time slicing and swapping threads in and out of hardware. While time slicing is theoretically possible, it would slow things down by something like 10,000 times, entirely defeating the advantage of micro-level parallelism.
So a different mindset is needed when programming at this level of parallelism. Code written this way should have some awareness of the number of hardware threads the hardware can efficiently support, unlike very coarse grain parallelism, which has little concern for real hardware thread counts. The implication is that this level of coding is more appropriate for hand-coded assembly or for compilers.
Food for thought.