I have often wondered why modern computer instruction set architectures do not have more efficient parallel synchronization mechanisms. Mainstream microprocessor designs currently support two types of parallelism:
- Very fine grain
  - Hardware-based, implicit, instruction-level parallelism.
  - Implemented via advanced pipeline register renaming.
  - Synchronization delays on the order of a single cycle.
- Very coarse grain
  - Software-based, explicit thread synchronization primitives.
  - Implemented via atomic memory instructions.
  - Synchronization delays on the order of tens of thousands of cycles or more.
With CPU clock frequencies beginning to plateau, it may be time to revisit architectural synchronization models as a way to keep improving overall program performance. If any bright PhD candidates are reading this and fishing for a dissertation topic, please consider this one.
Parallel Architecture Models
At the process level we have the architectural notion of an interrupt, but nothing equivalent exists at the thread level. We have to rely on threads spinning in a loop, reading and writing a shared memory location together with memory synchronization barriers, with no architectural specification of how long this can take. This is ridiculous. We can't have efficient parallel programming if the programming model has no mechanism to facilitate it. We need some data queue, message passing mechanism, or interrupt that operates at the instruction architecture level if we are to enable efficient parallel programming.
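To make the complaint concrete, here is a rough sketch of the spin-and-poll handoff described above. It is written in Python only to show the shape of the loop; Python does not expose memory barriers, so the comments merely mark where they would sit in C or assembly. The names `Mailbox`, `producer`, `consumer`, and `handoff` are made up for this illustration.

```python
import threading


class Mailbox:
    """A shared memory location plus a ready flag, polled in software."""

    def __init__(self):
        self.ready = False
        self.payload = None


def producer(box, value):
    box.payload = value  # write the data first
    # In C or assembly, a release barrier would go here so the payload
    # becomes visible before the flag does.
    box.ready = True


def consumer(box, result):
    spins = 0
    while not box.ready:  # busy-wait: nothing bounds how long this takes
        spins += 1
    # In C or assembly, an acquire barrier would go here.
    result["value"] = box.payload
    result["spins"] = spins


def handoff(value):
    """Pass one value from a producer thread to a spinning consumer thread."""
    box = Mailbox()
    result = {}
    t_consumer = threading.Thread(target=consumer, args=(box, result))
    t_producer = threading.Thread(target=producer, args=(box, value))
    t_consumer.start()
    t_producer.start()
    t_consumer.join()
    t_producer.join()
    return result
```

Calling `handoff(42)` returns the value together with a spin count that varies from run to run, which is exactly the problem: the wait is real work, and nothing in the architecture says how long it lasts.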
Explicit Instruction Level Parallelism
I would like to see an efficient software visible instruction level synchronization mechanism. For example, something like a 'Queue Register'. Some existing IO registers track read and write state. I'm thinking some general purpose registers could similarly be architected for managing data flow synchronization at the register data level. Such registers could essentially stall the execution pipeline on reads until a write to that register has occurred. So the register effectively acts as a 'data queue' at the instruction execution level. This would enable software control of fine grain parallelism, opening up potentially more real parallelism than relying on hardware to extract parallelism from an inherently sequential programming model.
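The behavior I have in mind can be sketched as a toy cycle-by-cycle model. Everything here is hypothetical: the `QueueRegister` class, the single-entry depth, the rule that a write to a still-full register also stalls, and the assumption that a value written in a cycle can be read in that same cycle. No existing ISA is being described.

```python
class QueueRegister:
    """Toy model of one hypothetical queue register with a full/empty bit."""

    def __init__(self):
        self.full = False
        self.value = None

    def try_write(self, value):
        """Return True if written; False means the writer must stall (assumed rule)."""
        if self.full:
            return False
        self.value = value
        self.full = True
        return True

    def try_read(self):
        """Return (True, value) and mark empty, or (False, None) if the reader must stall."""
        if not self.full:
            return False, None
        self.full = False
        return True, self.value


def run(values, produce_latency=3):
    """Simulate a producer feeding a consumer through one queue register."""
    q = QueueRegister()
    pending = list(values)
    received = []
    stalls = 0
    countdown = produce_latency
    while len(received) < len(values):
        # Producer: needs produce_latency cycles of work per value, then writes.
        if pending:
            if countdown > 0:
                countdown -= 1
            if countdown == 0 and q.try_write(pending[0]):
                pending.pop(0)
                countdown = produce_latency
        # Consumer: attempts one read per cycle and stalls while the register is empty.
        ok, v = q.try_read()
        if ok:
            received.append(v)
        else:
            stalls += 1
    return received, stalls
```

With `run([1, 2, 3])` the consumer receives the values in order and spends its waiting cycles stalled in the pipeline rather than spinning on memory, so the cost of synchronization is a handful of cycles instead of a trip through the cache hierarchy.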
Since all compute state needs to be visible in order to stop, save, and later restart a process, status bits will also need to track the read/write data state of each queue register. CPU pipelines could be redesigned to key off of these explicit register data states, instead of implicit internal hardware states. Just like current hardware threads swap in whichever thread has data ready, these new threads could work the same way. The primary difference is that the data ready state is now architecturally visible to software.
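A minimal sketch of those two requirements, saving the status bits along with the data and selecting a thread by its data ready state, might look like the following. The names `ThreadContext`, `NUM_QREGS`, `waiting_on`, and `pick_ready` are invented for this example, and a real design would certainly differ in the details.

```python
from dataclasses import dataclass, field

NUM_QREGS = 4  # hypothetical number of queue registers per hardware thread


@dataclass
class ThreadContext:
    name: str
    waiting_on: int  # index of the queue register the next instruction reads
    values: list = field(default_factory=lambda: [0] * NUM_QREGS)
    full: list = field(default_factory=lambda: [False] * NUM_QREGS)  # status bits

    def write(self, idx, value):
        """Deposit a value and set that register's full bit."""
        self.values[idx] = value
        self.full[idx] = True

    def save(self):
        """Snapshot everything needed to restart later, status bits included."""
        return {
            "name": self.name,
            "waiting_on": self.waiting_on,
            "values": list(self.values),
            "full": list(self.full),
        }

    @classmethod
    def restore(cls, snap):
        """Rebuild a context from a snapshot produced by save()."""
        return cls(snap["name"], snap["waiting_on"],
                   list(snap["values"]), list(snap["full"]))


def pick_ready(contexts):
    """Return the first context whose awaited register is full, else None."""
    for ctx in contexts:
        if ctx.full[ctx.waiting_on]:
            return ctx
    return None
```

Because the full bits travel with the saved context, a thread that was ready when it was stopped is still ready after `restore`, and `pick_ready` needs nothing beyond the architecturally visible state to make its choice.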
Further note that these hardware queue registers are effectively thread ready-state registers, analogous to the ready-state flags in operating system thread schedulers. Since these ready flags are intended for micro data level parallelism, they should be closely aligned with the real register thread state supported by the hardware, as opposed to some arbitrary virtual state that relies on time slicing and swapping threads in and out of hardware. While time slicing is theoretically possible, it would slow things down by something like 10,000 times, entirely defeating the advantage of micro level parallelism.
So there is a different mindset when programming at this level of parallelism. Code written this way should have some awareness of the number of hardware threads the hardware efficiently supports, unlike very coarse grain parallelism, which has little concern for real hardware thread counts. The implication is that this level of coding is more appropriate for hand-coded assembly or for compilers.