- High performance computing, e.g., image processing, climate modeling, physics simulation, gaming, facial recognition
- Consumer electronics
- Mobile devices
Technology improvements, combined with power and clock-frequency constraints, have driven greater parallelism as the way to keep accelerating growth in computer performance. That continued performance growth places increasing stress on memory, and DRAM technology is not projected to improve fast enough to meet future demands because of manufacturing and data-movement constraints. Projections also indicate that the number of chip pins, which directly determines available bandwidth to off-chip memory, increases by only 10% per year, while processing capacity doubles every 18 months.
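The size of the resulting gap can be illustrated with a quick calculation using the growth rates quoted above (the six-year horizon is an illustrative assumption):

```python
# Compare off-chip bandwidth growth (~10% per year, pin-limited) against
# processing-capacity growth (doubling every 18 months) over six years.
years = 6
bandwidth_growth = 1.10 ** years           # +10% per year
compute_growth = 2.0 ** (years * 12 / 18)  # 2x every 18 months
gap = compute_growth / bandwidth_growth
print(f"Bandwidth: {bandwidth_growth:.2f}x, Compute: {compute_growth:.2f}x, "
      f"Gap: {gap:.2f}x")
```

At these rates, compute capacity grows 16x over six years while pin-limited bandwidth grows only about 1.77x, leaving a roughly 9x shortfall.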
Previous prefetchers correlate poorly because they are generally oblivious to which computation unit produced each memory access, and so cannot correlate access patterns across units.
Researchers at Berkeley Lab have developed a purely hardware last-level collective prefetcher (LLCP) to address the constraints of DRAM performance and power for bulk-synchronous data-parallel applications that are key drivers for multi-core, e.g., image processing, climate modeling, physics simulation, gaming, facial recognition, and many others.
The Berkeley Lab LLCP exploits the highly correlated prefetch patterns of data-parallel algorithms not recognized by a prefetcher oblivious to data parallelism. LLCP generates prefetches on behalf of multiple cores in memory address order to maximize DRAM efficiency and bandwidth. The technology can prefetch from multiple memory pages without expensive translations.
This technology uses one computation unit’s access patterns to detect whether the application currently running allows the access patterns of other computation units to be predicted. When it does, LLCP correlates those access patterns together and can prefetch from larger memory regions than previous prefetchers.
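The correlation idea can be sketched in a few lines of Python (a simplified software model of the concept, not the hardware design; the fixed stride detection and per-core partition offset are illustrative assumptions):

```python
def collective_prefetch(observed, num_cores, partition_stride, depth=4):
    """Given one core's observed strided accesses, predict the next `depth`
    addresses for every core, assuming each core works on a data partition
    offset by `partition_stride` (illustrative assumption). Prefetches are
    emitted in ascending address order to favor DRAM row-buffer hits."""
    stride = observed[1] - observed[0]         # detected access stride
    last = observed[-1]
    prefetches = []
    for core in range(num_cores):
        base = last + core * partition_stride  # shift pattern to each core's region
        prefetches += [base + i * stride for i in range(1, depth + 1)]
    return sorted(prefetches)                  # issue in memory address order
```

For example, `collective_prefetch([0, 64, 128], num_cores=2, partition_stride=4096, depth=2)` returns `[192, 256, 4288, 4352]`: the observed core's next two cache lines plus the corresponding lines in the second core's partition, ordered by address.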
This invention has demonstrated better application performance, lower memory access energy, and faster memory access time. Compared to other prefetchers, LLCP improves execution time by 5.5% on average (10% maximum), increases DRAM bandwidth by 9% to 18%, decreases DRAM rank energy by 6%, produces 27% more timely prefetches, and increases coverage by a minimum of 25%.
- Maximizes DRAM efficiency and bandwidth
- Reduces power consumption for computing and mobile devices
LBL PRINCIPAL INVESTIGATOR: John Shalf
DEVELOPMENT STAGE: Proven principle. See researcher test results in the IEEE Computer Society publication linked below.
FOR MORE INFORMATION:
Michelogiannakis, G., and Shalf, J., “Last level collective hardware prefetching for data-parallel applications,” IEEE International Conference on High Performance Computing (HiPC), 2017. DOI: 10.1109/HiPC.2017.00018
STATUS: Patent pending. Available for licensing or collaborative research.
RELATED INVENTIONS: Collective Memory Transfers for Multi-Core Processors 2013-086