Endpoint devices for the Internet of Things not only need to work under an extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and performance scalability can be gained through parallelism. In this paper we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multi-core clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure on the shared memory hierarchy. For typical data-intensive sensor processing workloads, the proposed core is on average 3.5x faster and 3.2x more energy-efficient, thanks to a smart L0 buffer that reduces cache access contentions and to support for compressed instructions. SIMD extensions, such as dot products, and a built-in L0 storage further reduce shared memory accesses by 8x, reducing contentions by 3.2x. With four NT-optimized cores, the cluster is operational from 0.6 V to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65 nm bulk CMOS technology. In a low-power 28 nm FDSOI process, a peak efficiency of 193 MOPS/mW (40 MHz, 1 mW) can be achieved.

To improve the performance of SRAM in caches under near-threshold voltages, several timing speculation techniques, such as the cross-sensing SRAM (CS-SRAM), have been proposed. For a given process, voltage, and temperature (PVT) condition, CS-SRAM has an optimal bitline discharging time (TBL) at which it achieves the lowest average access latency. However, existing timing speculation caches do not track variations across PVT conditions to adjust the access timing to the optimal TBL point, at which the system has the lowest average memory access time.
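
The idea of an optimal TBL point can be illustrated with a toy model. This is a sketch under stated assumptions, not the CS-SRAM authors' model: it assumes that a shorter TBL makes each speculative read faster but more likely to be wrong, and that a wrong read costs a fixed correction penalty. The logistic error curve, the `penalty_ns` value, and the `midpoint_ns` parameter (standing in for a shift in PVT condition) are all hypothetical.

```python
import math

def error_rate(tbl_ns, midpoint_ns=1.0, steepness=6.0):
    """Hypothetical probability that a speculative read at tbl_ns is wrong.

    A logistic curve is assumed purely for illustration; real error
    behavior depends on the SRAM design and the PVT condition.
    """
    return 1.0 / (1.0 + math.exp(steepness * (tbl_ns - midpoint_ns)))

def average_access_time(tbl_ns, penalty_ns=3.0, **curve):
    """Expected access time: the read itself plus the expected correction cost."""
    return tbl_ns + error_rate(tbl_ns, **curve) * penalty_ns

def optimal_tbl(candidates_ns, **kwargs):
    """Pick the candidate TBL with the lowest average access time."""
    return min(candidates_ns, key=lambda t: average_access_time(t, **kwargs))

# Candidate TBL settings from 0.5 ns to 2.5 ns in 0.05 ns steps (made-up range).
candidates = [0.5 + 0.05 * i for i in range(41)]

best = optimal_tbl(candidates)
# A different (hypothetical) PVT condition shifts the error curve,
# and with it the optimal TBL.
shifted = optimal_tbl(candidates, midpoint_ns=1.4)

print(f"optimal TBL, nominal condition: {best:.2f} ns")
print(f"optimal TBL, shifted condition: {shifted:.2f} ns")
```

In this toy model the optimum sits strictly inside the candidate range and moves when the error curve moves, which is why a cache with a fixed TBL would miss the optimum once the PVT condition changes.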