# HuNT: Exploiting Heterogeneous PIM Devices to Design a 3-D Manycore Architecture for DNN Training

Chukwufumnanya Ogbogu, *Graduate Student Member, IEEE*, Gaurav Narang, *Graduate Student Member, IEEE*, Biresh Kumar Joardar, *Member, IEEE*, Janardhan Rao Doppa, *Senior Member, IEEE*, Krishnendu Chakrabarty, *Fellow, IEEE*, and Partha Pratim Pande, *Fellow, IEEE*

Abstract: Processing-in-memory (PIM) architectures have emerged as an attractive computing paradigm for accelerating deep neural network (DNN) training and inferencing. However, a plethora of PIM devices, e.g., resistive random-access memory, ferroelectric field-effect transistor, phase change memory, MRAM, static random-access memory, exists and each of these devices offers advantages and drawbacks in terms of power, latency, area, and nonidealities. A heterogeneous architecture that combines the benefits of multiple devices in a single platform can enable energy-efficient and high-performance DNN training and inference. 3-D integration enables the design of such a heterogeneous architecture where multiple planar tiers consisting of different PIM devices can be integrated into a single platform. In this work, we propose the HuNT framework, which hunts for (finds) an optimal DNN neural layer mapping, and planar tier configurations for a 3-D heterogeneous architecture. Overall, our experimental results demonstrate that the HuNT-enabled 3-D heterogeneous architecture achieves up to $10\times$ and $3.5\times$ improvement with respect to the homogeneous and existing heterogeneous PIM-based architectures, respectively, in terms of energy-efficiency (TOPS/W). Similarly, the proposed HuNT-enabled architecture outperforms existing homogeneous and heterogeneous architectures by up to $8\times$ and $2.4\times$, respectively, in terms of compute-efficiency (TOPS/mm<sup>2</sup>) without compromising the final DNN accuracy.

Index Terms: DNN, FeFET, PIM, ReRAM, SRAM.


## I. INTRODUCTION

Deep neural networks (DNNs) are widely employed to solve complex problems in a variety of application domains, including computer vision, natural language processing (NLP), and time-series sensor data analytics [1]. However, DNNs have hundreds of millions of trainable parameters, which need to be tuned using large and complex datasets. The high latency and energy cost of data movement between the processing cores and memory units in traditional computing platforms based on the von Neumann architecture (e.g., CPUs and GPUs) impose significant performance bottlenecks while executing DNN workloads, which is referred to as the *memory wall* challenge [2]. Consequently, there has been a growing demand for domain-specific computing platforms that seamlessly integrate both storage and computing, thereby enabling high-performance and energy-efficient acceleration of DNN workloads [3].

Manuscript received 8 August 2024; accepted 9 August 2024. This work was supported in part by the U.S. National Science Foundation (NSF) under Grant CNS-1955353 and Grant CSR-2308530. This article was presented at the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) 2024 and appeared as part of the ESWEEK-TCAD Special Issue. This article was recommended by Associate Editor S. Dailey. (Chukwufumnanya Ogbogu and Gaurav Narang contributed equally to this work.) (Corresponding author: Partha Pratim Pande.)

Chukwufumnanya Ogbogu, Gaurav Narang, Janardhan Rao Doppa, and Partha Pratim Pande are with the Department of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164 USA (e-mail: c.ogbogu@wsu.edu; gaurav.narang@wsu.edu; jana.doppa@wsu.edu; pande@wsu.edu).

Biresh Kumar Joardar is with the Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77204 USA.

Krishnendu Chakrabarty is with the Department of Electrical and Computer Engineering, Arizona State University, Tempe, AZ 85287 USA.

Digital Object Identifier 10.1109/TCAD.2024.3444708

Processing-in-memory (PIM)-based computing platforms have emerged as a promising alternative for executing DNN workloads. This is due to their ability to perform energy-efficient computation within the memory to eliminate unnecessary data movement, thus addressing the memory-wall challenge. Specifically, CMOS-based memory devices, such as static random-access memory (SRAM), and nonvolatile memory (NVM) devices, such as resistive random-access memory (ReRAM), phase change memory (PCM), ferroelectric field-effect transistors (FeFETs), and spintronic memory (MRAM), have been widely studied as suitable candidates for accelerating DNN training and inferencing [2], [3], [4], [5], [6]. However, each of these PIM devices offers specific advantages and drawbacks in terms of dynamic and leakage power, area, latency, retention, endurance, and nonidealities when used as the PIM device in DNN accelerators [3]. For example, ReRAM devices have roughly $30\times$ higher write latency compared to FeFET devices. However, ReRAMs can have a write endurance as high as $\sim 10^{12}$ programming cycles, whereas FeFETs have an endurance of $\sim 10^5$ cycles [7]. An ideal memory device suitable for energy-efficient and high-performance PIM-based DNN accelerators should have low read/write latency ($< 1$ ns), low dynamic and leakage energy ($< 3$ pJ), high write endurance ($> 10^{17}$ cycles), small memory cell footprint ($< 4F^2$), and excellent scalability to lower technology nodes ($< 10$ nm) [8]. However, so far, no particular PIM device has all the ideal characteristics. At the same time, DNN workloads are composed of neural layers, which can differ significantly in terms of the number of layers, weight parameters, kernel size, input and output information across layers in the forward-propagation, and frequency of weight updates during the back-propagation step. These characteristics determine the suitability of each neural layer in the forward- and back-propagation phase of the DNN workload to be executed on a specific PIM-enabled processing element (PE) in terms of area, latency, power, and endurance. Hence, this PE-level heterogeneity in PIM-based architectures needs to be exploited to achieve the best tradeoff in terms of power, area, performance, and DNN accuracy while designing a suitable accelerator platform.

1937-4151 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Integrating different memory devices in a single platform presents unique challenges. Specifically, manufacturing technologies of NVM devices vary and they are not always CMOS-compatible [9]. This hinders the feasibility of integrating such heterogeneous PEs into a single planar architecture. 3-D integration enables the mapping of disparate technologies to different planar tiers [9], [10]. However, existing implementations of 3-D heterogeneous architectures are not well optimized for PIM devices, as they do not consider the device-level characteristics in their design optimization flow. For example, 3-D architectures are known to give rise to thermal hotspots. PIM devices, such as ReRAM and FeFET, are susceptible to nonidealities due to thermal noise, which potentially degrades the accuracy of trained DNNs [11], [12]. As a result, critical DNN model layers mapped to PEs placed in planar tiers that are away from heat sinks can potentially degrade the test accuracy of the DNN due to thermal hotspots. Hence, in order to meet the high accuracy demand of DNN applications, suitable placement of the PEs on planar tiers in a 3-D system is important.

Furthermore, existing heterogeneous DNN accelerators do not consider the characteristics of DNN workloads and the properties of different PIM devices while mapping DNN neural layers to PEs in the overall architecture [2], [4]. For example, neural layers with a large number of weights and activations would consume more power when mapped to a PE with high read/write energy than when mapped to a PE with lower read/write energy. In addition, different neural layers have varying impact on DNN accuracy [13]. Hence, they need to be suitably mapped to appropriate PEs on a planar tier in the 3-D architecture without degrading the final predictive accuracy. Overall, the layer-to-PE and PE-to-tier mappings in a 3-D heterogeneous system impact the overall performance in terms of latency, area, power, and accuracy while executing DNN workloads.

In this article, we propose a design space exploration methodology called HuNT that undertakes neural layer-to-PE and PE-to-planar tier mapping to design an optimized 3-D heterogeneous manycore architecture for training DNN workloads. We consider SRAM, ReRAM, and FeFET PIM-enabled PEs for studying the efficacy of the HuNT framework. These heterogeneous PIM devices largely vary in terms of area, power, latency, and endurance. This variation provides HuNT with the scope of optimizing across multiple conflicting, yet crucial objectives, namely: latency, accuracy, area, and power. We capture these objectives using three performance evaluation metrics: 1) energy-efficiency (TOPS/W); 2) compute-efficiency (TOPS/mm<sup>2</sup>); and 3) DNN predictive accuracy. Recent work has proposed optimization methodologies aimed at exploring device-level heterogeneity in PIM accelerators [14], [15], [16]. However, these techniques are focused on DNN inference scenarios, and cannot handle the more challenging scenario of DNN training. Specifically, the computation of the weight- and activation-gradients in the back-propagation phase requires multiple write operations and high-precision computation. However, NVM devices have limited write endurance, and store weights and activations in fixed-point representation [3], [17]. These critical drawbacks limit the applicability of existing NVM-based PIM accelerators to DNN training. In this work, in addition to the energy-efficient NVM devices (ReRAM and FeFET), we have also incorporated into the HuNT framework a CMOS-based memory device (SRAM), which can perform high-precision computation in the back-propagation phase and has a high write endurance. This heterogeneity in PIM devices enables reliable, energy-efficient, and high-performance DNN training on 3-D heterogeneous PIM architectures. The key contributions of this work are as follows.

- We propose the HuNT framework that determines the mapping of DNN layers to heterogeneous PEs and the corresponding PE-to-planar tier mapping to design a 3-D heterogeneous manycore architecture tailor-made for DNN training. The heterogeneity enables significant improvement in energy-efficiency, area-efficiency, and endurance compared to its homogeneous counterparts.
- We demonstrate the transferability of the HuNT-enabled 3-D heterogeneous manycore architecture for diverse datasets. The hardware architecture optimized with the CIFAR-10 dataset is equally effective for larger datasets, such as CIFAR-100 and TinyImageNet. Hence, this reduces the cost of repeated optimization as no extra training is required for complex datasets.
- Our experimental results show that the HuNT-enabled 3-D heterogeneous PIM architecture outperforms state-of-the-art heterogeneous PIM architectures, namely, AccuReD and HyperX, by up to $3.5\times$ and $4.5\times$, respectively, in terms of energy-efficiency (TOPS/W), and $2.2\times$ and $3.2\times$, respectively, in terms of area-efficiency (TOPS/mm<sup>2</sup>).

To the best of our knowledge, HuNT is the first-of-its-kind framework that jointly incorporates DNN layer-to-PE and PE-to-planar tier mapping in a 3-D architecture to achieve high-performance, energy-efficient, and reliable DNN training.

## II. BACKGROUND AND RELATED PRIOR WORK

In this section, we discuss relevant prior work on PIM-based architectures for accelerating DNN workloads. Specifically, we focus on homogeneous PIM architectures solely based on either SRAM, ReRAM, or FeFET devices, as well as their advantages and limitations. Table I compares the characteristics of SRAM, ReRAM, and FeFET PIM devices. Next, we discuss heterogeneous architectures that combine two or more of these devices, and finally shed more light on 2.5-D and 3-D-based PIM accelerators for DNNs.

### A. Homogeneous PIM Architectures

*SRAM* cells have been used as a crossbar-based PIM device for high accuracy DNN training and inference [18], [19]. This is due to their low device variability, high write endurance, low susceptibility to noise, and low write latency, as shown in Table I [3]. However, the 6T-cell configuration of SRAMs with a cell size of $150F^2$ (as shown in Table I) leads to the high area overhead of SRAM-based crossbar arrays [3], [14]. Additionally, SRAMs suffer from high leakage energy and have low density storage (i.e., can only store 1 bit per cell), thereby making them less energy- and area-efficient compared to other PIM devices. This makes SRAM-based PIM platforms infeasible for large DNN models with a large number of weights and activations and many neural layers. Recent work has also leveraged DRAM technology for PIM-based architectures due to its small cell area [20]. However, DRAM suffers from high leakage power and refresh energy due to its volatile nature. Moreover, the 1T1C structure of the DRAM cell lacks in-situ compute capability, and hence cannot enable the parallel energy-efficient matrix-vector multiply (MVM) operations required for DNN training [20]. Consequently, this has led researchers to explore NVM devices, such as FeFETs and ReRAMs.

| Property                   | SRAM [7]           | ReRAM [33]     | FeFET [7]      |
|----------------------------|--------------------|----------------|----------------|
| Multi-bit Cell             | No                 | Yes            | Yes            |
| Cell Area<sup>†</sup>      | $150F^2$           | $4F^2$         | $35F^2$        |
| Write Energy               | 3 pJ               | 2 nJ           | 5 pJ           |
| Write Latency              | ~1 ns              | ~100 ns        | ~3 ns          |
| Write Endurance            | $>10^{17}$ cycles  | $10^8$ cycles  | $10^5$ cycles  |
| Leakage Energy             | High               | Low            | Low            |

TABLE I. Comparison of Various PIM Devices

<sup>†</sup>$F$ is the minimum feature size [14].
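As a quick illustration of how the Table I entries can be compared, the following minimal Python sketch stores the tabulated values in a lookup and computes two device-to-device ratios. The `PimDevice` class, the `TABLE_I` dictionary, and the `ratio` helper are hypothetical names introduced only for this example; the approximate ("~") and lower-bound (">") entries are treated as exact numbers for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PimDevice:
    multi_bit: bool
    cell_area_f2: float      # cell area in units of F^2
    write_energy_pj: float   # write energy in pJ
    write_latency_ns: float  # write latency in ns
    endurance_cycles: float  # write endurance in cycles


# Values transcribed from Table I (2 nJ is stored as 2000 pJ).
TABLE_I = {
    "SRAM": PimDevice(False, 150, 3.0, 1.0, 1e17),
    "ReRAM": PimDevice(True, 4, 2000.0, 100.0, 1e8),
    "FeFET": PimDevice(True, 35, 5.0, 3.0, 1e5),
}


def ratio(attr, a, b):
    """Return device a's value of `attr` divided by device b's value."""
    return getattr(TABLE_I[a], attr) / getattr(TABLE_I[b], attr)


# ReRAM writes are roughly 33x slower than FeFET writes.
print(round(ratio("write_latency_ns", "ReRAM", "FeFET"), 1))  # 33.3
# An SRAM cell occupies 37.5x the area of a ReRAM cell.
print(ratio("cell_area_f2", "SRAM", "ReRAM"))                 # 37.5
```

The two printed ratios follow directly from the table and show why no single device wins on every property.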

*ReRAM*-based NVM devices enable high-density storage due to their multibit cell storage capability [3], [4]. Additionally, ReRAM devices have relatively small cell area and low leakage energy compared to SRAMs, as shown in Table I. However, despite these advantages, ReRAM cells suffer from lower write endurance and higher write energy and latency than SRAMs. This limits the applicability of ReRAM-based PIM architectures for DNN training scenarios, as the back-propagation phase requires a significant number of write operations [3]. Additionally, ReRAM cells become less reliable as temperature increases over time, which can cause errors, thereby leading to a degradation in the DNN predictive accuracy. Also, despite the small cell area of ReRAMs ($\sim 4F^2$), the high-resolution ADCs required by the ReRAM crossbar array introduce significant area and energy overhead [4]. Hence, this potentially limits the benefits of using ReRAM devices in PIM-based architectures.

*FeFET* devices have been explored as another possibility for PIM-based DNN accelerators. FeFET PIM devices are particularly attractive due to their relatively low cell area ($\sim 35F^2$) compared to SRAMs, high read and write speeds, low write energy, and low leakage energy. Moreover, they exhibit relatively better temperature stability compared to ReRAM [7]. However, as shown in Table I, a key drawback of FeFET PIM devices is their low write endurance compared to other memory technologies, such as SRAMs and ReRAMs. This is due to the collapse of the separation between the ON and OFF states of the FeFET device (also known as the *memory window*) after repeated program/erase cycles [5]. Consequently, this can cause read errors during DNN training and inference.

Overall, homogeneous architectures built solely using either SRAM, ReRAM, or FeFET PIM devices have their unique advantages, as well as drawbacks that limit their applicability for DNN training and inference workloads. Therefore, exploring heterogeneous PIM architectures that combine two or more PIM devices is necessary to achieve better performance, power, area, and DNN predictive accuracy tradeoffs compared to the homogeneous ones. For the scope of this work, we consider CMOS-based SRAM devices and NVM-based FeFET and ReRAM devices as examples to demonstrate the viability of our proposed framework to design optimized heterogeneous PIM accelerators. Note, however, that other types of PIM devices, such as PCMs and MRAMs, can also be considered for heterogeneous systems.

### B. Heterogeneous PIM Architectures

Prior work has proposed heterogeneous architectures that combine two or more PIM devices for accelerating DNN workloads. Various hybrid ReRAM/SRAM-based PIM architectures have been proposed to address the nonidealities in ReRAM devices, and reduce the high area overhead of SRAM. Some of these approaches encode the MSBs of multibit weights using SRAMs and the LSBs using ReRAMs, while maintaining high energy-efficiency [21]. Other methods use ReRAM and SRAM to perform the DNN forward- and back-propagation operations, respectively, thereby mitigating the limited endurance challenge of ReRAM. In fact, a recent hybrid architecture incorporates SRAM macros to perform output compensation of the nonideal output of ReRAM crossbars, thereby enabling robust DNN inference [16]. However, these methods do not consider the layer-wise characteristics of DNN workloads (e.g., number of neural layers, weights, activations, size of kernels, etc.) while mapping neural layers to the heterogeneous PIM-based architectures. As a result, this can lead to suboptimal performance while executing DNN training and inference tasks.

A recent work called HyDe has proposed a design space exploration methodology for finding an optimal mapping of DNN layers to either SRAM, FeFET, or PCM devices in a hybrid platform [14]. This approach leverages the characteristics of each DNN layer to find its affinity toward a specific type of PIM device. However, this approach is aimed only at inferencing, and considers a scalarized single-objective optimization formulation. Linear scalarization is known to perform poorly due to its inability to explore nonconvex regions of the Pareto front. Moreover, HyDe follows a differentiable optimization approach, which is not possible for all hardware design objectives and requires training DNN weights by considering the device characteristics. Hence, this is not practical for the DNN training task. Other works, such as HyperX, have proposed a hybrid SRAM/ReRAM architecture, where some DNN layer weights remain static and are mapped to ReRAMs, while other layers are mapped to SRAMs for fine-tuning [22].
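The limitation of linear scalarization mentioned above can be seen in a toy example. The sketch below is not taken from HyDe or HuNT; the three designs and their two objective values are made up purely for illustration. Design `C` is Pareto optimal, but it lies in a nonconvex region of the front, so no weighted sum of the two objectives ever selects it.

```python
# Three hypothetical designs with two objectives, both to be minimized.
designs = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}


def dominates(p, q):
    """True if p is no worse than q in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


# Pareto set: designs that are not dominated by any other design.
pareto = {
    name for name, p in designs.items()
    if not any(dominates(q, p) for other, q in designs.items() if other != name)
}


def scalarized_winner(w):
    """Best design under the weighted sum w*f1 + (1 - w)*f2."""
    return min(designs, key=lambda n: w * designs[n][0] + (1 - w) * designs[n][1])


# Sweep the weight over [0, 1] and record every design that is ever chosen.
winners = {scalarized_winner(i / 100) for i in range(101)}

print(sorted(pareto))   # ['A', 'B', 'C']
print(sorted(winners))  # ['A', 'B']
```

For any weight `w`, the better of `A` and `B` scores at most 0.5, while `C` always scores about 0.6, which is why a multiobjective formulation is needed to recover such designs.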

Despite the advantages of heterogeneity, previous solutions do not consider the challenges of integrating different PIM devices into a single platform. Moreover, they are mostly targeted at DNN inferencing/fine-tuning applications and cannot be used for end-to-end training of large DNNs. Hence, suitable heterogeneous PIM architectures for DNN training scenarios need to be explored.

### C. 2.5-D/3-D-Based PIM Architectures

To address the challenges associated with integrating different PIM technologies in a single platform, various heterogeneous integration methods have been proposed. Specifically, chiplet-based (2.5-D) integration techniques have been proposed for DNN accelerators [14]. However, the long-range on-chip communication in planar 2.5-D systems presents a significant performance bottleneck in the execution of DNN workloads [23]. Hence, 3-D heterogeneous integration methods that stack planar tiers consisting of PEs connected to each other using through-silicon-via (TSV)-based vertical links have been proposed [23]. For example, a 3-D heterogeneous architecture for accelerating DNN training known as AccuReD was recently proposed. AccuReD leverages ReRAM-based PEs and GPUs for accelerating all types of DNN layers to enable high accuracy DNN training [23].

Despite offering the advantages of 3-D heterogeneous integration, existing architectures do not consider the properties of the neural layers while determining the mapping for DNN workloads. Moreover, 3-D architectures inherently suffer from thermal issues, which have a varying impact on PEs with NVM devices (FeFET and ReRAM). Prior work does not adequately consider thermal issues while finding a suitable DNN layer-to-PE mapping in 3-D heterogeneous PIM architectures. As a result, this potentially leads to degradation of predictive accuracy, power, and latency when DNN workloads are executed. Hence, the properties of the DNN neural layers, PIM device characteristics of the PEs, as well as the PE to 3-D planar tier mapping should be jointly considered to enable high-performance, energy-efficient, and reliable DNN training on heterogeneous PIM platforms.

## III. HUNT FRAMEWORK

This section presents the problem formulation and optimization methodology of the HuNT framework to find the neural layer-to-PE and PE-to-tier mapping in a 3-D heterogeneous architecture for DNN training.

### A. Problem Setup


We consider a manycore system with *C* PIM-based PEs distributed over *Z* planar tiers and stacked using TSV-based vertical links. We use a conventional mesh-based network-on-chip (NoC) as the communication backbone [23]. Each planar tier consists of PEs of one particular type of PIM device, i.e., either SRAM (S), ReRAM (R), or FeFET (F). Fig. 1 illustrates this manycore architecture. Given the characteristics of the DNN neural layers and the physical properties of the PIM devices in the PEs, our goal is to find an optimized neural layer-to-PE mapping and the corresponding PE-to-planar tier mapping that achieves a suitable tradeoff between the training accuracy, area, latency, and power.

Fig. 1. Illustration of layer-to-PE and PE-to-tier mapping of a DNN workload with *K* layers onto a 3-D heterogeneous PIM-based architecture. Here, DNN layer $L_1$ is mapped to ReRAM-based PEs and placed on Tier 1 as an example.

Without loss of generality, Fig. 1 shows a DNN workload mapped onto a 3-D heterogeneous architecture. Here, each neural layer ($L_i$) of the DNN can be mapped onto either SRAM-, FeFET-, or ReRAM-based PEs, which can be located in tier 1, 2, or 3 as shown in Fig. 1. In addition, each neural layer is characterized by its corresponding kernel size, the number of input and output features, and the bit precision of weights/activations, and can be mapped to one or more PEs in a planar tier of the 3-D architecture. DNN training requires the high-precision computation of weight- and activation-gradients for each neural layer in the back-propagation phase. This process requires a significant number of write operations, which influences the choice of PIM device for the computation of the back-propagation phase.

Furthermore, the PIM devices in the PEs have their corresponding physical properties, such as write endurance limit, area, energy, latency, and temperature-dependent nonideal effects. Additionally, the distance of a planar tier from the heat sink in the 3-D architecture determines the degree of vulnerability of the PIM device to thermal noise, which can potentially lead to significant loss in DNN accuracy [23]. Consequently, this leads to a multiobjective optimization (MOO) problem of mapping each neural layer to one of the *C* PIM-based PEs (i.e., either an SRAM-, ReRAM-, or FeFET-based PE), as well as finding its appropriate location in one of the *Z* planar tiers, such that the best latency, area, power, and accuracy tradeoff is achieved.

### B. HuNT MOO Formulation

Fig. 2. Overall workflow of the HuNT framework, showing input stage, optimization phase, and the final validation phase.

Fig. 2 shows the overview of the proposed HuNT framework. The inputs to the framework are the number of planar tiers (Z), total number of PEs (C), PIM device choices, and DNN workload characteristics (e.g., number of neural layers, their weights, activations, etc.). We define the mapping vector $\pi$ to characterize the mapping of $K$ neural layers onto PEs in the 3-D architecture, and the corresponding PE-to-planar tier mapping $\alpha = [t_1, t_2, \dots, t_Z]$, where $t_i$ is the device type of the $i$th planar tier. A candidate design $d$ in the design space $D$ corresponds to a specific neural layer-to-PE mapping ($\pi$) and PE-to-tier mapping ($\alpha$). In each optimization iteration, one design $d$ is evaluated using power, latency, area, and DNN accuracy estimation models. Our goal is to minimize 1) the loss in DNN accuracy (Err) due to various PIM device nonidealities; 2) the area in terms of the number of PEs needed to map all the DNN layers (Ar); 3) the latency (Lat); and 4) the power consumption (Pwr) while executing a given DNN training task on the 3-D heterogeneous architecture. We represent the MOO formulation as

$$D^* = \mathrm{MOO}\big(\mathrm{OBJ} = \mathrm{Pwr}(d), \mathrm{Ar}(d), \mathrm{Lat}(d), \mathrm{Err}(d)\big) \tag{1}$$

where $D^*$ is the set of Pareto-optimal designs. A design is called *Pareto optimal* if it cannot be improved in any of the design objectives without compromising some other objective. The goal is to first find the Pareto-optimal set $D^* \subseteq D$ using a MOO solver. Next, we select feasible designs from the Pareto set that meet the constraint (e.g., less than ~1% accuracy loss compared to the ideal accuracy). Finally, we select the best design $d_{\text{best}}$ from the feasible designs that achieves the best performance in terms of either energy-efficiency (TOPS/W) or compute-efficiency (TOPS/mm<sup>2</sup>).
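The three-step selection described above (Pareto set, feasibility filter, final pick) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exhaustive dominance check stands in for a real MOO solver, and the `Design` tuple, the `tops_per_w` field, and all numeric values are hypothetical.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    pwr: float         # power (lower is better)
    ar: float          # area in number of PEs (lower is better)
    lat: float         # latency (lower is better)
    err: float         # accuracy loss in percent (lower is better)
    tops_per_w: float  # hypothetical energy-efficiency used for the final pick


def dominates(a, b):
    """True if a is no worse than b in all four objectives and differs in one."""
    fa, fb = a[1:5], b[1:5]
    return all(x <= y for x, y in zip(fa, fb)) and fa != fb


def pareto_set(designs):
    """Keep the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


def select_best(designs, max_err=1.0):
    """Pareto set -> accuracy-feasible subset -> highest TOPS/W."""
    feasible = [d for d in pareto_set(designs) if d.err < max_err]
    return max(feasible, key=lambda d: d.tops_per_w) if feasible else None


candidates = [
    Design("d1", 10, 8, 5, 0.4, 2.0),
    Design("d2", 12, 9, 6, 0.5, 1.5),   # dominated by d1
    Design("d3", 6, 7, 4, 2.5, 4.0),    # Pareto optimal, but violates the accuracy constraint
    Design("d4", 8, 10, 5, 0.8, 3.0),
]

print([d.name for d in pareto_set(candidates)])  # ['d1', 'd3', 'd4']
print(select_best(candidates).name)              # d4
```

Note that `d3` has the best efficiency but is rejected by the accuracy constraint, so the final choice falls to `d4`.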

Next, we discuss the key elements of our MOO formulation.

*1) Inputs:* The inputs to the HuNT framework are the number of planar tiers (*Z*), total number of PEs (*C*), PIM device choices (ReRAM, SRAM, and FeFET), and DNN workload characteristics (e.g., weights, activations, etc.).

*2) Design Variables:* There are two types of design variables for the optimization for a given DNN model. Each candidate solution represents 1) a neural layer mapping to PEs ($\pi$) and 2) a PE-to-tier mapping ($\alpha$), i.e., $[t_1, t_2, \ldots, t_Z]$, where the planar tiers $t_1$ and $t_Z$ are closest and farthest from the heat sink, respectively, resulting in higher temperature on planar tier $t_Z$ compared to $t_1$.
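One possible way to encode these two design variables is sketched below. This is a simplified, hypothetical encoding rather than the framework's actual data structure: it assumes $Z = 3$ tiers with each device type occupying exactly one tier, and it records only the device type chosen for each layer instead of individual PE indices. The names `CandidateDesign`, `pi`, `alpha`, and `tier_of_layer` are introduced for this example.

```python
from dataclasses import dataclass
from itertools import permutations

DEVICES = ("S", "R", "F")  # SRAM, ReRAM, FeFET


@dataclass(frozen=True)
class CandidateDesign:
    pi: tuple     # pi[k] is the device type of the PEs hosting neural layer k
    alpha: tuple  # alpha[i] is the device type of tier i+1; index 0 is closest to the heat sink

    def tier_of_layer(self, k):
        """Return the 1-based tier on which layer k executes."""
        return self.alpha.index(self.pi[k]) + 1


# All possible tier orderings when each device type occupies exactly one tier.
tier_orders = list(permutations(DEVICES))

# A 4-layer DNN: layer 0 on ReRAM, layer 1 on FeFET, layers 2 and 3 on SRAM.
d = CandidateDesign(pi=("R", "F", "S", "S"), alpha=("S", "F", "R"))

print(len(tier_orders))    # 6
print(d.tier_of_layer(0))  # 3
```

In this example the ReRAM tier is farthest from the heat sink, so layer 0 would see the highest temperature among the four layers.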

*3) Design Objectives:* Next, we explain the evaluation of the design objectives: latency, area, accuracy, and power. We can get accurate values for all these objectives for any candidate design by performing cycle-accurate simulations, which are very expensive. Since we need to evaluate many design choices to solve the MOO problem shown in (1), we consider surrogate design objectives elaborated below for tractable optimization.

*Latency (Lat)*: We evaluate the end-to-end latency incurred in DNN training for a candidate design (Lat($d$)) considering a 3-D mesh-based NoC architecture. The latency for a candidate design is proportional to the sum of the computation and communication latency while executing the training task on the 3-D heterogeneous architecture, as given by (2). Computation during DNN training involves computing activations (Act) and gradients [activation gradients ($\Delta AG$) and weight gradients ($\Delta WG$)] in the forward- and back-propagation phases, respectively. Both phases have different precision requirements. In contrast to Act computation, $\Delta AG$ and $\Delta WG$ computation requires PEs with a PIM device that has high precision and high endurance due to the large number of repeated write operations [24]. Hence, the neural layer computation in a training task is spread out on different 3-D planar tiers, where each tier consists of PEs of a specific device type. This generates on-chip communication traffic, which depends on the layer-to-PE and the PE-to-tier mapping. The end-to-end compute latency for a DNN workload depends on the compute latency incurred by the individual neural layers mapped to either SRAM-, ReRAM-, or FeFET-based PEs ($\text{Latency}_{S|R|F}$), as shown in (3). Similarly, the latency associated with sending Act, $\Delta AG$, or $\Delta WG$ from $PE_i$ to $PE_j$ depends on the placement of PEs and contributes to the communication latency given by (4), where $F_{ij}$ is either Act, $\Delta AG$, or $\Delta WG$ as defined above. The parameter $M_{ij}$ is the corresponding Manhattan distance between $PE_i$ and $PE_j$

$$\text{Lat}(d) \propto \mathcal{L}_{\text{compute}} + \mathcal{L}_{\text{comm}}$$
(2)

$$\mathcal{L}_{\text{compute}} \propto \sum_{i=1}^{K} \left[ \text{Latency}_{S|R|F}(w_i + \text{Act}_i) \right]$$
(3)

$$\mathcal{L}_{\text{comm}}(i,j) \propto F_{ij} \cdot M_{ij} \quad \forall F_{ij} \in \{\text{Act}, \Delta AG, \Delta WG\}.$$
(4)
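As a rough illustration of (2)-(4), the sketch below scores a candidate mapping by adding a per-layer compute term to a traffic-weighted Manhattan distance term. The per-device latency coefficients, the `(x, y, z)` PE coordinates, and all function names are hypothetical placeholders chosen for this example; they are not values produced by NeuroSim or BookSim.

```python
# Illustrative surrogate for Eqs. (2)-(4); all coefficients are made-up placeholders.
UNIT_LATENCY = {"S": 1.0, "R": 3.0, "F": 1.5}  # hypothetical relative latency per unit of data


def manhattan(a, b):
    """Manhattan distance M_ij between two PE coordinates (x, y, z)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def compute_latency(layers):
    """Eq. (3): sum over layers of Latency_{S|R|F}(w_i + Act_i).

    `layers` is a list of (device, weights, activations) tuples.
    """
    return sum(UNIT_LATENCY[dev] * (w + act) for dev, w, act in layers)


def comm_latency(traffic, coords):
    """Eq. (4) accumulated over all communicating PE pairs: sum of F_ij * M_ij.

    `traffic` maps (i, j) to the data volume F_ij; `coords` maps a PE id to (x, y, z).
    """
    return sum(f * manhattan(coords[i], coords[j]) for (i, j), f in traffic.items())


def latency(layers, traffic, coords):
    """Eq. (2): Lat(d) is proportional to compute plus communication latency."""
    return compute_latency(layers) + comm_latency(traffic, coords)
```

Because both terms are only proportionalities in the paper, the absolute value returned here is meaningless; only the ranking of candidate designs matters.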

*Area (Ar)*: It is desirable to execute a given DNN training task using fewer resources (PEs) to improve the compute efficiency (TOPS/mm<sup>2</sup>). The number of PEs needed to map a given neural layer depends on its device type. For example, SRAM cells have a larger footprint ($150F^2$) than ReRAM cells ($4F^2$), where F is the minimum feature size, as mentioned in Table I. Hence, a neural layer mapped to SRAM-based PEs would require more PEs than if it were mapped to ReRAM-based PEs, leading to comparatively lower TOPS/mm<sup>2</sup>. The design objective Ar corresponds to the sum of computational resources needed to execute *K* layers of a DNN, where the number of PEs needed for the *i*th neural layer (weights $w_i$ and activations Act<sub>i</sub>) depends on the PIM device (Area<sub>S|R|F</sub>)

$$Ar(d) \propto \sum_{i=1}^{K} \left[ \operatorname{Area}_{S|R|F}(w_i + \operatorname{Act}_i) \right].$$
(5)
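The footprint argument behind (5) can be made concrete with a small sketch. It uses only the two cell footprints quoted above (SRAM and ReRAM); the per-PE cell-area budget `pe_cell_area_f2` is a hypothetical parameter, and peripheral circuits and multi-bit cells are ignored, so this is a simplification rather than the NeuroSim-based area model used in the paper.

```python
import math

# Cell footprints quoted in the text, in units of F^2 (F = minimum feature size).
CELL_FOOTPRINT_F2 = {"S": 150, "R": 4}


def pes_needed(num_cells, device, pe_cell_area_f2):
    """PEs required to hold `num_cells` storage cells of the given device type.

    `pe_cell_area_f2` is an assumed per-PE area budget for cells (in F^2);
    peripherals such as ADCs, sense amplifiers, and decoders are not modeled.
    """
    cells_per_pe = pe_cell_area_f2 // CELL_FOOTPRINT_F2[device]
    return math.ceil(num_cells / cells_per_pe)


def area_objective(layers, pe_cell_area_f2):
    """Eq. (5): Ar(d) is proportional to the total number of PEs over all layers.

    `layers` is a list of (num_cells, device) tuples, one per neural layer.
    """
    return sum(pes_needed(n, dev, pe_cell_area_f2) for n, dev in layers)
```

Under an iso-PE area setting, the same layer therefore occupies many more SRAM-based PEs than ReRAM-based PEs, which is what lowers TOPS/mm<sup>2</sup> for SRAM-heavy mappings.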

*Accuracy (Err)*: Prior work has shown that DNN models can be trained to be robust against conductance drift in NVM devices using techniques such as adaptive noise injection, negative feedback training, etc. [25], [26]. For example, the injection of Gaussian noise is widely used to improve the robustness of DNN training executed on NVM-based architectures [13]. In a 3-D manycore architecture, the thermal noise mainly depends on the placement of the PEs and their mutual interactions. Hence, the exact noise and the specific layer-wise weight deviation ($\sigma$) are not known prior to the neural layer-to-PE and PE-to-tier mapping on a given architecture. Thus, even with a model trained with conductance drift incorporated, the actual thermal noise depends on the neural layer-to-PE mapping and the location of the planar tier where the PE is placed. This necessitates the consideration of accuracy as one of the objectives in the MOO formulation. It should be noted that executing DNN training for each candidate mapping solution d in the optimization phase is costly. Hence, we model the loss in accuracy by capturing the deviation in stored weights and activations due to thermal noise.

3-D architectures with multiple stacked planar tiers are prone to thermal hotspots, which cause variations in stored DNN weights and activations, especially in NVM devices (FeFET and ReRAM). This leads to a degradation in the DNN accuracy. However, SRAM is known to be more tolerant to thermal noise than ReRAM and FeFET devices [14]. Hence, to achieve high DNN accuracy, it is desirable to execute high-precision computations (involved in the back-propagation phase) on PEs with a PIM device that is more resilient to thermal noise. Thus, computations involved in a neural layer in different DNN training phases, i.e., the forward- and back-propagation phases, need to be mapped on different types of PEs to achieve high training accuracy. Furthermore, these different PEs can be mapped to planar tiers such that the loss in DNN accuracy due to thermal noise is mitigated. For example, PEs with NVM devices should be placed closer to the heat sink, while SRAM-based PEs can be mapped to a planar tier farther from the heat sink.

In addition, the layer-to-PE and PE-to-tier mapping also needs to be considered for different NVM devices. This is crucial because thermal noise affects the weights/activations of various NVM devices differently. For example, weights and activations of the neural layers are stored in ReRAM cells as conductance states. As the temperature increases, the OFF-state conductance of ReRAM cells increases exponentially, and the noise margin reduces [23]. On the other hand, the noise margin of FeFET devices, characterized by the memory window, reduces linearly with the increase in temperature [5]. For weight variation ($\Delta w$), we adopt a Gaussian distribution with $\Delta w \sim \mathcal{G}(0, \sigma^2)$, where $\sigma$ represents the standard deviation of weights, consistent with prior work [13]. The variation of weights/activations belonging to different DNN layers impacts the model accuracy differently. The impact of weight/activation variations due to thermal noise is captured by the loss in accuracy (Err) given by (6) and (7). Hence, Err depends on the neural layer mapping, the $PE_i$ temperature $Q_i$, and the DNN layer weights $w_i$ and activations $\text{Act}_i$

$$\operatorname{Err}(d) = \sum_{i=1}^{C} (w_i + \operatorname{Act}_i) \cdot \mathcal{N}(i)$$
(6)

$$\mathcal{N}(i) = \begin{cases} \exp[Q_i], & \text{ReRAM} \\ Q_i, & \text{FeFET} \\ \sim 0, & \text{SRAM} \end{cases}$$
(7)
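The accuracy surrogate in (6) and (7) translates directly into a few lines of code. In the sketch below, the temperature `q` is assumed to be a normalized, dimensionless quantity (the text does not state the scaling), and the tuple layout and function names are illustrative choices rather than the authors' implementation.

```python
import math


def noise_factor(device, q):
    """Eq. (7): thermal-noise sensitivity N(i) of a PE at (assumed normalized) temperature q."""
    if device == "R":  # ReRAM: exponential dependence on temperature
        return math.exp(q)
    if device == "F":  # FeFET: linear dependence on temperature
        return q
    if device == "S":  # SRAM: treated as approximately noise-free
        return 0.0
    raise ValueError(f"unknown device type: {device}")


def err(pes):
    """Eq. (6): sum over PEs of (w_i + Act_i) * N(i).

    `pes` is a list of (device, weights, activations, temperature) tuples.
    """
    return sum((w + act) * noise_factor(dev, q) for dev, w, act, q in pes)
```

Since $\exp(q) > q$ for every $q$, a ReRAM-based PE is always penalized more than a FeFET-based PE at the same temperature and data volume, which pushes ReRAM tiers toward the heat sink during the search.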

To estimate the temperature $Q_i$ of each PE, our framework utilizes the thermal model from prior work [27], which considers both vertical and horizontal heat flow, given by

$$Q_{o,z} = \left\{ \sum_{u=1}^{z} \left( P_{o,u} \sum_{v=1}^{u} R_{v} \right) + R_{b} \sum_{u=1}^{z} P_{o,u} \right\} \cdot \Phi_{H}$$
(8)

where $P_{o,u}$ is the power consumption of the PEs *u* tiers away from the sink in a vertical stack *o* and is a function of the neural layer-to-PE mapping, $\Phi_H$ represents the lateral heat flow, $R_v$ is the thermal resistance in the vertical direction, $R_b$ is the thermal resistance of the base layer on which the die is placed, and *z* represents the *z*th tier where the PEs are located. Values of $R_v$ and $R_b$ depend on the material characteristics and are calibrated using HotSpot [28].
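A direct transcription of (8) is shown below as a sanity check on the index ranges. The list-based argument layout and the function name are assumptions made for this example; in the paper, $R_v$ and $R_b$ come from HotSpot calibration, whereas the values passed in here are entirely up to the caller.

```python
def pe_temperature(power, r_vertical, r_base, phi_h, z):
    """Eq. (8): temperature term Q_{o,z} for the PE on tier z of one vertical stack o.

    power[u-1]      : P_{o,u}, power of the PE u tiers away from the sink
    r_vertical[v-1] : R_v, vertical thermal resistance of tier v
    r_base          : R_b, thermal resistance of the base layer
    phi_h           : Phi_H, lateral heat-flow factor
    z               : 1-based tier index of the PE of interest
    """
    # Each tier's power is weighted by the cumulative vertical resistance to the sink.
    vertical = sum(power[u] * sum(r_vertical[: u + 1]) for u in range(z))
    # All tiers up to z also see the base-layer resistance.
    base = r_base * sum(power[:z])
    return (vertical + base) * phi_h
```

The nested sum makes the tier ordering matter: a high-power PE placed far from the sink is multiplied by a larger cumulative resistance than the same PE placed next to the sink.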

*Power (Pwr)*: The PE power consumption $P_{o,u}$ in (8) depends on the DNN training task and on the layer-to-PE and PE-to-tier mapping. The total computation power corresponds to the power incurred while computing the individual neural layers mapped to either SRAM-, ReRAM-, or FeFET-based PEs (Power<sub>S|R|F</sub>), as shown in (10). Further, routers and links associated with the PEs dissipate significant power due to the high data exchange between the neural layers. If two consecutive neural layers exchanging a large number of activations are mapped onto PEs that are far apart, then such a mapping creates a traffic bottleneck due to frequent long-distance data transfers. This creates unnecessary congestion, resulting in an increase in the communication power. The communication power required to transfer data from $PE_i$ to $PE_j$ is given by (11), where $F_{ij}$ is either activations (Act) in the forward phase, or activation gradients ($\Delta AG$) and weight gradients ($\Delta WG$) in the back-propagation phase, communicated from $PE_i$ to $PE_j$, and $M_{ij}$ is the corresponding Manhattan distance between $PE_i$ and $PE_j$

$$\text{Pwr}(d) \propto P_{\text{compute}} + P_{\text{comm}}$$
(9)

$$P_{\text{compute}} \propto \sum_{i=1}^{K} \left[ \text{Power}_{S|R|F}(w_i + \text{Act}_i) \right]$$
(10)

$$P_{\text{comm}}(i,j) \propto F_{ij} \cdot M_{ij} \quad \forall F_{ij} \in \{\text{Act}, \Delta AG, \Delta WG\}.$$
(11)

*AMOSA-Based MOO Approach*: In this section, we discuss the algorithmic procedure to compute the Pareto-optimal set of designs (neural layer-to-PE and PE-to-tier mappings). Algorithm 1 shows high-level pseudocode for our design optimization methodology based on the well-known AMOSA solver [29]. The goal is to distribute the computations of K DNN layers (forward- and back-propagation phases) across C PEs on Z planar tiers of different device types to obtain optimal tradeoffs between Pwr, Ar, Lat, and Err. The input variables $\vec{x}$ in our MOO approach are the neural layer-to-PE mapping ($\pi$) and the PE-to-tier mapping ($\alpha$). A candidate configuration of $\vec{x}$ corresponds to the design d, which is a candidate mapping of neural layer-to-PEs and PEs-to-3-D planar tiers (Algorithm 1, line 5).

Algorithm 1: Neural Layer-to-PE and PE-to-3-D Planar Tier Mapping

| Line | Step |
|------|------|
|      | <b>Input</b>: Target manycore system with C PEs of PIM device types (SRAM, ReRAM, or FeFET); APP = DNN training task |
|      | <b>Output</b>: $D^*$, the Pareto-optimal set of designs (optimized neural layer-to-PE and PE-to-tier mappings) |
| 1    | <b>Initialize</b>: $D$ = nondominated set of solutions; $A$ = Archive |
| 2    | Input variables ($\vec{x}$) = neural layer-to-PE mapping ($\pi$) and PE-to-tier mapping ($\alpha$) |
| 3    | Repeat |
| 4    | &emsp;Select one $\vec{x}$ from $A$ and perturb $\vec{x}$ to get a design $d$ |
| 5    | &emsp;design $d \leftarrow$ candidate mapping of neural layer-to-PEs and PEs-to-3-D planar tiers |
| 6    | &emsp;Evaluate(design $d$, APP) /* using power, area, latency, and accuracy models [Section III-B] */ |
| 7    | &emsp;Update nondominated set of solutions $D$ via Pwr(d), Ar(d), Lat(d), Err(d) |
| 8    | &emsp;Update Archive $A$ |
| 9    | Until convergence or maximum iterations |
| 10   | Pareto-optimal set of designs $D^* \leftarrow D$ |
| 11   | <b>return</b> $D^*$, the Pareto-optimal set of designs (optimized neural layer-to-PE and PE-to-tier mappings) |

First, we start with a randomly chosen mapping of DNN layers to PEs and PEs to planar tiers satisfying the mapping constraints: 1) a neural layer is mapped onto PEs of one device type and 2) a planar tier consists of PEs of one device type. It should be noted that it is possible for a neural layer to be mapped to different types of PEs in different tiers. However, this gives rise to synchronization issues, as each type of PE has a different latency and throughput. Computations involved in one neural layer need to be completed, and the activations must then be sent to the next neural layer. If a layer is mapped onto two different types of PEs with unequal timing characteristics, then the computation latency for that neural layer will be bottlenecked by the PE with the worst-case delay. This will lead to a degradation in the overall training performance. Hence, each neural layer is mapped onto PEs of one device type. Also, due to fabrication challenges, we refrain from integrating different types of NVM devices on the same tier.
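The two mapping constraints described above can be expressed as a simple validity check on a candidate design. The sketch below is illustrative only: the dictionary-based representation of $\pi$ and $\alpha$ and the name `is_valid_mapping` are assumptions made for this example, not the authors' data structures.

```python
def is_valid_mapping(layer_to_pes, pe_device, pe_tier):
    """Check the two mapping constraints for a candidate design.

    layer_to_pes : dict layer name -> list of PE ids (the mapping pi)
    pe_device    : dict PE id -> device type, e.g. "S", "R", or "F"
    pe_tier      : dict PE id -> tier index (the mapping alpha)
    """
    # Constraint 1: every layer is mapped onto PEs of exactly one device type.
    for pes in layer_to_pes.values():
        if len({pe_device[p] for p in pes}) != 1:
            return False
    # Constraint 2: every planar tier holds PEs of exactly one device type.
    tier_devices = {}
    for pe, tier in pe_tier.items():
        tier_devices.setdefault(tier, set()).add(pe_device[pe])
    return all(len(devs) == 1 for devs in tier_devices.values())
```

A perturbation step can call such a check and simply reject any move that returns `False`, so that the search only ever visits valid designs.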

TABLE II PIM Architecture Specifications

| Component  | Specification |
|------------|---------------|
| System     | 64 PEs distributed over 4 tiers (16 PEs/tier), 4 tiles/PE |
| ReRAM Tile | 96 SAR ADCs (8-bits), 128×96 DACs (1-bit), 96 crossbars, 128×128 crossbar array, 2-bit/cell resolution, 0.40 mm<sup>2</sup> |
| FeFET Tile | 256×48 S/A (1-bit), 48 crossbars, 256×256 crossbar array, 1-bit/cell resolution, 0.40 mm<sup>2</sup> |
| SRAM Tile  | 6T, 1-bit cell, 256 S/A, 8KB SRAM array (256×256), 9 column/row-decoder, 9 SR/ arrays, 0.40 mm<sup>2</sup> |

Next, we perturb a candidate mapping solution to get a new layer-to-PE and PE-to-tier mapping solution (Algorithm 1, line 4). Here, a valid perturbation is defined as allocating a randomly chosen neural layer to a different PE such that the mapping constraints mentioned above are satisfied. In each AMOSA iteration, the selected design is evaluated using the surrogate objectives for latency, area, power, and accuracy (Algorithm 1, line 6), and the nondominated set of designs and the Archive are updated based on this new design evaluation. At convergence or after the maximum number of iterations, we get the Pareto-optimal set of designs $D^*$ from the MOO solver. We first select the feasible designs from $D^*$ that fulfill the DNN accuracy constraint mentioned above (e.g., 1% accuracy loss with respect to the ideal condition) by performing cycle-accurate simulations. Finally, we select the best design $d_{\text{best}}$ from the feasible designs, i.e., the one that achieves the best performance in terms of either energy-efficiency (TOPS/W) or compute-efficiency (TOPS/mm<sup>2</sup>). It should be noted that the HuNT framework optimizes the layer-to-PE and PE-to-tier mapping at design time for a given DNN workload, and any other MOO solver can also be used to the same effect.
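The perturb-evaluate-update loop and the final selection of $d_{\text{best}}$ can be sketched as follows. This is a deliberately simplified stand-in for AMOSA: it keeps only the Pareto-archive bookkeeping and omits the annealing schedule, the dominance-based probabilistic acceptance, and the archive clustering of the actual solver. All function names, the dictionary keys, and the default arguments are assumptions made for this example; all objectives are assumed to be minimized.

```python
import random


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, design, objs):
    """Add (design, objs) unless it is dominated or duplicated; drop entries it dominates."""
    if any(dominates(o, objs) or o == objs for _, o in archive):
        return archive
    kept = [(d, o) for d, o in archive if not dominates(objs, o)]
    return kept + [(design, objs)]


def pareto_search(initial, perturb, evaluate, iterations=100, seed=0):
    """Archive-based search loop loosely mirroring lines 3-10 of Algorithm 1."""
    rng = random.Random(seed)
    archive = [(initial, evaluate(initial))]
    for _ in range(iterations):
        parent, _objs = rng.choice(archive)   # line 4: select a design from the archive
        child = perturb(parent, rng)          # lines 4-5: perturb it to get design d
        archive = update_archive(archive, child, evaluate(child))  # lines 6-8
    return archive                            # line 10: D* <- D


def select_best(candidates, max_accuracy_loss=1.0, metric="tops_per_watt"):
    """Keep designs within the accuracy-loss budget (in %), then maximize the metric."""
    feasible = [c for c in candidates if c["accuracy_loss"] <= max_accuracy_loss]
    return max(feasible, key=lambda c: c[metric]) if feasible else None
```

In the paper, the accuracy loss and efficiency figures fed to the selection step come from cycle-accurate simulation of each Pareto-optimal design, not from the surrogate objectives used inside the loop.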

#### IV. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we present comprehensive experimental 619 results for the HuNT-enabled 3-D heterogeneous PIM archi-620 tecture for DNN training. 621

## A. Experimental Setup

The HuNT optimization phase (described in Algorithm 1) is executed for 100 iterations, as this is sufficient to ensure the convergence of the AMOSA-based MOO. Algorithm 1 is executed at design time; hence, the time overhead is a one-time cost. The overall time complexity of the adapted AMOSA-based MOO solver is given by $O(T \times N \times (M + \log(N)))$, where *T* is the total number of iterations of the algorithm, *N* is the maximum number of nondominated solutions stored in the Archive, and *M* is the number of design objectives [29]. HuNT generates the optimized neural layer-to-PE mapping and the corresponding PE-to-tier mapping, which is then mapped to the proposed 3-D heterogeneous PIM-based architecture.

*3-D Heterogeneous PIM Architecture:* The PIM architecture considered in this work consists of a total of 64 PEs distributed over four planar tiers and connected using TSV-based vertical links. Each PIM-based PE has its own configuration, such as crossbar size, cell resolution, number of crossbars/6T cells, etc., as shown in Table II. We consider an *iso-PE area setting*, such that all PEs (irrespective of their device type) have the same area but different amounts of storage and compute capability. Considering the storage capacity of each PIM-based PE, the HuNT-enabled architecture can have a storage capacity of up to $\sim$75 MB. Each planar tier consists of 16 PEs of a particular PIM device type (SRAM/ReRAM/FeFET).

|            | # Layers | Learning<br>rate | Batch size | # Params |
|------------|----------|------------------|------------|----------|
| VGG11      | 11       | 0.01             | 64         | 1.5M     |
| VGG16      | 16       | 0.01             | 64         | 2.2M     |
| ResNet18   | 18       | 0.05             | 128        | 1.1M     |
| ResNet34   | 34       | 0.05             | 128        | 2M       |
| DenseNet40 | 40       | 0.01             | 128        | 900K     |

TABLE III DNN Workloads With CIFAR-10 Dataset

The area, energy, and latency of the SRAM, ReRAM, and FeFET devices and their associated peripheral circuits, such as ADCs, sense amplifiers (S/A), DACs, buffers, column/row decoders, and nonlinear activation units (ReLU), were modeled via NeuroSim [24]. The connectivity between PEs follows the 3-D mesh topology, and the workload-dependent inter-PE traffic is given as input to BookSim to estimate communication power and latency [30]. We employ HotSpot's default ambient temperature setting of 300 K to conduct thermal analysis with the power traces generated using NeuroSim and BookSim [28]. Finally, we model the thermal effects [shown in (6) and (7)] on the DNN accuracy using the PyTorch wrapper in NeuroSim for the different DNN models and datasets considered in this work. Following prior work, we use 16-bit fixed-point precision for the storage and computation of the DNN weights and activations in the forward pass, and 32-bit floating-point precision for the weight- and activation-gradient computation in the back-propagation phase [31]. The 3-D heterogeneous architecture utilizes a multicast-enabled 3-D mesh NoC as the interconnection backbone for communication between the PEs during DNN training [23]. In our experimental evaluation, we consider energy-efficiency (TOPS/W) and compute-efficiency (TOPS/mm<sup>2</sup>) as the two relevant performance metrics that capture the latency, area, and power objectives considered in Section III of this work.
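To make explicit how these two metrics fold in the latency, power, and area objectives, the following sketch uses the standard definitions: tera-operations per second, divided by power or by area. The function and argument names are illustrative and are not taken from the authors' toolchain.

```python
def efficiency_metrics(total_ops, latency_s, power_w, area_mm2):
    """Return (TOPS, TOPS/W, TOPS/mm^2) for one workload run.

    total_ops : number of operations executed
    latency_s : end-to-end execution time in seconds
    power_w   : average power in watts
    area_mm2  : silicon area in square millimeters
    """
    tops = total_ops / latency_s / 1e12  # throughput in tera-operations per second
    return tops, tops / power_w, tops / area_mm2
```

Lower latency raises both metrics, while lower power raises only TOPS/W and lower area raises only TOPS/mm<sup>2</sup>, which is why the two metrics together reflect three of the four design objectives.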

*DNN Models and Datasets*: We evaluate the performance of the HuNT design optimization framework considering the CIFAR-10, CIFAR-100, and TinyImageNet datasets with five diverse DNN models, namely, VGG11, VGG16, ResNet18, ResNet34, and DenseNet40. Table III shows the characteristics and parameters of the DNN models executed on the HuNT-enabled 3-D heterogeneous PIM architecture. As shown in Table III, the largest network considered in this work (VGG16) has about 2.2M parameters, which require $\sim$4.4 MB of storage; hence, it can be easily stored on the HuNT-enabled 3-D heterogeneous architecture (with a storage capacity of up to $\sim$75 MB) along with its activations and layer-wise gradients. However, for larger networks whose size exceeds the total storage capacity of the PEs in the system, we need to read/write weights and activations from/to main memory (DRAM), which incurs an additional latency penalty. Nevertheless, the layer-to-PE and PE-to-tier mapping obtained from the HuNT optimization framework is not affected by the off-chip memory accesses in the case of very large DNNs. In this work, we train the DNN models on the HuNT-enabled 3-D heterogeneous PIM architecture for 200 epochs using the stochastic gradient descent method to ensure their training convergence without overfitting.
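The storage figure quoted above follows from the 16-bit weight precision: 2.2M parameters at 2 bytes each is about 4.4 MB. The short sketch below repeats that arithmetic for every model in Table III; it counts weights only (activations and gradients need additional space), and the helper name is an illustrative choice.

```python
def weight_storage_mb(num_params, bits_per_weight=16):
    """Storage for the weights alone, in MB (1 MB = 10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Parameter counts taken from Table III.
PARAMS = {
    "VGG11": 1_500_000,
    "VGG16": 2_200_000,
    "ResNet18": 1_100_000,
    "ResNet34": 2_000_000,
    "DenseNet40": 900_000,
}
ON_CHIP_CAPACITY_MB = 75  # approximate on-chip capacity quoted in the text

storage = {name: weight_storage_mb(n) for name, n in PARAMS.items()}
fits = {name: mb < ON_CHIP_CAPACITY_MB for name, mb in storage.items()}
```

Every model's weights occupy only a small fraction of the roughly 75 MB available, which leaves room for activations and layer-wise gradients on chip.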



Fig. 3. Layer-to-PE and PE-to-tier mapping tradeoffs while running the DNN training task for ResNet34 model on the CIFAR-10 dataset.


## B. Layer-to-PE and PE-to-Tier Mapping Tradeoffs

The neural layer-to-PE and PE-to-tier mapping affect the overall latency, power, area, and DNN accuracy. The aim of the HuNT framework is to determine the optimum configuration of the heterogeneous 3-D manycore architecture that achieves a suitable balance among all these metrics. Fig. 3 presents the Pareto front considering the above-mentioned design objectives, while executing the training task on the ResNet34 model using the CIFAR-10 dataset as an example. Recall that the PE-to-tier mapping is represented by $\alpha = [t_1, t_2, \dots, t_Z]$, where a planar tier $t_z$ has PEs of one device type: ReRAM (R), FeFET (F), or SRAM (S). Fig. 3 shows a representative Pareto-optimal set of designs $D^*$ highlighted in black. It should be noted that all the Pareto-optimal configurations with heterogeneous PEs have the SRAM devices at the bottom tier, away from the heat sink, which is used for the gradient calculation during back-propagation. Further, to minimize the on-chip hardware resources for gradient computation, we do not need to process all the layers simultaneously, but just perform layer-by-layer weight gradient computation, following prior work [24]. Therefore, one tier of SRAM-based PEs is enough to support the layer with the largest size of activation gradients for the DNN models considered here. Due to the necessity of the SRAM tier for back-propagation, the homogeneous configurations where we have only one type of NVM PIM device, i.e., FeFET or ReRAM, are $[F_1, F_2, F_3, S_4]$ and $[R_1, R_2, R_3, S_4]$. Alternatively, the homogeneous configuration with only SRAM devices is $[S_1, S_2, S_3, S_4]$. All the design objectives shown in Fig. 3 are normalized with respect to the mapping corresponding to $\alpha = [S_1, F_2, F_3, R_4]$ (shown in green), since it has the worst DNN accuracy. As mentioned earlier, the impact of thermal noise on ReRAM-based PEs is more severe compared to FeFET- or SRAM-based PEs. Thus, the mapping $[S_1, F_2, F_3, R_4]$ has the worst DNN accuracy because 1) the high-power FeFET-based PEs on two planar tiers lead to thermal hotspots and 2) ReRAM-based PEs are mapped to the planar tier farthest from the heat sink. On the other hand, candidate mappings with all thermal-noise-resilient SRAM-based PEs, i.e., $[S_1, S_2, S_3, S_4]$, achieve the highest DNN accuracy, but at the cost of extremely high area. The FeFET-based PEs contribute to high power density in the mapping corresponding to $[F_1, F_2, F_3, S_4]$, resulting in a peak temperature of 380 K and lower DNN accuracy compared to $[R_1, R_2, R_3, S_4]$. However, the mapping $[R_1, R_2, R_3, S_4]$ incurs high write latency due to predominantly ReRAM-based PEs when compared to mappings on predominantly SRAM- or FeFET-based PEs. Thus, each homogeneous architecture scores high in one specific design metric while neglecting the others. On the other hand, heterogeneous 3-D architectures, such as $[R_1, F_2, F_3, S_4]$ and $[R_1, R_2, F_3, S_4]$, exploit device heterogeneity with optimal layer-to-PE and PE-to-tier mapping, and achieve suitable tradeoffs between power, latency, area, and DNN accuracy.

Fig. 4. Comparison of various Pareto-optimal layer-to-PE and PE-to-tier mappings in terms of (a) TOPS/W and (b) TOPS/mm<sup>2</sup>, while running training task for VGG16 model on the CIFAR-10 dataset as an example.

Next, we implement the HuNT-enabled Pareto-optimal set of designs $D^*$ and evaluate their performance in realistic settings. Fig. 4(a) and (b) shows the comparative performance evaluation of the architectures in terms of energy-efficiency (TOPS/W) and compute-efficiency (TOPS/mm<sup>2</sup>) for the DNN training task on the VGG16 model with the CIFAR-10 dataset as an example. As shown in Fig. 4, the neural layer mapping corresponding to a homogeneous SRAM-based PE configuration, i.e., $[S_1, S_2, S_3, S_4]$, leads to the lowest TOPS/W and TOPS/mm<sup>2</sup> due to higher power and area consumption when compared to FeFET- and ReRAM-based architectures. Similarly, $[F_1, F_2, F_3, S_4]$ achieves low TOPS/W due to high-power FeFET-based PEs. As shown in Fig. 4, the layer-to-PE and PE-to-tier mapping corresponding to $[R_1, R_2, F_3, S_4]$ (highlighted in red) achieves the highest TOPS/W and TOPS/mm<sup>2</sup> compared to the rest of the Pareto-optimal candidate mappings. This mapping utilizes the ReRAM- and FeFET-based PEs (on planar tiers 1-3) for low-precision computation in the forward phase and SRAM-based PEs (on planar tier 4) for high-precision gradient computation in the back-propagation phase. Further, the DNN layers processing a high number of activations are mapped to dense ReRAM-based PEs, resulting in higher TOPS/mm<sup>2</sup>, and are placed closer to the SRAM tier, reducing the communication energy, specifically during the back-propagation phase. Hence, this results in higher TOPS/W.

Next, we discuss the DNN layers' characteristics and their role in layer-to-PE and PE-to-tier mapping for the best-performing $[R_1, R_2, F_3, S_4]$ architecture. Fig. 5 shows the layer-wise mapping onto PEs and 3-D planar tiers for training the DenseNet40 and VGG16 models with the CIFAR-10 dataset as an example. As discussed earlier, high-precision gradients are calculated in the bottom tier (tier S4) and the forward phase computation is executed on tiers 1-3 of the $[R_1, R_2, F_3, S_4]$ architecture. As shown in Fig. 5(a), the initial layers in DenseNet40 process a higher number of activations than the latter layers and need more crossbars to store weights and activations. Therefore, these layers are mapped to dense, low-power ReRAM-based PEs (R2), which are also closer to tier S4 for faster exchange of gradients. On the contrary, the latter layers, with comparatively fewer activations and smaller kernels, are mapped on tier F3 (layers 14-24) and R1 (layers 30-40). However, as shown in Fig. 5(b), the layer-wise characteristics of VGG16 are different from those of DenseNet40, i.e., the initial layers process a higher number of activations but require fewer crossbars for storage and computation, due to the layers' input/output feature map and kernel sizes. Thus, the initial layers of VGG16 are mapped to FeFET-based PEs on tier F3, which have low latency but are less dense than ReRAMs. On the contrary, the middle layers consist of wider kernels and require more crossbars. Thus, these layers are mapped to dense, low-power ReRAM-based PEs on tier R2. This highlights the importance of considering the DNN layers' characteristics while finding the optimal layer-to-PE and PE-to-tier mapping to achieve high compute- and energy-efficiency.

## C. Overall Performance Evaluation

In this section, we present a thorough performance evaluation of the HuNT-enabled DNN layer-to-PE and PE-to-tier mapping for the proposed 3-D heterogeneous PIM architecture during DNN training. Fig. 6(a)-(c) compares the energy-efficiency, compute-efficiency, and accuracy, respectively, of the HuNT-enabled 3-D heterogeneous architecture (simply referred to as HuNT hereafter) with the homogeneous and existing heterogeneous counterparts for all DNN workloads considered in this work with the CIFAR-10 dataset. For this comparison, the homogeneous configurations are $[F_1, F_2, F_3, S_4]$, $[R_1, R_2, R_3, S_4]$, and $[S_1, S_2, S_3, S_4]$, as mentioned earlier. The existing heterogeneous counterparts considered in our comparative performance evaluation include the HyperX and AccuReD architectures [22], [23]. As discussed in the related work, HyperX leverages both ReRAM (R) and SRAM (S), while AccuReD leverages ReRAM- and GPU-based PEs to achieve high-performance DNN training. In our comparative performance evaluation with respect to HuNT, we use the configuration with two ReRAM tiers and two GPU tiers, $[R_1, R_2, GPU_3, GPU_4]$, for AccuReD and the $[R_1, S_2, S_3, S_4]$ configuration for HyperX [22], [23].

As shown in Fig. 6(a) and (b), HuNT achieves up to 20 TOPS/W and 10.73 TOPS/mm<sup>2</sup> on the CIFAR-10 dataset, which corresponds to a ~10× and ~8× improvement in energy- and compute-efficiency, respectively, over the all-SRAM homogeneous counterpart. As shown earlier in Table I, and corroborated in the literature, SRAM-based PIM architectures generally suffer from low energy- and compute-efficiency due to their high leakage power and significant area overhead, respectively [3]. Hence, they achieve a relatively low energy- and compute-efficiency of 2.2 TOPS/W and 1.1 TOPS/mm<sup>2</sup> on average across all DNN models as shown in Fig. 6(a) and (b), respectively. HuNT exploits device heterogeneity and DNN workload awareness to achieve up to a $1.2\times$ and $1.3\times$ improvement in TOPS/W and TOPS/mm<sup>2</sup>, respectively, over the homogeneous ReRAM configuration ($[R_1, R_2, R_3, S_4]$). Similarly, HuNT achieves an improvement of up to $2.6\times$ and $1.5\times$ in TOPS/W and TOPS/mm<sup>2</sup>, respectively, over the homogeneous FeFET configuration ($[F_1, F_2, F_3, S_4]$) on the CIFAR-10 dataset. Overall, HuNT outperforms HyperX and AccuReD by $3.1\times$ and $1.4\times$, respectively, on average in terms of energy-efficiency, and by $2.7\times$ and $1.5\times$, respectively, on average in terms of compute-efficiency on the CIFAR-10 dataset. This is because the high power and area of the SRAM-based PEs and the GPU-based PEs in HyperX and AccuReD, respectively, make them less compute- and energy-efficient.

Fig. 5. Layer-to-PE and PE-to-tier mapping for DNN training task on (a) DenseNet40 and (b) VGG16 models with CIFAR-10 dataset on (c) optimized $[R_1, R_2, F_3, S_4]$ architecture. Here, $R_1$ and $R_2$ refer to Tiers 1 and 2 with ReRAM-based PEs, respectively, $F_3$ refers to Tier 3 with FeFET-based PEs, and $S_4$ refers to Tier 4 with SRAM-based PEs.

Fig. 6. Performance evaluation of the HuNT-enabled 3-D PIM architecture against the state of the art in terms of (a) energy-efficiency (TOPS/W), (b) compute-efficiency (TOPS/mm<sup>2</sup>), and (c) accuracy of DNN workloads executed on the CIFAR-10 dataset. Here, for brevity, we use FFFS, RRRS, and SSSS to refer to the $[F_1, F_2, F_3, S_4]$, $[R_1, R_2, R_3, S_4]$, and $[S_1, S_2, S_3, S_4]$ homogeneous configurations, respectively.
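For reference, the two efficiency metrics and the improvement factors quoted above can be written as simple ratios. The symbols below ($T$ for throughput, $P$ for power, $A$ for area, and $m$ for either metric) are notation introduced here for illustration and do not appear elsewhere in the paper.

$$
\eta_{\text{energy}} = \frac{T\ [\text{TOPS}]}{P\ [\text{W}]}, \qquad
\eta_{\text{compute}} = \frac{T\ [\text{TOPS}]}{A\ [\text{mm}^2]}, \qquad
\text{improvement} = \frac{m_{\text{HuNT}}}{m_{\text{baseline}}}
$$

An improvement of $N\times$ therefore means that HuNT delivers $N$ times the baseline's throughput per watt (or per mm<sup>2</sup>) for the same workload.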

Fig. 6(c) compares HuNT with the homogeneous and heterogeneous counterparts in terms of accuracy. Here, the all-SRAM configuration achieves the highest accuracy due to its high reliability and lower vulnerability to thermal issues in the 3-D architecture [23]. However, the homogeneous FeFET and ReRAM counterparts suffer up to 4% and 2.5% accuracy loss, respectively. This is due to the high power consumption of FeFET-based PEs and the limited thermal endurance of ReRAM-based PEs when placed away from the heat sink. Overall, HuNT achieves less than 1% accuracy drop compared to the all-SRAM counterpart. In summary, our performance evaluation demonstrates that the HuNT-enabled 3-D heterogeneous PIM architecture achieves higher energy- and compute-efficiency than the homogeneous counterparts and the existing heterogeneous PIM-based architectures (AccuReD and HyperX). It achieves the highest TOPS/W and TOPS/mm<sup>2</sup> with negligible loss in DNN accuracy.

## D. Transferability Across Datasets

In this section, we demonstrate that the HuNT-enabled optimized layer-to-PE and PE-to-tier mapping ($d_{\text{best}}$) obtained using the CIFAR-10 dataset can be transferred to another dataset for training on the 3-D heterogeneous PIM architecture without compromising the DNN training accuracy and overall performance. Here, the $d_{\text{best}}$ for a given DNN workload is generated with a *source* dataset via the HuNT framework, and then mapped to the 3-D heterogeneous architecture for training using a *target* dataset. Figs. 7 and 8 demonstrate the transferability of $d_{\text{best}}$ generated using the CIFAR-10 dataset (as the source dataset) to the CIFAR-100 and TinyImageNet datasets (as the target datasets), respectively. In Figs. 7 and 8, we consider $d_{\text{best}}$ generated using CIFAR-100 and TinyImageNet, respectively, via the HuNT framework in each case as the *baseline*. We compare the performance of $d_{\text{best}}$ obtained using the CIFAR-10 dataset with respect to the baselines in terms of energy-efficiency (TOPS/W), compute-efficiency (TOPS/mm<sup>2</sup>), and the final DNN test accuracy as shown in Figs. 7(a) and 8(a), Figs. 7(b) and 8(b), and Figs. 7(c) and 8(c), respectively. Here, the configurations are denoted as $D_S \rightarrow D_T$, where $D_S$ represents the "source" dataset and $D_T$ represents the "target" dataset. In our analysis, we consider the CIFAR-10→CIFAR-100 and CIFAR-10→TinyImageNet configurations for the sake of brevity. However, it is worth noting that the HuNT framework is compatible with other image datasets, and the results shown here are reproducible for other configurations.

As shown in Fig. 7(a) and (b), we observe an average loss of less than 2% and 1.5% in energy- and compute-efficiency, respectively, compared to the baseline across all the DNN models for the CIFAR-10→CIFAR-100 configuration. Similarly, we observe an average loss of 3.1% and 1.9% in energy- and compute-efficiency, respectively, compared to the baseline for the CIFAR-10→TinyImageNet configuration, as shown in Fig. 8(a) and (b). Overall, we observe a negligible accuracy loss of less than 1% across all DNN models considered in both the CIFAR-10→CIFAR-100 and CIFAR-10→TinyImageNet configurations as shown in Figs. 7(c) and 8(c), respectively. Here, the transferability of the optimal neural layer mapping of $d_{\text{best}}$ across datasets is possible because the general DNN model behavior is often transferable between datasets. This idea is similar to transfer learning, where a model trained on one dataset can be reused with slight changes for another dataset [32]. Hence, the neural layer mapping of $d_{\text{best}}$ for a given DNN model can also be used with other datasets, and achieve similar levels of performance (energy- and compute-efficiency) with negligible accuracy loss. However, it is worth noting that the absolute values of the achievable performance in terms of TOPS/W and TOPS/mm<sup>2</sup> vary across datasets due to their unique characteristics. For example, the TinyImageNet dataset generates more activations during training compared to the CIFAR-10 dataset, which requires more PEs. Hence, for the same system configuration, HuNT achieves lower compute- and area-efficiencies for TinyImageNet compared to both the CIFAR-10 and CIFAR-100 datasets. Thus, although the dataset characteristics influence the absolute achievable performance, the overall trend is agnostic to the dataset. In addition, the transferability of $d_{\text{best}}$ across datasets eliminates the cost of repeating the MOO for more complex datasets. In essence, this further demonstrates the scalability and versatility of the HuNT-enabled optimized layer-to-PE and PE-to-tier mapping ($d_{\text{best}}$) to other datasets for DNN training on 3-D heterogeneous PIM accelerators.

Fig. 7. Transferability from CIFAR-10 to CIFAR-100 dataset for (a) energy-efficiency (TOPS/W), (b) compute-efficiency (TOPS/mm<sup>2</sup>), and (c) accuracy.

Fig. 8. Transferability from CIFAR-10 to TinyImageNet dataset for (a) energy-efficiency (TOPS/W), (b) compute-efficiency (TOPS/mm<sup>2</sup>), and (c) accuracy.
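The transferability losses reported in this section can be read as relative differences against the baseline mapping optimized directly on the target dataset. A minimal sketch of that calculation follows, assuming the loss is defined relative to the baseline value. The function name `relative_loss_percent` and the numeric inputs are hypothetical placeholders chosen for illustration, not values read from Figs. 7 or 8.

```python
def relative_loss_percent(baseline, transferred):
    """Percentage by which the transferred mapping falls short of the baseline.

    baseline:    metric (e.g. TOPS/W) for d_best optimized on the target dataset
    transferred: same metric for d_best optimized on the source dataset
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - transferred) / baseline


# Placeholder numbers for illustration only (not measured results).
baseline_tops_per_w = 10.0
transferred_tops_per_w = 9.8
loss = relative_loss_percent(baseline_tops_per_w, transferred_tops_per_w)
print(f"energy-efficiency loss: {loss:.1f}%")
```

The same helper applies unchanged to TOPS/mm<sup>2</sup>, since only the ratio to the baseline matters.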

## E. Lifetime and Endurance Analysis

Lifetime and write endurance of NVM-based PIM devices are crucial for DNN training due to the significant number of write operations required for the weight- and activation-gradient calculations as well as the weight updates in the backpropagation phase. For our analysis, we consider realistic write endurance limits for the FeFET-, ReRAM-, and SRAM-based PEs reported in prior work, as shown in Table I. As discussed earlier, the HuNT-enabled layer-to-PE mapping maps the weights in the DNN layers to both ReRAM- and FeFET-based PEs, and the weight- and activation-gradient computation is performed on SRAM-based PEs (i.e., the $[R_1, R_2, F_3, S_4]$ configuration). Therefore, the weights mapped to the ReRAM- and FeFET-based PEs need to be reprogrammed during the weight update phase. However, ReRAM and FeFET devices suffer from low write endurance, which limits the number of times that they can be reprogrammed before they fail due to faults [17].
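To see why write endurance bounds the training lifetime, assume that every weight update reprograms each NVM cell exactly once. Under that simplifying assumption, which ignores wear-leveling and partial updates, the number of epochs a cell can sustain is its endurance limit divided by the number of updates per epoch. The function name, the endurance value, and the batch size below are hypothetical placeholders, not values from Table I.

```python
def epochs_before_wearout(endurance_writes, updates_per_epoch):
    """Whole training epochs a cell survives if each weight update writes it once."""
    if endurance_writes < 0 or updates_per_epoch <= 0:
        raise ValueError("endurance must be non-negative and updates_per_epoch positive")
    return endurance_writes // updates_per_epoch


# Placeholder values for illustration only.
endurance_writes = 10**6   # assumed write-endurance limit of one cell
dataset_size = 50_000      # CIFAR-10 training set size
batch_size = 128           # assumed mini-batch size

# One weight update per mini-batch; ceiling division counts the final partial batch.
updates_per_epoch = -(-dataset_size // batch_size)

print(epochs_before_wearout(endurance_writes, updates_per_epoch))
```

With these placeholder inputs the budget is a few thousand epochs, so the endurance limit assumed for the NVM tiers directly determines how many training runs the hardware can support.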

In Fig. 9, we present a comparative trade-off analysis between the energy-efficiency and endurance of the homogeneous architectures and the HuNT-enabled heterogeneous architecture ($[R_1, R_2, F_3, S_4]$) executing the VGG11 DNN workload with the CIFAR-10 dataset. We observe that beyond the endurance limit for each device, the achievable performance (TOPS/W) begins to reduce, as the number of resources (PEs) available to perform reliable computation reduces due to failures of the NVM devices. The $[S_1, S_2, S_3, S_4]$ configuration achieves the lowest TOPS/W due to its significant leakage power; however, it has the highest endurance. Overall, the HuNT-enabled heterogeneous architecture achieves an improvement of $10\times$, $3\times$, and $1.2\times$ in terms of TOPS/W compared to the homogeneous SRAM-, FeFET-, and ReRAM-based architectures, respectively. At the same time, HuNT achieves similar write endurance as the homogeneous configurations with at least one type of NVM device ($[R_1, R_2, R_3, S_4]$ and $[F_1, F_2, F_3, S_4]$).



Fig. 9. Comparison of HuNT-enabled architecture with other homogeneous architectures in terms of energy-efficiency (TOPS/W) and endurance for training on VGG11 model with CIFAR-10 dataset as an example.

## V. CONCLUSION

PIM-based architectures enable high-performance and energy-efficient hardware accelerators for DNN training. However, each PIM device has specific advantages and drawbacks. Hence, a heterogeneous architecture that combines multiple PIM devices in a single system is necessary to achieve a suitable balance between all the required design metrics. A 3-D architecture enables the design of such a heterogeneous platform where each planar tier consists of PEs designed with one type of device. This also avoids the fabrication challenges of integrating disparate technologies on a single tier. In this work, we propose the HuNT framework, which finds an optimal layer-to-PE and PE-to-tier mapping for 3-D PIM-based heterogeneous architectures. Overall, the HuNT-enabled 3-D heterogeneous architecture achieves up to a $10\times$ and $8\times$ improvement in energy- and compute-efficiency, respectively, over the homogeneous counterparts and existing heterogeneous PIM-based architectures without compromising accuracy.

#### REFERENCES

- [1] W. Liu et al., "A survey of deep neural network architectures and their applications," *Neurocomputing*, vol. 234, pp. 11–26, Apr. 2017.
- [2] L. Song, X. Qian, L. Hai, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in *Proc. IEEE HPCA*, 2017, pp. 541–552.
- [3] K. Roy, I. Chakraborty, M. Ali, A. Ankit, and A. Agrawal, "In-memory computing in emerging memory technologies for machine learning: An overview," in *Proc. IEEE DAC*, 2020, pp. 1–6.
- [4] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in *Proc. ISCA*, 2016, pp. 14–26.
- [5] T. Soliman et al., "First demonstration of in-memory computing crossbar using multi-level cell FeFET," *Nat. Commun.*, vol. 14, no. 1, p. 6348, 2023.
- [6] Y. Long et al., "A ferroelectric FET-based processing-in-memory architecture for DNN acceleration," *IEEE J. Explor. Solid-State Computat. Devices Circuits*, vol. 5, no. 2, pp. 113–122, Dec. 2019.
- [7] A. Keshavarzi, K. Ni, W. Van Den Hoek, S. Datta, and A. Raychowdhury, "FerroElectronics for edge intelligence," *IEEE Micro*, vol. 40, no. 6, pp. 33–48, Nov./Dec. 2020.
- [8] A. Yusuf, T. Adegbija, and D. Gajaria, "Domain-specific STT-MRAM-based in-memory computing: A survey," *IEEE Access*, vol. 12, pp. 28036–28056, 2024.
- [9] G. Murali, X. Sun, S. Yu, and S. K. Lim, "Heterogeneous mixed-signal monolithic 3-D in-memory computing using resistive RAM," *IEEE Trans. Very Large Scale Integr. Syst.*, vol. 29, no. 2, pp. 386–396, Feb. 2021.
- [10] X. Peng et al., "Benchmarking monolithic 3D integration for compute-in-memory accelerators: Overcoming ADC bottlenecks and maintaining scalability to 7nm or beyond," in *Proc. IEEE IEDM*, 2020, pp. 30.4.1–30.4.4.

- [11] A. Kaul et al., "3-D heterogeneous integration of RRAM-based compute-in-memory: Impact of integration parameters on inference accuracy," *IEEE Trans. Electron Devices*, vol. 70, no. 2, pp. 485–492, Feb. 2023.
- [12] M. Yayla et al., "FeFET-based binarized neural networks under temperature-dependent bit errors," *IEEE Trans. Comput.*, vol. 71, no. 7, pp. 1681–1695, Jul. 2022.
- [13] X. Yang et al., "Multi-objective optimization of ReRAM crossbars for robust DNN inferencing under stochastic noise," in *Proc. IEEE/ACM ICCAD*, 2021, pp. 1–9.
- [14] A. Bhattacharjee, A. Moitra, and P. Panda, "HyDe: A hybrid PCM/FeFET/SRAM device-search for optimizing area and energy-efficiencies in analog IMC platforms," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 13, no. 4, pp. 1073–1082, Dec. 2023.
- [15] Y. Sun et al., "CREAM: Computing in ReRAM-assisted energy- and area-efficient SRAM for reliable neural network acceleration," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 70, no. 8, pp. 3198–3211, Aug. 2023.
- [16] G. Krishnan et al., "Hybrid RRAM/SRAM in-memory computing for robust DNN acceleration," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 41, no. 11, pp. 4241–4252, Nov. 2022.
- [17] W. Wen, Y. Zhang, and J. Yang, "ReNEW: Enhancing lifetime for ReRAM crossbar based neural network accelerators," in *Proc. IEEE ICCD*, 2019, pp. 487–496.
- [18] C. Eckert et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in *Proc. 45th ISCA*, 2018, pp. 383–396.
- [19] S. Spetalnick and A. Raychowdhury, "A practical design-space analysis of compute-in-memory with SRAM," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 69, no. 4, pp. 1466–1479, Apr. 2022.
- [20] S. Roy, M. Ali, and A. Raghunathan, "PIM-DRAM: Accelerating machine learning workloads using processing in commodity DRAM," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 11, no. 4, pp. 701–710, Dec. 2021.
- [21] M. R. Haq Rashed, S. K. Jha, and R. Ewetz, "Hybrid analog-digital in-memory computing," in *Proc. IEEE/ACM ICCAD*, 2021, pp. 1–9.
- [22] A. Kosta et al., "HyperX: A hybrid RRAM-SRAM partitioned system for error recovery in memristive Xbars," in *Proc. DATE*, 2022, pp. 88–91.
- [23] B. K. Joardar, J. R. Doppa, P. P. Pande, H. Li, and K. Chakrabarty, "AccuReD: High accuracy training of CNNs on ReRAM/GPU heterogeneous 3D architecture," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 40, no. 5, pp. 971–984, May 2021.
- [24] X. Peng et al., "DNN+NeuroSim V2.0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training," 2020, arXiv:2003.06471.
- [25] N. Ye et al., "Improving the robustness of analog deep neural networks through a Bayes-optimized noise injection approach," *Commun. Eng.*, vol. 2, no. 1, p. 25, 2023.
- [26] Y. Qin, Z. Yan, W. Wen, X. S. Hu, and Y. Shi, "Negative feedback training: A novel concept to improve robustness of NVCiM DNN accelerators," 2023, arXiv:2305.14561.
- [27] J. Cong, J. Wei, and Y. Zhang, "A thermal-driven floorplanning algorithm for 3D ICs," in *Proc. IEEE/ACM ICCAD*, 2004, pp. 306–313.
- [28] R. Zhang, M. Stan, and K. Skadron, "Hotspot 6.0: Validation, acceleration and extension," IBM T. J. Watson Res. Center, Univ. Virginia, Charlottesville, VA, USA, Rep. CS-2015-04, 2015.
- [29] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb, "A simulated annealing-based multiobjective optimization algorithm: AMOSA," *IEEE Trans. Evol. Comput.*, vol. 12, no. 3, pp. 269–283, Jun. 2008.
- [30] N. Jiang et al., "A detailed and flexible cycle-accurate network-on-chip simulator," in *Proc. IEEE ISPASS*, 2013, pp. 86–96.
- [31] H. Jin et al., "ReHy: A ReRAM-based digital/analog hybrid PIM architecture for accelerating CNN training," *IEEE Trans. Parallel Distrib. Syst.*, vol. 33, no. 11, pp. 2872–2884, Nov. 2022.
- [32] C. O. Ogbogu et al., "Accelerating graph neural network training on ReRAM-based PIM architectures via graph and model pruning," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 42, no. 8, pp. 2703–2716, Aug. 2023.
- [33] D. Niu et al., "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in *Proc. IEEE/ACM ICCAD*, 2013, pp. 17–23.
