# OPIMA: Optical Processing-in-Memory for Convolutional Neural Network Acceleration

Febin Sunny, Amin Shafiee, Abhishek Balasubramaniam, Mahdi Nikdast, *Senior Member, IEEE*, and Sudeep Pasricha, *Fellow, IEEE*

Abstract-Recent advances in machine learning (ML) have spotlighted the pressing need for computing architectures that bridge the gap between memory bandwidth and processing power. The advent of deep neural networks has pushed traditional Von Neumann architectures to their limits due to the high latency and energy consumption costs associated with data movement between the processor and memory for these workloads. One of the solutions to overcome this bottleneck is to perform computation within the main memory through processing-in-memory (PIM), thereby limiting data movement and the costs associated with it. However, dynamic random-access memory-based PIM struggles to achieve high throughput and energy efficiency due to internal data movement bottlenecks and the need for frequent refresh operations. In this work, we introduce OPIMA, a PIM-based ML accelerator, architected within an optical main memory. OPIMA has been designed to leverage the inherent massive parallelism within main memory while performing high-speed, low-energy optical computation to accelerate ML models based on convolutional neural networks. We present a comprehensive analysis of OPIMA to guide design choices and operational mechanisms. In addition, we evaluate the performance and energy consumption of OPIMA, comparing it with conventional electronic computing systems and emerging photonic PIM architectures. The experimental results show that OPIMA can achieve 2.98× higher throughput and 137× better energy efficiency than the best known prior work.

Index Terms-Convolutional neural networks, machine learning (ML) acceleration, photonic memory, processing-in-memory (PIM), silicon photonics.

#### I. INTRODUCTION


For emerging machine learning (ML) models being used across application domains [1], [2], [3], the exponential growth in their computational demands has significantly outpaced the rate of advances in traditional computing architectures [4], [5]. The resulting Von Neumann bottleneck, often referred to as the memory wall problem [6], is a critical challenge to overcome in order to support modern ML workloads. In response to the limitations posed by the Von Neumann

Manuscript received 6 August 2024; accepted 10 August 2024. This work was supported by the National Science Foundation (NSF) under Grant CNS-2046226, Grant CCF-1813370, and Grant CCF-2006788. This article was presented at the International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS) 2024 and appeared as part of the ESWEEK-TCAD Special Issue. This article was recommended by Associate Editor S. Dailey. (*Corresponding author: Febin Sunny.*)

The authors are with the Electrical and Computer Engineering Department, Colorado State University, Fort Collins, CO 80523 USA (e-mail: febinps@colostate.edu; amin.shafiee@colostate.edu; abhishek.balasubramaniam@colostate.edu; mahdi.nikdast@colostate.edu; sudeep@colostate.edu).

Digital Object Identifier 10.1109/TCAD.2024.3446870

architecture, various alternative paradigms are being explored by industry and academia. A promising alternate computing paradigm involves in-memory computing or processing-in-memory (PIM) [7]. PIM architectures propose a departure from traditional designs by integrating processing capabilities within the memory subsystem. This integration aims to minimize data movement, reduce latency, and minimize energy consumption associated with processing applications.

Given that dynamic random-access memory (DRAM) is the standard main memory technology today, it is an obvious candidate for PIM. Several prior efforts have focused on architecting DRAM-PIM [8], [9], [10]. However, conventional DRAM-based PIM systems have encountered challenges in achieving high throughput and energy efficiency. These challenges arise primarily due to internal data movement bottlenecks and the necessity for frequent memory refreshes. To address the energy and latency concerns associated with refreshes, nonvolatile memory (NVM) technologies, such as ReRAM [11], [12], spin-transfer torque RAM (STT-RAM) [13], and phase change material (PCM) memories [14], [15], [16], have been considered. However, ReRAM and STT-RAM technologies face fabrication challenges and endurance issues [17], [18]. ReRAM additionally suffers from resistance drift over time, which impacts data readout accuracy [17].

PCMs offer better energy efficiency, bit density, and bandwidth than other NVMs. They can switch between two physical states: 1) amorphous and 2) crystalline. In the context of electrically controlled PCM (EPCM) devices, these phase changes are induced by applying current through microheaters. It is possible to regulate the phase shift from amorphous to crystalline, enabling the creation of multilevel cells (MLCs) to store more data by adjusting the extent of the material's crystallization. However, utilizing the resistance in PCMs to encode data poses challenges as the resistance values that PCMs attain depend nonlinearly on the applied write voltage [19].

To address these challenges, optically programmed PCM (OPCM) cells can be considered [23]. OPCM cells are fabricated with PCM deposited on top of a photonic waveguide and are programmed through laser pulses. Here, in place of resistance, the refractive index of the PCM is the physical property used to represent data. Because OPCMs are programmed using laser pulses guided to them through on-chip waveguides, they are ideally suited for integration onto silicon photonic platforms. Silicon photonics is an emerging field that integrates photonic systems with electronics. This platform offers several advantages over traditional electronic circuits, including high throughput and low energy consumption, for specialized computation tasks [19], [20], [21], [22]. Merging this computational capability with an OPCM main memory could allow for high-speed in-memory computation without the data movement and refresh bottlenecks seen in DRAM-PIM.

1937-4151 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

In this article, we explore how to architect a photonic main memory to enable ML acceleration through PIM. We utilize the OPCM-based main memory from [23] as the backbone of our architecture and make several changes to it to support PIM. We have named our photonic PIM architecture for ML acceleration OPIMA.

The novel contributions in this article are as follows.

1) Scattering and back reflection-aware OPCM cell design to maximize bit density and minimize read errors per cell.

2) Full system design of an OPCM-based PIM architecture that can operate as a main memory while performing PIM.

3) Comprehensive comparison of operational efficiency of OPIMA with state-of-the-art accelerators.

#### II. BACKGROUND AND RELATED WORK

Before we discuss our PIM architecture and associated
 techniques, we review some fundamentals and background on
 PCMs, OPCM main memories, and photonic computing.

## A. Phase Change Materials

PCMs possess the ability to shift between amorphous and crystalline states, depending on the level of thermal energy applied. This energy must be sufficient to alter the material's temperature to either its melting temperature ($T_i$; for transitioning to the amorphous state) or its crystallization temperature ($T_g$; for shifting to the crystalline state). Transitioning to the amorphous state consumes more energy because the required melting temperature exceeds the crystallization temperature. It should be noted that it is possible to induce partial phase changes within PCMs, creating intermediate states by converting only a fraction of the material to either state. These transitions can be initiated through electrical or optical means. Electrical heating can be provided through PN junctions, whereas optically achieving phase changes requires a laser pulse, whose power and duration must be tailored to the material's specific transition energy needs. Common materials used for PCM applications include Ge<sub>2</sub>Sb<sub>2</sub>Te<sub>5</sub> (GST), Ge<sub>2</sub>Sb<sub>2</sub>Se<sub>4</sub>Te (GSST), and Sb<sub>2</sub>Se<sub>3</sub> [24].

A change in the phase of a PCM brings with it a change in the electrical and optical properties of the material. The two states have different electrical resistances and different optical refractive indices. These differences in characteristics can be leveraged for data representation, including multibit data representation, enabling dense PCM-based memories and, as discussed in this article, PIM architectures.

For EPCM applications, the high-resistance amorphous state is used to represent a binary 0, and the low-resistance crystalline state is used to represent a binary 1. This nonvolatile change in resistance allows the PCM cell to be paired with an access transistor to form a 1T1R EPCM memory cell and



Fig. 1. OPCM memory cells proposed in (a) COSMOS [31], (b) Photonic tensor core [15], and (c) COMET [23]. WG: waveguide; DC: directional coupler; MR: Microring resonator.

a corresponding memory array of these cells, as described in many prior works (e.g., [26], [27], [28], and [29]). However, as discussed earlier, EPCM memories face many challenges, such as asymmetric and high write latencies [30], nonlinear response to write voltages, and resistance drift.

OPCM memories rely on shifts in the material's refractive index to modulate optical transmission, enabling data storage and retrieval [24]. A deep understanding of a PCM's optical properties is crucial for the effective deployment of OPCM memories. A significant refractive index contrast, ensuring a clear distinction in optical transmission between phases, is vital for reducing optical signal losses and noise [25], which could otherwise lead to readout errors. Similar to the importance of resistance contrast in EPCM memories, a high refractive index contrast improves the signal-to-noise ratio (SNR) during data readout. This is extremely important not just from a data fidelity standpoint but also from a photonic PIM standpoint, as we must ensure error-free data readouts to ensure error-free calculations in the analog domain where photonic computations occur.

## B. OPCM Memory

A main memory architecture should have the ability to store large amounts of addressable data, which can be effectively retrieved and modified, whenever needed by the computing system. DRAMs achieve this by having row- and column-addressable memory cells, arranged into mats of cells, which in turn get organized into subarrays, and then banks. Collections of banks form memory chips, which are arranged into dual in-line memory modules (DIMMs) or 3-D high bandwidth memory (HBM) architectures. Modern memory addressing schemes and memory controllers expect this style of data storage and management to be interfaced with them. So, it is prudent to consider a similar style of data storage with OPCM memory as well. A few recent works have tackled the challenge of building an addressable OPCM memory [23], [31], which can be used for the DRAM-like memory organization described above.


The work in [31] introduced a straightforward design for a crossbar-based cell, illustrated in Fig. 1(a), in which the OPCM is positioned atop waveguide intersections. This cell design underpins the core of a main memory architecture called COSMOS. In the COSMOS OPCM memory, data are accessed through row and column access signals that operate on distinct optical wavelengths. These signals must be activated simultaneously to enable successful write operations within the memory structure. COSMOS also adopts a subtractive read technique. This method involves initially performing a read operation across an entire subarray. Subsequently, a reset signal is dispatched specifically to the row selected for reading, which clears its contents. Following this reset, the subarray undergoes another read operation. The two obtained readouts are then subtracted at the memory controller, which extracts the data from the intended row. Combined with the assumption that each cell can store up to 4 bits of information, this process significantly increases the bit density achievable by this architecture. However, this architecture is inherently susceptible to optical crosstalk as the data storage mechanisms end up interfering with one another. It is also particularly susceptible to thermal crosstalk from write operations in adjacent rows, especially when multibit storage is assumed, as shown in [23].
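The subtractive read sequence described above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that a subarray-wide read returns one aggregate value per column (the sum of the stored cell values in that column); this additive readout model and the function names are hypothetical and are not taken from [31].

```python
import numpy as np

def subarray_read(cells):
    """Aggregate readout: one summed value per column (assumed model)."""
    return cells.sum(axis=0)

def subtractive_read(cells, row):
    """Recover one row by reading, resetting that row, and reading again."""
    before = subarray_read(cells)
    cells[row, :] = 0            # reset signal clears the selected row
    after = subarray_read(cells)
    return before - after        # subtraction performed at the memory controller

# 3 x 3 subarray of 4-bit cell values (illustrative data only).
cells = np.array([[3, 7, 1], [15, 0, 9], [4, 4, 12]])
recovered = subtractive_read(cells, row=1)
```

In this simplified model the selected row is left cleared after the read, which mirrors the reset step in the description above.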

The work in [15] showcased an OPCM cell that was originally devised for photonic tensor core operation but deserves mention as memory-based ML acceleration work. The architecture has a crossbar structure to allow signals from orthogonal directions to interact with each other, enabling a wavelength-division multiplexing (WDM)-based broadcast and weight computation technique [33]. The OPCM cell itself, however, is placed away from the waveguide crossing and can interact with a wavelength propagating along the horizontal waveguide. So, in effect, each OPCM cell in [15] performs

$$W_{\text{cell}} \times \left[ \frac{\{A, \lambda_1\}}{n} + \frac{\{A, \lambda_2\}}{n} + \dots + \frac{\{A, \lambda_n\}}{n} \right] = W_{\text{cell}} \times A \tag{1}$$

where $W_{\text{cell}}$ is the weight stored in the OPCM, $A$ is the activation value imprinted onto the wavelength $\lambda_i$, and $n$ is the WDM degree (i.e., the number of wavelengths in the WDM batch), which corresponds to the number of cells per row. This operation makes it an excellent matrix-vector multiplication (MVM) engine, with low latency and energy-efficient operation. In addition, this cell [Fig. 1(b)] is compact and solves the interference and crosstalk issues that plague the COSMOS architecture [31] discussed earlier, and it would appear to be a good candidate for an OPCM-based PIM. However, the architecture is not column addressable, making it a poor choice for a memory architecture. To consider this cell for a memory architecture and then a PIM architecture, column addressability to cells is essential.
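As a worked instance of (1), consider illustrative values (assumptions chosen for arithmetic convenience, not taken from [15]) of $n = 4$, $A = 0.8$, and $W_{\text{cell}} = 0.5$. Each of the four wavelengths then carries $A/n = 0.2$, and the cell output is

$$0.5 \times \left[ 0.2 + 0.2 + 0.2 + 0.2 \right] = 0.5 \times 0.8 = 0.4$$

so the $n$ partial contributions recombine into the single product $W_{\text{cell}} \times A$.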

To address these issues, the work in [23], COMET, designed a row- and column-addressable OPCM memory cell [Fig. 1(c)], which is also isolated from other cells to avoid data corruption due to crosstalk. This memory cell makes use of GST for data storage, with two MRs acting as electro-optic access control. The MRs are electrically tunable using a PN junction and are active when they are in resonance, which allows a signal arriving through the vertical waveguide on the left to access the OPCM; the resulting signal is passed to the readout waveguide on the right [Fig. 1(c)]. While the proposed cell is not as compact as the one suggested in [15], it offers more reliable data readouts, without crosstalk-induced errors. Further, the GST in the cell was designed to allow for improved energy efficiency in write operations. The subarray architecture also had provisions to ensure loss correction through intermittent semiconductor optical amplifier (SOA) arrays. There are several desirable characteristics that make COMET a suitable backbone for a PIM architecture, but there are also several challenges, as will be discussed in Section III.

## C. Photonic Computation

The previous subsection discussed the characteristics required to realize an OPCM main memory. In this subsection, we discuss principles of photonic computation, which are a precursor to realizing a PIM solution with OPCM memory.

Photonic computation can be performed through either coherent or noncoherent analog computation methods [19]. Coherent photonic computation utilizes the phase of light waves in a controlled manner, enabling the encoding and manipulation (e.g., multiplication) of data via interference patterns. This approach takes advantage of the coherent properties of light, such as phase coherence and superposition, to perform complex mathematical operations rapidly and with high precision. Computing architectures that leverage coherent computing often make use of Mach–Zehnder interferometers (MZIs) for data manipulation through constructive or destructive interference with a single wavelength.

Noncoherent photonic computation, on the other hand, conventionally does not rely on the phase information of light [33]. Instead, it involves manipulation of the intensity or amplitude of light waves to perform computations, making it less sensitive to phase fluctuations and coherence issues that might affect coherent systems. Noncoherent approaches are simpler in terms of data encoding and more robust, as they have fewer noise sources to deal with. This makes them suitable for a wide range of applications that require optical signal processing, such as image processing, sensor data analysis, and fundamental arithmetic operations. In addition, they allow arithmetic operations to be performed at a very large scale through the usage of WDM, making noncoherent photonics an attractive option for MVM and general matrix multiply (GEMM) operations. To leverage WDM signals, the photonic devices used in noncoherent computation systems must be wavelength sensitive, which makes wavelength-selective MRs popular candidates for the fundamental devices in these architectures.

An MR is an on-chip optical resonator, which resonates when it encounters an optical wavelength that matches its resonant wavelength ($\lambda_{MR}$). Through tuning mechanisms, $\lambda_{MR}$ can be altered, increasing losses to the encountered wavelength, thus enabling amplitude modulation, and hence forming the basis for noncoherent computation. There are two main tuning mechanisms used: 1) thermo-optic (TO) tuning and 2) electro-optic (EO) tuning. Both these mechanisms can change the effective refractive index ($n_{\text{eff}}$) of the bulk of the MR, thereby affecting $\lambda_{MR}$ ($\lambda_{MR} = 2\pi n_{\text{eff}} R / m$, where $R$ is the MR radius and $m$ is the integer resonance order). TO tuning achieves this by heating the MR through microheaters, and EO tuning achieves the same through free carrier injection via a PN junction fabricated across the MR [19].
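To give a feel for how a small change in $n_{\text{eff}}$ moves the resonance, the following minimal sketch evaluates the standard ring resonance condition with an explicit integer resonance order. The radius, effective index, resonance order, and index change are hypothetical values chosen only so that the resonance falls in the C-band; dispersion is ignored.

```python
import math

def resonant_wavelength_nm(n_eff, radius_um, order):
    """Resonant wavelength in nm from lambda = 2*pi*n_eff*R / m."""
    return 2 * math.pi * n_eff * radius_um * 1e3 / order

# Hypothetical device parameters (not taken from the source).
N_EFF = 2.4
RADIUS_UM = 5.0
ORDER = 49
DELTA_N_EFF = 1e-3   # assumed index change from TO or EO tuning

base = resonant_wavelength_nm(N_EFF, RADIUS_UM, ORDER)
tuned = resonant_wavelength_nm(N_EFF + DELTA_N_EFF, RADIUS_UM, ORDER)
shift = tuned - base  # resonance shift in nm

print(f"base resonance: {base:.2f} nm, shift: {shift:.3f} nm")
```

Under this simplified model the relative shift equals the relative index change, so the shift scales linearly with the tuning strength.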

Several noncoherent computation architectures in prior work [20], [21], [22] rely on MR operation for high-throughput, reliable, low-energy ML inference acceleration, through the computation technique called broadcast and weight (B&W) [33]. Here, MRs are tuned to reflect a stationary matrix, and vectors are introduced either as amplitude-modulated wavelengths or via a subsequent array of tunable MRs downstream from the initial MR array's output. The interaction of light with the MRs modifies its amplitude to reflect a multiplication operation. Several of these light signals can be summed using a photodetector (PD), achieving *n* multiply and accumulate (MAC) operations simultaneously. Here, *n* is the WDM degree of the signal and should correspond to the size of the MR array.
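The B&W principle can be summarized in a few lines of code. The sketch below is an idealized numerical model, assuming normalized weights and activations in [0, 1], lossless devices, and no crosstalk; the function name is hypothetical.

```python
import numpy as np

def broadcast_and_weight(weights, activations):
    """Model one PD output: the sum over wavelengths of weight_i * activation_i."""
    weights = np.asarray(weights, dtype=float)
    activations = np.asarray(activations, dtype=float)
    if weights.shape != activations.shape:
        raise ValueError("need one weight per wavelength")
    weighted = weights * activations   # each MR scales the amplitude of its own wavelength
    return float(weighted.sum())       # the PD accumulates the power of all wavelengths

# WDM degree n = 3: three MACs are obtained in a single pass.
result = broadcast_and_weight([0.5, 0.25, 1.0], [0.2, 0.4, 0.6])
```

Repeating this for every row of a stationary matrix yields an MVM, with one PD per output element.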

From the discussions in Section II-B, the OPCM memory cell in Fig. 1(c) is a potential candidate to be part of noncoherent architectures that perform computation operations. The OPCM cells can represent the stationary matrix/vector element, while the incoming light signal or one of the access control MRs can represent the changing vector. At this point, performing a memory read operation through the OPCM cell achieves a multiplication operation. However, to achieve effective large-scale noncoherent computation via PIM, several challenges must be addressed, as discussed in the next section.

## III. REARCHITECTING OPCM MAIN MEMORY FOR PIM

In this section, we take a brief look at the COMET OPCM main memory architecture and explain why it cannot be used as is for effective noncoherent computation within a PIM solution.

The basic architectural component of the COMET main memory architecture is the OPCM memory cell depicted in Fig. 1(c). This memory cell is tiled to form an array, where each cell can be isolated from the others, while access is enabled through a wavelength assigned per column of the memory cells in the array. Row access is provided by turning on the access control MRs through EO tuning, thereby allowing the light signals access to the OPCM cell. $N \times N$ of these cells can form a subarray and $S \times S$ of these subarrays form a memory bank. A collection of $B$ memory banks constitutes the main memory.
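The hierarchy above translates directly into cell and bit counts. The following sketch tallies them for given $N$, $S$, and $B$; the example sizes are hypothetical and chosen only for illustration, and the default of 4 bits per cell is the COMET bit density quoted later in this section.

```python
def memory_organization(n, s, b, bits_per_cell=4):
    """Cell and bit counts for b banks of s x s subarrays of n x n cells."""
    cells_per_subarray = n * n
    subarrays_per_bank = s * s
    total_cells = b * subarrays_per_bank * cells_per_subarray
    return {
        "cells_per_subarray": cells_per_subarray,
        "subarrays_per_bank": subarrays_per_bank,
        "total_cells": total_cells,
        "total_bits": total_cells * bits_per_cell,
        "wavelengths_per_subarray": n,   # one wavelength per column
    }

# Hypothetical sizes, not the configuration used in the source.
org = memory_organization(n=256, s=8, b=4)
print(org["total_bits"] // (8 * 1024 * 1024), "MiB")
```

Because one wavelength is assigned per column, the WDM degree needed to address a single subarray row grows with $N$, which is the starting point for the first challenge listed below.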

There are four major challenges that must be overcome to adapt the COMET OPCM memory architecture for PIM.

1) Accessing all the cells in the same row across subarrays and banks requires $B \times S \times N$ wavelengths, which would be too energy- and power-expensive for a main memory of any reasonable size. During data read/write operations, the light signals are given access only to the subarray in which the corresponding row resides. This is achieved through the usage of GST-based waveguide switching, rather than splitting the WDM signal into multiple subarrays unnecessarily. It should be noted that using optical splitters and couplers would essentially multiply the laser power needed, and this must be avoided.

2) COMET was architected to enable a power consumption of under 10 W for the main memory operation. This power constraint allows it to operate at a power point similar to that of electronic main memory architectures, such as DDR5. However, from a PIM perspective, these choices pose a problem. Having limited access to subarrays, and hence OPCM cells, per read/write operation severely limits the achievable parallelization of computation operations. So, it is necessary to find a solution that enables multisubarray access, without disrupting the optical main memory operation. Note that we cannot rely on increasing the WDM degree or splitting signals from the source across multiple subarrays, as this will incur power consumption over the 10 W constraint, which mirrors the previous challenge.

3) Optical signals can interact with each other in the readout waveguides. Increasing the WDM degree to avoid using splitters carries with it the risk of increased crosstalk and errors, especially when using OPCM cells at higher bit densities. So, careful orchestration of access and readout is necessary to achieve reliable and error-free computations.

4) It is also important to consider the impact of bit density per cell on PIM operations. In COMET, a 4-bit per cell bit density was considered to ensure reliable memory operation. This limits possible neural network parameter sizes to 4-bit if there is a need to perform one-shot operations (e.g., multiplications) as discussed at the end of Section II. Careful architectural considerations are needed to handle higher parameter sizes for computation within COMET.
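To make the fourth challenge concrete, the sketch below shows one generic way, bit slicing, in which a parameter wider than a cell can be spread over several 4-bit cells and recombined with shift-and-add. This is a well-known general technique used here as an assumption for illustration; it is not presented as the mechanism adopted by COMET or OPIMA, and the function names are hypothetical.

```python
def split_nibbles(value):
    """Split an unsigned 8-bit value into (high, low) 4-bit slices."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in 8 bits")
    return value >> 4, value & 0xF

def multiply_sliced(weight, activation):
    """Multiply an 8-bit weight stored as two 4-bit cells by an integer activation.

    Each slice needs its own one-shot multiplication; the partial
    products are then combined electronically by shift-and-add.
    """
    high, low = split_nibbles(weight)
    return ((high * activation) << 4) + (low * activation)

product = multiply_sliced(0xA7, 9)   # 167 * 9
```

The cost of this approach is visible in the code: an 8-bit parameter needs two cells and two multiplications instead of one.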

In summary, there are several challenges associated with enabling PIM within an OPCM main memory. In our proposed OPIMA architecture, described in the next section, we address all these challenges via novel and significant alterations to an OPCM main memory architecture, to enable PIM within the memory platform, while still allowing it to retain its core functionality as a main memory solution.

## IV. OPIMA ARCHITECTURE


This section discusses the proposed OPIMA architecture and how it achieves PIM-based ML acceleration.

## A. Maximizing OPCM Memory Cell Efficiency

The OPIMA architecture is a PIM architecture that significantly expands the capabilities of the COMET main memory architecture. COMET explored how effective refractive index ($n_{\text{eff}}$) and optical absorption ($\kappa$) can be optimized for maximum energy efficiency in OPCM cells. Based on this analysis, the authors had selected GST as the best suited OPCM material for the C-band of frequencies.

In this work, we consider more detailed factors influencing the behavior of OPCM-based memory cells, particularly the unwanted changes in the optical transmission of the cells because of the scattering and back reflection of light when interacting with PCMs. The refractive index of the PCMs in crystalline and amorphous states is significantly higher than the refractive index of the waveguide material. Therefore, the propagating light can be scattered and reflected within the waveguide when interacting with the PCM on top of



Fig. 2. Design-space exploration of GST-based OPCM memory cell. (a) Optical transmission changes due to scattering and back reflections of the light  $(\Delta T_s)$  in the crystalline state. (b)  $\Delta T_s$  in the amorphous state. (c) Optical transmission contrast between amorphous and crystalline states ( $\Delta T$ ). Observe that for the chosen design point (highlighted with X), the  $\Delta T_s$  for both crystalline and amorphous states is less than 5% while the  $\Delta T$  is at its maximum with 96%.

the waveguide. Such a scattering effect leads to unwanted optical transmission changes at the output of the OPCM memory cell.

To tackle this limitation, we performed a design-space exploration using GST on top of a silicon waveguide to select the optimal geometry that offers minimal transmission change due to light scattering and maximum transmission contrast due to phase change. To capture the optimal design with minimized scattering of the light, we use the following model:

$$T_{\rm out} = T_{\rm in} - \Delta T_s - P_{\rm abs} \tag{2}$$

where $T_{\text{out}}$ is the output transmission of the cell, $T_{\text{in}}$ is the input power, $\Delta T_s$ is the optical transmission change due to light scattering and back reflections, and $P_{\text{abs}}$ is the total fraction of the power that is absorbed in the PCM cell (all in dB). We perform a design-space exploration of the PCM memory cell to minimize $\Delta T_s$ and thereby reduce read errors stemming from the scattering effect of the light. To maximize data signal strength, $\Delta T_s$ must be minimized so that the signal change due to written data ($P_{\text{abs}}$) is well represented in $T_{\text{out}}$:

$$T_{\text{out}} = (T_{\text{in}} - P_{\text{abs}}) \rightarrow \Delta T_s = 0. \tag{3}$$

This model is applicable to both amorphous and crystalline states of the cell. In addition, the desired OPCM memory cell should offer: 1) high optical transmission, which originates from the low power absorption in the amorphous state and 2) high absorption and hence low optical transmission in the crystalline state. Consequently, the optimum design point should offer minimized light scattering and back reflections in both crystalline and amorphous states while leveraging the high controlled optical transmission contrast. Therefore, $\Delta T_s$ and the total optical transmission contrast between amorphous and crystalline states ($\Delta T = T_a - T_c$) can be used as a figure-of-merit to find the optimal design for the GST-based OPCM memory cell. This optimal design should offer a low $\Delta T_s$ in the amorphous and crystalline states and a high optical transmission contrast ($\Delta T$) between amorphous and crystalline states.
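One simple way to turn this figure-of-merit into a selection rule is sketched below: candidates whose scattering-induced change exceeds a limit in either state are discarded, and the remaining candidate with the largest contrast $\Delta T = T_a - T_c$ is chosen. Treating $\Delta T_s$ as a hard constraint, the 5% limit used as a default, the dictionary field names, and the use of linear transmission fractions rather than dB are all assumptions made for this illustration. The candidate values are placeholders, not simulation results.

```python
def select_design_point(candidates, max_scatter=0.05):
    """Pick the feasible candidate with the largest contrast dT = T_a - T_c."""
    feasible = [
        c for c in candidates
        if c["dts_amorphous"] <= max_scatter and c["dts_crystalline"] <= max_scatter
    ]
    if not feasible:
        raise ValueError("no candidate satisfies the scattering limit")
    return max(feasible, key=lambda c: c["t_amorphous"] - c["t_crystalline"])

# Placeholder geometries for illustration only.
candidates = [
    {"name": "g1", "t_amorphous": 0.90, "t_crystalline": 0.20,
     "dts_amorphous": 0.02, "dts_crystalline": 0.03},
    {"name": "g2", "t_amorphous": 0.95, "t_crystalline": 0.05,
     "dts_amorphous": 0.04, "dts_crystalline": 0.09},   # too much scattering
    {"name": "g3", "t_amorphous": 0.93, "t_crystalline": 0.10,
     "dts_amorphous": 0.03, "dts_crystalline": 0.04},
]
best = select_design_point(candidates)
```

Here the candidate with the highest raw contrast is rejected because its crystalline-state scattering exceeds the limit, which illustrates why both quantities are needed.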

The design space exploration results for a 2-$\mu$m long GST cell that we designed are reported in Fig. 2. Observe that for the design point which offers the highest optical transmission contrast ($\Delta T$) highlighted in Fig. 2(c), the transmission changes due to light scattering and back reflections


Fig. 3. Architectural overview of OPIMA.

are always less than 5% in the crystalline state [Fig. 2(a)] and the amorphous state [Fig. 2(b)]. In addition, GST offers a high controlled optical transmission contrast ($\sim$96%) for the optimal design point shown in Fig. 2(c), which corresponds to a width of 0.48 $\mu$m and thickness of 20 nm. This high contrast in transmission also allows us to program 16 levels of transmission per cell, allowing a bit density of 4 bits/cell.
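The link between 16 transmission levels and 4 bits/cell is simply $\log_2 16 = 4$. The sketch below makes the mapping explicit under the assumption of evenly spaced levels; the end-point transmissions of 0.02 and 0.98 are hypothetical values chosen so that their difference matches the roughly 96% contrast quoted above, and the function names are illustrative.

```python
import math

def bits_per_cell(levels):
    """Whole bits that can be encoded with the given number of levels."""
    if levels < 2:
        raise ValueError("need at least two levels")
    return int(math.log2(levels))

def level_to_transmission(level, t_min, t_max, levels=16):
    """Target transmission for a stored level, assuming evenly spaced levels."""
    return t_min + (t_max - t_min) * level / (levels - 1)

def transmission_to_level(t, t_min, t_max, levels=16):
    """Nearest stored level for a measured transmission."""
    step = (t_max - t_min) / (levels - 1)
    level = round((t - t_min) / step)
    return min(max(level, 0), levels - 1)

# Hypothetical end points spanning a 96% contrast.
T_MIN, T_MAX = 0.02, 0.98
spacing = (T_MAX - T_MIN) / 15   # gap between adjacent levels
```

With these assumed end points, adjacent levels are separated by about 6.4% of transmission, which gives a sense of how much readout error each level can tolerate.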

The OPCM memory cell that we designed and optimized forms the building block of the OPIMA architecture that is designed for efficient data storage and access, as well as for performing in-situ multiplication operations. For the sake of maintaining row and column addressability, and hence main memory operation, we combine this OPCM memory cell with double MRs for optical access control.

## B. OPCM Memory Operation

An overview of how OPIMA is designed to operate as a memory interfaced with an external general-purpose electronic CPU is shown in Fig. 3. A controller unit that handles the electro-optical interfacing requirements must reside between the CPU and OPIMA, as depicted in the figure. This controller unit interprets memory commands from the host CPU, enabling main memory operation. It also supports data caching for read data to be sent to the CPU or data to be written to the OPCM memory. In the latter case, the data are encoded via optical signals derived from the laser source.

The isolated OPCM cells within OPIMA make read/write operations quite straightforward. For both operations, the row ID and subarray ID must be deciphered from the physical address. Once this has been done, laser signals are sent to



Fig. 4. Memory (a) write and (b) read operation in OPIMA; OPIMA utilizes multiple read signals simultaneously to perform computation operations. The differences in control flow between a memory read operation and performing in-memory computation are highlighted in (b).

the corresponding OPCM bank. The read process [Fig. 4(b)] happens as the signal passes through the memory cell and is modulated by the OPCM's optical transmission. The read data are sent back to the E-O-E controller where they are demodulated using an MR array. Then, these data can be translated to the electronic domain and passed on to the CPU. The write process [Fig. 4(a)] requires much higher energy as it requires inducing partial phase transition in the OPCM memory cells. This necessitates more laser power to achieve the phase transition across multiple OPCM cells, based on the data to be written.

During the read and write operations, data integrity is a critical concern, especially considering the loss tolerance in signal transmission. OPIMA incorporates SOAs within and outside the banks and subarrays to maintain signal quality. We employ row-wise loss-aware signal amplification to counteract potential degradation. The banks and subarrays, once designed, have constant losses, facilitating this correction approach.

## C. OPIMA PIM Architecture

As discussed earlier, the optical transmission of an OPCM cell modulates the optical signal passing through it. If the access control MR is tuned to represent the second parameter, the successive modulations from the MR and the OPCM can achieve a multiplication operation. However, since we need all the MRs in a row to behave identically to facilitate row access, it is better to tune the incoming laser signal to represent the second parameter. To achieve an accumulate operation, we must let two signals of the same wavelength, modulated to reflect products, interact with each other. To perform this step, we need to involve products from another subarray sharing the same readout waveguide bus. Within the readout waveguide bus, these signals interfering with each other generate the sums. This is desirable from a PIM perspective but will lead to erroneous readouts from a main memory perspective. Hence, for achieving this goal and thus realizing the PIM operations for ML inference acceleration, we need several architectural changes to the main memory architecture, as discussed next.

To realize high throughput and error-free PIM operation in OPIMA, we need to address four major challenges: 1) we need to leverage additional mechanisms to increase memory access and computation parallelism beyond those offered by WDM; 2) reads should be supported from a selected subarray or a group of subarrays as needed, without interrupting the main memory operation; 3) when simultaneously read out, the data from computation outputs and main memory accesses must not interfere with each other in an undesirable manner; and 4) the architecture should support PIM operations between parameters (e.g., CNN weights and activations) of any size, irrespective of the specific bit density used in the OPCM cells.

*1) Implementing MDM for Improved Parallelism:* To address challenge 1), within OPIMA, we design the multibank OPCM memory organization to go beyond WDM and additionally use mode-division multiplexing (MDM) to enable parallel access across banks [Fig. 5(a)]. MDM involves exciting higher order modes in an MDM waveguide bus, where each of the modes of a wavelength can then be used for supporting parallel data transfers and computations. Note that multiple wavelengths co-existing in the waveguide bus (WDM) provide further parallelism for data transfers and computations. Increasing the number of modes comes at the cost of increased width of the individual waveguide to allow the higher order modes to be excited and propagated, as well as increased crosstalk. Thus, determining the optimal number of modes (MDM degree) requires a careful tradeoff analysis.

We inverse designed photonic mode converters based on [34] to exploit the first four modes of TE polarization. Compared to conventional mode converters based on tapered structures or thickness changes to induce the required index change, the inverse-designed mode converters offer a compact footprint and minimal loss. Note that exciting more than four modes in the waveguide at the same time is physically challenging as it requires extremely wide waveguides that significantly increase memory area. In addition, higher order modes suffer from intermodal crosstalk due to the overlap of the modes [35], [36]. Based on our MDM propagation analyses, we decided to keep the MDM degree to four, which limits the number of banks in the architecture to four. These MDM signals can be filtered by mode-sensitive MRs to their respective banks and be routed to their respective subarrays through GST switches, enabling parallel read/write operations across banks. However, there is a need to improve parallelism further to achieve higher PIM throughput. In addition, while it is technically possible to perform an MAC operation by reading from two OPCM cells, this operation will be limited to 4-bit parameters under the configuration discussed here.

*2) Redesigning Banks for Concurrent PIM and Memory Access:* A memory bank within the OPIMA architecture is



Fig. 5. OPIMA's PIM-specific architecture. (a) OPCM bank organization. (b) Subarray organization within the bank, showcasing grouping, aggregation unit, and computation specific waveguides, coupling MRs, and mode converters (MCs). (c) Subarray group internals; each subarray is equipped with an MDL array for PIM operation independent of main memory operation. (d) Low loss waveguide (wg) crossings designed using inverse design. (e) GST cells used for subarray access control during OPCM main memory operation. (f) OPCM memory cell with EO tuned MRs showcased. (g) OPCM memory array within subarrays, with  $R \times C$  OPCM cells within it.

composed of $R \times C$ OPCM cells [Fig. 5(g)], offering a total capacity determined by the product of the number of cells and the bit density of each OPCM MLC. To enhance energy efficiency, banks are divided into subarrays. The OPIMA architecture employs electrically controlled GST-based waveguide switching to facilitate efficient subarray access [Fig. 5(e)], markedly reducing the laser power requirements. The GST switch introduces minimal losses and is pivotal for the energy-efficient operation of the system. We need to make changes to this organizational structure and provide additional access mechanisms to address challenge 2).

Data within OPCMs cannot be sensed in the same manner as charge-based storage in DRAM. Accessing data in OPCM cells necessitates external laser signals, which must overcome several losses in propagation, to be rerouted to the subarrays within which the OPCM cell resides. This leads to high power consumption, which is needed to overcome the losses and the splitting of the signals toward several destinations. To circumvent this, we propose the addition of local laser sources to subarrays, which can be triggered as needed for reads. Fortunately, unlike OPCM write operations, OPCM read operations are not energy intensive [23] and hence we can employ low-power lasers.

For OPIMA we opted for low-power microdisk laser (MDL) arrays [Fig. 5(c)], which can be integrated with every subarray. Each subarray uses $C$ MDLs, matching the number of columns per subarray. The laser output from the MDL array can be coupled onto the signal input waveguide of the corresponding subarray, using directional couplers. Using the MDL arrays, we can access any row within a subarray, without the involvement of the external laser source which drives the main memory operation. In addition, since the MDL arrays are independent of each other, several of them can be activated simultaneously to read from multiple subarrays without having to reroute signals or incur additional losses.

Moreover, to ensure that we can read for PIM while main memory operations happen in parallel, the subarrays are divided into several groups [Fig. 5(b)]. One row of subarrays per group can be employed for PIM at a time, while the rest of the subarrays can be used for main memory read/write operations. This ensures significant parallelism in MAC operations that can be executed simultaneously per bank, offering simultaneous solutions to challenges 1) and 2).

*3) Reducing Output Interference:* Now that we have several MAC operations being supported simultaneously, we must ensure that their results can be aggregated without interfering with each other or the main memory readout operations, to address challenge 3). It should be noted that the subarrays make use of WDM signals which can interfere with each other constructively or destructively.

To avoid computation signals interfering with memory read operations, we employ a series of computation-specific waveguides. Computed data are rerouted to the computation waveguides rather than the data-out waveguide using coupling MRs which can be activated alongside the MDL array [Fig. 5(c)]. The computation waveguide is used to move the data to the aggregation unit in the bank. To prevent losses and the computed signal from interfering with orthogonally traveling data signals, all the waveguide crossings in the computation waveguide have been carefully designed to be as leakage-free as possible [Fig. 5(d)].

To achieve the optimized waveguide crossing design, we used a photonic inverse design technique to minimize the loss and crosstalk of the waveguide crossings. The Lumerical FDTD solver [37] with the LumOpt [38] inverse design library was used to perform the geometry optimization of the waveguide crossings. The optimized geometry of the waveguide crossing is shown in Fig. 6. Note that the transmission of the fundamental TE mode was used as a figure-of-merit in our inverse design optimization of the waveguide crossing. We can observe from the figure that the inverse-designed waveguide crossing offers the maximum transmission at the C-band, with less than 0.001% of the input optical signal being lost; the crossing also offers a minimal crosstalk of -40 dB in the C-band.



Fig. 6. Low-loss waveguide crossing designed with inverse design methodology (left) and its loss profile for C-band (right).

As the data reaches the aggregation unit, they have to be merged. Here again, interference between signals can be an issue. As discussed earlier in this subsection, we can make use of up to four modes without significant crosstalk between the signals. We can reuse the orthogonality of modes here again. Each subarray group can be assigned a mode using a mode converter (MC), before it merges with the waveguide carrying the signals to the aggregation unit's demultiplexer (demux). These changes to the architecture solve challenge 3).

*4) Addressing Bit Size Mismatches:* OPCM cells within the photonic memory can be designed to have different bit densities, e.g., 1 bit/cell, 2 bit/cell, 4 bit/cell, etc. However, the parameters in an ML model like a CNN can be 32 bits in size without quantization. They can also be quantized to lower bitwidths, such as 16, 8, or 4 bits to reduce storage requirements and to reduce computation latency and energy. In scenarios where there is a mismatch between OPCM cell bit density and the CNN parameter size (e.g., 4 bits/cell bit density with 8-bit CNN parameters), the one-shot multiplication operation achieved by reading the OPCM cell, as discussed earlier, is not feasible.

To support different bitwidth scenarios and tackle challenge 4), we make use of a time division multiplexing (TDM)-based approach. For parameter sizes larger than the 4-bit cell density (i.e., a nibble), each nibble of one parameter will have to interact with every nibble of the other parameter. This can be achieved without significant loss in throughput because of solutions for challenges 1)-3) which offer high parallelism in MAC operations, while the signals can stay disentangled from each other. However, we still have to perform shift-and-add operations to obtain the true results for these operations [39]. These necessary operations are facilitated within the aggregation unit [Fig. 5(b)]. This results in an overall drop in throughput, but facilitates flexibility in operation, unconstrained by the OPCM MLC bit density.
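The nibble-wise decomposition can be sketched in a few lines. This is an illustrative model that assumes unsigned operands and exact 4-bit products; the function names are hypothetical, and the sketch abstracts away how the partial products are scheduled over time slots or wavelengths in the hardware.

```python
def split_nibbles(value: int, bits: int) -> list[int]:
    """Split an unsigned `bits`-wide value into 4-bit nibbles, least significant first."""
    if bits % 4 != 0 or not 0 <= value < (1 << bits):
        raise ValueError("value must fit in a bit width that is a multiple of 4")
    return [(value >> shift) & 0xF for shift in range(0, bits, 4)]


def nibble_multiply(a: int, b: int, bits: int = 8) -> int:
    """Multiply two unsigned operands using only 4-bit x 4-bit products.

    Each inner product stands in for one optical cell read; the shift and
    the running sum stand in for the shift-and-add step performed
    electronically after readout.
    """
    total = 0
    for i, a_nib in enumerate(split_nibbles(a, bits)):
        for j, b_nib in enumerate(split_nibbles(b, bits)):
            partial = a_nib * b_nib            # one 4-bit by 4-bit product
            total += partial << (4 * (i + j))  # weight by nibble positions
    return total


if __name__ == "__main__":
    print(nibble_multiply(0xA7, 0x3C), 0xA7 * 0x3C)
```

For 8-bit operands this requires four partial products per multiplication, and in general $(b/4)^2$ partial products for $b$-bit operands, which is the source of the throughput cost noted above.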

The aggregation unit is essential to address challenge 4), but it also provides some additional benefits. The PD-based conversion to the electrical domain acts as a noise filtering mechanism. The wavelength-specific PDs offer disentanglement from crosstalk between wavelengths, improving SNR before the longer transmission to the E-O-E control unit. In addition, the parameters can be stored within the SRAM cache within the aggregation unit, for additional accumulation events from the operations. Finally, the readout signals for the MAC operations, which were generated using low-power MDLs, will be regenerated through DACs and vertical-cavity surface-emitting lasers (VCSELs) for better fidelity before they reach the E-O-E controller, which handles further aggregation and applies nonlinear activation functions (see Fig. 3) for ML inference operations.


## D. CNN Mapping and Inference in OPIMA

The architectural design choices discussed in the previous subsection allow the OPIMA architecture to realize high power efficiency and high-integrity, large-scale parallel MAC operations and main memory accesses in the optical domain. From a CNN inference perspective, this offers two-fold benefits. First, MAC operations are fundamental operations in CNNs and OPIMA can perform them with high degrees of parallelism. Second, CNNs in general require significant storage and data movement between layers, but this can be significantly reduced as the processing occurs within the memory where model parameters and activation feature maps are stored.

To leverage the parallelism offered by the PIM substrate in OPIMA for CNN inference, we need to efficiently map CNNs onto the OPCM arrays. For CNNs, this involves mapping the parameters from both convolutional layers and fully connected layers. Operations for both types of layers can be mapped into MVM operations. For convolutional layers, we adopt an input stationary dataflow approach, where the input data can stay in its native storage location while we drive the smaller weight matrices (decomposed as vectors) through them. Because of the large row sizes within the subarrays, we will be able to drive several kernels simultaneously. The feature map must be divided across subarrays, so that we can access subsequent rows of the map from neighboring subarrays. The kernel rows which must operate on the feature map can be encoded into laser signals through MDL tuning and be introduced into the subarrays. In addition, we can achieve several parallel MAC operations through in-waveguide interference of WDM signals, from multiple subarrays within the same subarray group.

Let us consider a simple example with a $2 \times 2$ kernel, a feature map ($F$) with a row size of 4 elements, and an MDL array generating wavelengths $\{\lambda_1, \lambda_2, \ldots, \lambda_C\}$ ($C$ = number of columns per subarray). The kernel can be broken down into two vectors and mapped to MDL wavelengths: $k_1 = \{k_{00}, k_{01}\} \rightarrow \{\lambda_1, \lambda_2\}$ and $k_2 = \{k_{10}, k_{11}\} \rightarrow \{\lambda_1, \lambda_2\}$. Similarly, the rows in $F$ can be broken down into vectors and mapped to subarrays: $\{f_{00}, f_{01}, f_{02}, f_{03}\} \rightarrow Subarray_1$ and $\{f_{10}, f_{11}, f_{12}, f_{13}\} \rightarrow Subarray_2$. Both subarrays must be within the same subarray group to facilitate the MAC operation. If we now enable access to the rows containing these vectors and simultaneously send the $k_1$ and $k_2$ signals from the MDLs through the subarrays, we shall obtain the following in the common readout waveguide bus: $\{(k_{00} \times f_{00}, k_{10} \times f_{10}), \lambda_1\}$, $\{(k_{01} \times f_{01}, k_{11} \times f_{11}), \lambda_2\}$.

Because signals of the same $\lambda_i$ interfere with each other, this in turn generates: $(k_{00} \times f_{00} + k_{10} \times f_{10})$, $(k_{01} \times f_{01} + k_{11} \times f_{11})$, which is one addition away from generating the first element of an output feature map. This addition can be

TABLE I OPTICAL LOSS AND POWER PARAMETERS CONSIDERED FOR OPIMA

| Loss parameters          | Values            |
|--------------------------|-------------------|
| Directional coupler loss | 0.02 dB [42]      |
| MR drop loss             | 0.5 dB [43]       |
| MR through loss          | 0.02 dB [44]      |
| Propagation loss         | 0.1 dB/cm [45]    |
| Bending loss             | 0.01 dB/90° [46]  |
| EO tuned MR drop loss    | 1.6 dB [47]       |
| EO tuned MR through loss | 0.33 dB [47]      |
| SOA gain                 | 20 dB             |
| **Energy parameters**    | **Values**        |
| OPCM read                | 5 pJ [23]         |
| OPCM write               | 250 pJ [23]       |
| EPCM write               | 860 nJ [48]       |
| DRAM access              | 20 pJ/bit [49]    |
| ADC                      | 24.4 fJ/step [50] |

performed at the aggregation unit. The kernel can be moved across the MDL array to reflect the stride operation and further outputs can be obtained. In addition, multiple kernels can be deployed simultaneously over $F$, across different wavelengths, reducing overall processing time requirement. This mapping process scales easily with kernel sizes as well, if the kernel sizes do not exceed the subarray row size.
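The $2 \times 2$ example can be traced numerically with a small sketch. It is an idealized integer model: products on the same wavelength are simply added, and optical loss, phase, and noise are ignored. The function names, the `offset` argument used to mimic shifting the kernel across the MDL array, and the sample values are illustrative assumptions.

```python
def readout_bus(kernel_rows, feature_rows):
    """Per-wavelength sums formed on the shared readout waveguide bus.

    kernel_rows[r][c] is carried on wavelength c and passes through the
    subarray holding feature_rows[r]; same-wavelength products coming from
    different subarrays accumulate.
    """
    n_wavelengths = len(kernel_rows[0])
    sums = [0] * n_wavelengths
    for k_row, f_row in zip(kernel_rows, feature_rows):
        for c in range(n_wavelengths):
            sums[c] += k_row[c] * f_row[c]
    return sums


def output_element(kernel_rows, feature_rows, offset=0):
    """One output feature-map element: bus sums followed by the final addition."""
    width = len(kernel_rows[0])
    window = [row[offset:offset + width] for row in feature_rows]
    return sum(readout_bus(kernel_rows, window))


if __name__ == "__main__":
    kernel = [[1, 2], [3, 4]]               # rows k_1 and k_2
    feature = [[1, 0, 2, 1], [0, 1, 3, 2]]  # rows stored in Subarray_1 and Subarray_2
    print(readout_bus(kernel, [row[:2] for row in feature]))
    print([output_element(kernel, feature, offset=o) for o in range(3)])
```

Each entry returned by `readout_bus` corresponds to one wavelength on the bus, and the final `sum` plays the role of the single remaining addition described above.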

For fully connected layers we opt for a weight-stationary approach. In both cases, the stationary matrix must be distributed across subarrays to ensure parallelism in operations. Once this mapping process is done, OPIMA's PIM-specific architecture (Fig. 5), as described in this section, can be utilized effectively to achieve inference operation.


## V. EXPERIMENTS

In this section, we discuss the evaluation of the performance of OPIMA for PIM-based CNN inference acceleration. OPIMA adopts a main memory configuration of 4 banks, 64×64 subarrays per bank, with 256×512 OPCM elements and 256 MDLs per subarray. For evaluating OPIMA we rely on a modified NVMain 2.0 [61] for memory simulation followed by a Python-based performance analyzer, which makes use of the loss and energy parameters from detailed physics simulations and fabricated device characterizations summarized in Table I.
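To illustrate how the loss parameters in Table I enter such an analysis, the following sketch adds up a link budget in dB. It is not the analyzer used in this work: the component counts, the path length, and the detector sensitivity are hypothetical inputs, and only the per-component loss values and the SOA gain are taken from Table I.

```python
# Per-component optical losses in dB, from Table I.
LOSS_DB = {
    "directional_coupler": 0.02,
    "mr_drop": 0.5,
    "mr_through": 0.02,
    "bend_90deg": 0.01,
    "eo_mr_drop": 1.6,
    "eo_mr_through": 0.33,
}
PROPAGATION_DB_PER_CM = 0.1
SOA_GAIN_DB = 20.0


def path_loss_db(component_counts: dict, length_cm: float) -> float:
    """Total loss of one optical path: propagation plus per-component losses."""
    loss = PROPAGATION_DB_PER_CM * length_cm
    for name, count in component_counts.items():
        loss += LOSS_DB[name] * count
    return loss


def required_input_power_mw(sensitivity_dbm: float, loss_db: float,
                            gain_db: float = 0.0) -> float:
    """Laser power (mW) needed so the detector still receives `sensitivity_dbm`."""
    power_dbm = sensitivity_dbm + loss_db - gain_db
    return 10 ** (power_dbm / 10)


if __name__ == "__main__":
    # Hypothetical path: one coupler, one EO-tuned MR drop, ten EO-tuned MRs
    # passed on the through port, four 90-degree bends, 2 cm of waveguide.
    counts = {"directional_coupler": 1, "eo_mr_drop": 1,
              "eo_mr_through": 10, "bend_90deg": 4}
    loss = path_loss_db(counts, length_cm=2.0)
    print(round(loss, 2), required_input_power_mw(-20.0, loss))
```

Because the losses of a fixed layout do not change after design, a budget of this kind can be computed once per row, which is what makes row-wise loss-aware amplification practical.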

We compare OPIMA against several electronic and optical acceleration platforms along with the current state-of-the-art photonic PIM. For photonic accelerator systems, we consider the work in [32], named PhPIM in our comparison studies, which proposed a PIM-adjacent system, and CrossLight [41], a photonic CNN accelerator. CrossLight and PhPIM are modeled using the parameters in Table I, and considering 8 GB DDR5 DRAM, with 4800 megatransfers per second (MTS) data transfer rate, as their main memory.

We also consider Nvidia P100 GPU (referred to as NP100 in results), AMD EPYC 7742 CPU (referred to as E7742 in results), and Nvidia Jetson ORIN (a low-power embedded GPU for edge AI applications; referred to as ORIN in results), as our electronic platform comparison points. In addition, we consider the ReRAM-based PIM CNN accelerator PRIME [11] for comparison.



Fig. 7. Subarray group selection for OPIMA architecture.

## A. Subarray Grouping

The first experiment explores the OPIMA design space to determine the number of subarray groups, which in turn will determine the number of operations that can be performed per cycle, in OPIMA. This increase in parallelism trades off with the power consumption of the architecture. As the number of groups increases, the complexity of the interface required at the aggregation unit also increases, along with the laser power requirement to perform the operations. Simultaneously, we would like the maximum number of subarray rows to be accessible for main memory operations.

The OPIMA memory organization has 64 rows of subarrays per bank as mentioned earlier, which must be grouped as per the criteria discussed above. While considering the groups, we would like to avoid the extremes, i.e., the case with a single group or the case with each subarray row belonging to an individual group, resulting in 64 groups. A single group severely limits parallelism, and 64 groups imply that all 64 rows will be engaged in PIM operations, essentially preventing any main memory read/write operations.

Fig. 7 shows the normalized power, MAC throughput, and rows available for main memory operation, with changing number of subarray groups (*x*-axis). It can be observed that a configuration with 16 groups strikes a balance between achievable compute parallelism with reasonable power consumption and sufficient memory access without starvation. In addition, 16 subarray groups enable the maximum throughput efficiency (MAC/Watt) from OPIMA.
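The row-availability side of this tradeoff follows directly from the rule that one row of subarrays per group is engaged in PIM at a time. The sketch below enumerates it for 64 subarray rows; restricting the group count to powers of two is an assumption made here for illustration, and power and throughput are not modeled.

```python
def grouping_options(subarray_rows: int = 64) -> list[dict]:
    """Enumerate power-of-two group counts and the rows left for memory access.

    With one subarray row per group engaged in PIM, `groups` rows compute
    while the remaining rows stay available for main memory reads/writes.
    """
    options = []
    groups = 1
    while groups <= subarray_rows:
        options.append({
            "groups": groups,
            "rows_per_group": subarray_rows // groups,
            "pim_rows": groups,
            "memory_rows": subarray_rows - groups,
        })
        groups *= 2
    return options


if __name__ == "__main__":
    for option in grouping_options():
        print(option)
```

With 16 groups, 16 rows can compute concurrently while 48 rows remain available to the main memory; with 64 groups, no rows remain, which matches the extreme case described above.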

Our earlier analysis on mode conversion pointed to the fact that we can only have a maximum of four modes in our waveguide at the aggregation unit. Since we must rely on four modes only, to meet the demand of 16 groups, the modes can be reused. For enabling mode reuse, we use the same mode converter designs along the computation waveguides [Fig. 5(b)]. In addition, to prevent the same modes from interacting with each other, each of the four modes is assigned a separate multimode waveguide for transferring to the demux unit within the aggregation unit.

## B. OPIMA Power Breakdown

The power consumption breakdown of the resulting version of OPIMA is shown in Fig. 8. From this plot we can observe that the maximum power consumption is contributed by the MDL array and the electrical-optical interface, leading to


TABLE II VARIOUS MODELS CONSIDERED FOR OPIMA EVALUATION AND THEIR ACCURACY ACROSS QUANTIZATION LEVELS FOR CLASSIFYING THE SPECIFIED DATASETS

| Model       | Dataset         | Accuracy (fp32) | Accuracy (int8) | Accuracy (int4) | Parameter count     |
|-------------|-----------------|-----------------|-----------------|-----------------|---------------------|
| ResNet18    | CIFAR100 [57]   | 75.3%           | 74.2%           | 72.6%           | 11584865 (11.6 M)   |
| InceptionV2 | SVHN [58]       | 81.5%           | 80.8%           | 75.9%           | 2661960 (2.6 M)     |
| MobileNet   | CIFAR10 [57]    | 88.2%           | 87.5%           | 83.5%           | 4209088 (4.2 M)     |
| SqueezeNet  | STL-10 [59]     | 92.5%           | 90.3%           | 86.5%           | 1159848 (1.1 M)     |
| VGG16       | Imagenette [60] | 98.96%          | 96.25%          | 93.7%           | 134268738 (134.3 M) |



Fig. 8. Power breakdown for OPIMA architecture.

a maximum power consumption of 55.9 W, for both main memory and PIM operations running simultaneously.

## C. CNN Workload Accuracy and Latency Analyses

For workloads we considered four CNN models: 1) ResNet18 [53]; 2) InceptionV2 [54]; 3) MobileNet [55]; and 4) SqueezeNet [56]. The inference is performed for image classification of datasets, details of which are provided in Table II. We have considered 4-bit integer quantization using TensorRT, as this is the baseline MLC capacity. As the table shows, this level of quantization results in at most 6% loss in accuracy in the considered models. But this accuracy drop is model architecture-dependent, as can be seen in Table II. To showcase OPIMA's flexibility in handling parameter sizes, we have also considered 8-bit variants of the same models (Table II).
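The accuracy drops quoted above can be recomputed from the entries of Table II. The sketch below hard-codes the table values (including the VGG16 row) and reports the drop in percentage points; the dictionary layout and function names are illustrative.

```python
# Accuracy in percent as (fp32, int8, int4), copied from Table II.
ACCURACY = {
    "ResNet18": (75.3, 74.2, 72.6),
    "InceptionV2": (81.5, 80.8, 75.9),
    "MobileNet": (88.2, 87.5, 83.5),
    "SqueezeNet": (92.5, 90.3, 86.5),
    "VGG16": (98.96, 96.25, 93.7),
}


def accuracy_drop(model: str, quant: str = "int4") -> float:
    """Accuracy lost relative to fp32, in percentage points."""
    fp32, int8, int4 = ACCURACY[model]
    quantized = {"int8": int8, "int4": int4}[quant]
    return fp32 - quantized


def worst_case(quant: str = "int4") -> tuple:
    """Model with the largest accuracy drop and the size of that drop."""
    model = max(ACCURACY, key=lambda name: accuracy_drop(name, quant))
    return model, accuracy_drop(model, quant)


if __name__ == "__main__":
    for name in ACCURACY:
        print(name, round(accuracy_drop(name, "int8"), 2),
              round(accuracy_drop(name, "int4"), 2))
    print(worst_case("int4"))
```

The per-model spread in the 4-bit column makes the architecture dependence of the accuracy drop explicit.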

Before we go into further comparisons, we first analyze the performance of OPIMA using both the 4-bit and 8-bit quantized variants of the CNN models. A breakdown of OPIMA's latency in ms, as it processes these models, is provided in Fig. 9. Processing latency is the total time for processing the necessary MAC operations and the aggregation unit operation, i.e., all in-memory processing operations. The writeback latency refers to the latency incurred while applying the nonlinearities and writing back the results, i.e., output feature maps, back into OPIMA's main memory architecture. It can be observed that writeback is a significant contributor to latency in OPIMA. The PIM operations can leverage data within the memory and the high processing parallelism, leading to remarkably low processing times. However, the latency for the OPCM write operations needed to make the output feature maps available within the memory for further processing far outweighs the latency savings from the PIM operations. So, even though OPIMA can handle a variety of parameter sizes, given the OPCM write latencies, it is prudent



Fig. 9. Latency breakdown for OPIMA's 4-bit (4b) and 8-bit (8b) variants across the models from Table II.

to rely on 4-bit quantized models, while suffering some loss in accuracy, if throughput is significantly more important.

It can also be observed that OPIMA does not perform as one would expect for the far smaller InceptionV2 and MobileNet models when compared to ResNet18. Both models have higher processing latencies, with MobileNet having significantly higher processing latency than ResNet18. This is attributed to the $1 \times 1$ kernels in these models, which pose problems for the WDM-based MAC parallelization within OPIMA. Since the results from these operations do not have any further accumulation to be performed on them, they prevent the totality of the subarray row from being used. If more operations are performed, they will interfere with the results from the $1 \times 1$ kernels, leading to erroneous results. So, when these are encountered, OPIMA loses a significant portion of its parallel processing capabilities, especially when they are sequential in the CNN execution graph, like in the case of InceptionV2. MobileNet, though a larger model, offers higher parallelization opportunities, and hence performs at a similar latency, despite being $\sim 4 \times$ the size of InceptionV2.

Similarly, writeback is a significant contributor to overall latency, as discussed earlier. However, this is proportional to the sizes of the output feature maps generated by the model and not the computational complexity of the model. This is the reason MobileNet has lower writeback latency than processing latency, in comparison, and why InceptionV2 has an overall lower latency than ResNet18.
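The distinction between output feature-map size and computational complexity can be illustrated with a minimal sketch. The layer shapes, the `ConvLayer` class, and the `writeback_bits` helper below are hypothetical and chosen only for illustration; they are not taken from the evaluated models, and the bit count is a proxy for writeback volume rather than a model of OPCM write latency. The sketch assumes square feature maps and standard (non-grouped) convolutions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of a standard convolution layer."""
    in_size: int        # input height == width (square maps assumed)
    in_channels: int
    out_channels: int
    kernel: int         # kernel height == width
    stride: int = 1
    padding: int = 0

    def out_size(self) -> int:
        # Standard convolution output-size formula.
        return (self.in_size + 2 * self.padding - self.kernel) // self.stride + 1

    def output_elements(self) -> int:
        # Number of values in the output feature map (what must be written back).
        s = self.out_size()
        return s * s * self.out_channels

    def macs(self) -> int:
        # Multiply-accumulate count: one k*k*C_in dot product per output value.
        return self.output_elements() * self.kernel * self.kernel * self.in_channels


def writeback_bits(layers, bits_per_value: int) -> int:
    """Total bits of output feature maps produced at a given precision."""
    return sum(layer.output_elements() for layer in layers) * bits_per_value


# Illustrative layers only (assumed shapes, not from Table II).
conv3x3 = ConvLayer(in_size=16, in_channels=64, out_channels=64, kernel=3, padding=1)
conv1x1 = ConvLayer(in_size=16, in_channels=64, out_channels=128, kernel=1)

for name, layer in (("3x3", conv3x3), ("1x1", conv1x1)):
    print(name, "outputs:", layer.output_elements(), "MACs:", layer.macs())

print("4-bit writeback volume:", writeback_bits([conv3x3, conv1x1], 4), "bits")
print("8-bit writeback volume:", writeback_bits([conv3x3, conv1x1], 8), "bits")
```

In this example, the $1 \times 1$ layer produces twice as many output values as the $3 \times 3$ layer while needing far fewer MACs, so its writeback volume is larger even though its compute is smaller. Halving the precision halves the number of bits to be written under these assumptions.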

To further characterize the latency benefits of OPIMA, we compare it against the latency for the other photonic computing architectures we have considered, as shown in Fig. 10. The OPCM-based architectures (OPIMA, PhPIM) have better performance than CrossLight, because of the higher parallelism achievable in these architectures. PhPIM leverages the photonic tensor core operation from [15], along with an external DRAM acting as the actual main memory.



Fig. 10. Latency breakdown of CNN model inference across photonic architectures OPIMA (O), CrossLight (C), and PhPIM (P), for model-dataset pairs from Table II.



Fig. 11. EPB comparison across architectures.

PhPIM has opted for the faster yet energy-intensive electrical PCM programming mechanism, but the tensor core operation is still in the optical domain. The reprogramming, or writeback as we call it for an OPCM PIM, is significantly faster for PhPIM. However, OPIMA leverages the much higher parallelism inherent to a main memory, and available to a PIM architecture, enabling faster processing times. In addition, OPIMA does not have to access an external DRAM to fetch the data needed for processing; hence, it does not have any external data movement latencies associated with its operation. Note that the internal data movement latencies are factored into our writeback latency.

## D. Comparison Studies

In this section, we compare OPIMA against the various photonic and electronic acceleration platforms in terms of energy per bit (EPB) and throughput efficiency (FPS/W; FPS = frames per second).

On average, OPIMA achieves $78.3 \times$, $157.5 \times$, $1.7 \times$, $4.4 \times$, $2.2 \times$, and $137 \times$ better performance in terms of EPB over NP100, E7742, ORIN, PRIME, CrossLight, and PhPIM, respectively (Fig. 11). It should be noted that P100 can outperform OPIMA in terms of raw throughput, especially in the case of InceptionV2 and MobileNet, where the GPU threads are not constrained by the interference limitations of our WDM-based parallelization of operations. But OPIMA consumes significantly less power, which also leads to overall better throughput efficiency. In terms of FPS/W, OPIMA achieves $6.7 \times$, $15.2 \times$, $8.2 \times$, $5.7 \times$, $1.8 \times$, and $11.9 \times$ better performance over NP100, E7742, ORIN, PRIME, CrossLight, and PhPIM, respectively (Fig. 12).
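The two metrics, and the way a platform with lower raw throughput can still lead in throughput efficiency, can be sketched as follows. This is an illustration under stated assumptions: EPB is taken here as total energy divided by the number of bits processed, the function names are hypothetical, and all numeric values are made up for the example and are not measurements from any of the compared platforms.

```python
def energy_per_bit(total_energy_joules: float, bits_processed: int) -> float:
    """EPB, assumed here to be total energy divided by bits processed (J/bit)."""
    return total_energy_joules / bits_processed


def fps_per_watt(frames: int, elapsed_seconds: float, average_power_watts: float) -> float:
    """Throughput efficiency: frames per second divided by average power."""
    return (frames / elapsed_seconds) / average_power_watts


def improvement_factor(baseline: float, candidate: float, lower_is_better: bool) -> float:
    """How many times better the candidate is than the baseline.

    For EPB (lower is better) this is baseline / candidate;
    for FPS/W (higher is better) it is candidate / baseline.
    """
    return baseline / candidate if lower_is_better else candidate / baseline


# Hypothetical platforms: the first has twice the raw FPS but ten times the power.
high_power = fps_per_watt(frames=1000, elapsed_seconds=1.0, average_power_watts=250.0)
low_power = fps_per_watt(frames=500, elapsed_seconds=1.0, average_power_watts=25.0)

print("high-power platform FPS/W:", high_power)
print("low-power platform FPS/W:", low_power)
print("FPS/W improvement:", improvement_factor(high_power, low_power, lower_is_better=False))
```

With these made-up numbers, the low-power platform delivers half the frames per second yet comes out ahead in FPS/W, which is the same qualitative effect described above for the GPU comparison.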



Fig. 12. FPS/W comparison across architectures.

It can also be noted that though OPIMA and PhPIM had comparable latencies (Fig. 10), OPIMA is able to outperform PhPIM in these metrics. This is because of the energy-intensive EPCM write processes that accompany PhPIM operation (nJ), as opposed to OPIMA's OPCM reprogramming process (pJ).
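To make the nJ versus pJ gap concrete, the following is an order-of-magnitude sketch only. It assumes that both architectures reprogram the same number of cells $N$, and it takes $E_{\mathrm{EPCM}} \approx 1$ nJ and $E_{\mathrm{OPCM}} \approx 1$ pJ as representative per-write energies. These symbols and values are illustrative assumptions, not measured figures:

$$
\frac{E_{\mathrm{wb}}^{\mathrm{PhPIM}}}{E_{\mathrm{wb}}^{\mathrm{OPIMA}}} = \frac{N \, E_{\mathrm{EPCM}}}{N \, E_{\mathrm{OPCM}}} \approx \frac{10^{-9}\,\mathrm{J}}{10^{-12}\,\mathrm{J}} = 10^{3}
$$

Writeback is only one component of the total energy, so this per-write ratio should not be expected to equal the end-to-end EPB or FPS/W factors.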

In this work, we presented OPIMA, a high-throughput, low-latency, highly energy-efficient OPCM PIM architecture. OPIMA showcases how an OPCM main memory architecture can be rearchitected to achieve photonic PIM. Through device-level design to enhance efficiency and various architectural innovations, OPIMA compares favorably against electronic and photonic ML acceleration platforms. On average, OPIMA outperforms the considered architectures by $83.1 \times$ in terms of EPB and $27.5 \times$ in terms of FPS/W. It outperforms the state-of-the-art photonic PIM architecture PhPIM by $186 \times$ and $55.3 \times$ in these metrics, while achieving lower average latency, across several CNN models. OPIMA also opens the door for possible system-level integration of photonic PIM with dedicated photonic accelerators, such as those described in [20], [21], [22], and [41]. Such a system can benefit from both the higher bandwidth that OPIMA's main memory can provide and the computation support available through PIM.

#### REFERENCES

- [1] N. Patwardhan, S. Marrone, and C. Sansone, "Transformers in the real world: A survey on NLP applications," *Information*, vol. 14, no. 4, p. 242, 2023.
- [2] Q. An, S. Rahman, J. Zhou, and J. J. Kang, "A comprehensive review on machine learning in healthcare industry: Classification, restrictions, opportunities and challenges," *Sensors*, vol. 23, no. 4, p. 4178, 2023.
- [3] Z. Cao, K. Jiang, W. Zhou, S. Xu, H. Peng, and D. Yang, "Continuous improvement of self-driving cars using dynamic confidence-aware reinforcement learning," *Nat. Mach. Intell.*, vol. 5, pp. 145–158, Feb. 2023.
- [4] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in *Proc. ICLR*, 2016, pp. 1–14.
- [5] A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, "AI and memory wall," 2024, arXiv:2403.14123.
- [6] S.-L. Lu, T. Karnik, G. Srinivasa, K.-Y. Chao, D. Carmean, and J. Held, "Scaling the memory wall," in *Proc. IEEE/ACM ICCAD*, 2012, pp. 271–272.
- [7] K. Khan, S. Pasricha, and R. G. Kim, "A survey of resource management for processing-in-memory and near-memory processing architectures," *J. Low Power Electron. Appl.*, vol. 10, no. 4, p. 30, Sep. 2020.
- [8] M. He et al., "Newton: A DRAM-maker's accelerator-in-memory (AiM) architecture for machine learning," in *Proc. IEEE/ACM MICRO*, 2020, pp. 372–385.

- [9] S. Roy, M. Ali, and A. Raghunathan, "PIM-DRAM: Accelerating machine learning workloads using processing in commodity DRAM," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 11, no. 4, pp. 701–710, Dec. 2021.
- [10] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "DRISA: A DRAM-based reconfigurable in-situ accelerator," in *Proc. IEEE/ACM MICRO*, 2017, pp. 288–301.
- [11] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in *Proc. ISCA*, 2016, pp. 27–39.
- [12] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in *Proc. ISCA*, 2016, pp. 14–26.
- [13] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, "Computing in memory with spin-transfer torque magnetic RAM," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 26, no. 3, pp. 470–483, Mar. 2018.
- [14] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in *Proc. ISCA*, 2009, pp. 2–13.
- [15] J. Feldmann et al., "Parallel convolutional processing using an integrated photonic tensor core," *Nature*, vol. 589, pp. 52–58, Jan. 2021.
- [16] H. Zhu et al., "ELight: Towards efficient and aging-resilient photonic in-memory neurocomputing," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 42, no. 3, pp. 820–833, Mar. 2023.
- [17] Y. Chen, "ReRAM: History, status, and future," *IEEE Trans. Electron Devices*, vol. 67, no. 4, pp. 1420–1433, Apr. 2020.
- [18] P. Chi, S. Li, Y. Cheng, Y. Lu, S. H. Kang, and Y. Xie, "Architecture design with STT-RAM: Opportunities and challenges," in *Proc. IEEE ASP-DAC*, 2016, pp. 109–114.
- [19] F. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, "A survey on silicon photonics for deep learning," *ACM J. Emerg. Technol. Comput. Syst.*, vol. 17, no. 4, p. 61, 2021.
- [20] S. Afifi, F. Sunny, M. Nikdast, and S. Pasricha, "TRON: Transformer neural network acceleration with non-coherent silicon photonics," in *Proc. ACM GLSVLSI*, 2023, pp. 15–21.
- [21] F. P. Sunny, A. Mirza, M. Nikdast, and S. Pasricha, "ROBIN: A robust optical binary neural network accelerator," *ACM Trans. Embedded Comput. Syst.*, vol. 20, no. 5S, p. 57, 2021.
- [22] S. Afifi, F. Sunny, A. Shafiee, M. Nikdast, and S. Pasricha, "GHOST: A graph neural network accelerator using silicon photonics," *ACM Trans. Embedded Comput. Syst.*, vol. 22, no. 5S, p. 102, 2023.
- [23] F. Sunny, A. Shafiee, B. Charbonnier, M. Nikdast, and S. Pasricha, "COMET: A cross-layer optimized optical phase change main memory architecture," 2023, arXiv:2311.08566.
- [24] A. Shafiee, S. Pasricha, and M. Nikdast, "A survey on optical phase-change memories: The promise and challenges," *IEEE Access*, vol. 11, pp. 11781–11803, 2023.
- [25] A. Shafiee, B. Charbonnier, S. Pasricha, and M. Nikdast, "Design-space exploration in PCM-based photonic memory," in *Proc. ACM GLSVLSI*, 2023, pp. 533–538.
- [26] Y. Choi et al., "A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth," in *Proc. IEEE ISSCC*, 2012, pp. 46–48.
- [27] D. Loke et al., "Breaking the speed limits of phase-change memory," *Science*, vol. 336, no. 6088, pp. 1566–1569, 2012.
- [28] A. Chen, "A review of emerging non-volatile memory (NVM) technologies and applications," *Solid-State Electron.*, vol. 125, pp. 25–38, Nov. 2016.
- [29] A. Pirovano, A. L. Lacaita, A. Benvenuti, F. Pellizzer, and R. Bez, "Electronic switching in phase-change memories," *IEEE Trans. Electron Devices*, vol. 51, no. 3, pp. 452–459, Mar. 2004.
- [30] I. G. Thakkar and S. Pasricha, "DyPhase: A dynamic phase change memory architecture with symmetric write latency and restorable endurance," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 37, no. 9, pp. 1760–1773, Sep. 2018.
- [31] A. Narayan, Y. Thonnart, P. Vivet, A. Coskun, and A. Joshi, "Architecting optically controlled phase change memory," *ACM Trans. Archit. Code Optim.*, vol. 19, no. 4, pp. 1–26, 2022.
- [32] G. Yang, C. Demirkiran, Z. E. Kizilates, C. A. R. Ocampo, A. K. Coskun, and A. Joshi, "Processing-in-memory using optically-addressed phase change memory," in *Proc. ACM/IEEE ISLPED*, 2023, pp. 1–6.
- [33] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Broadcast and weight: An integrated network for scalable photonic spike processing," *J. Lightw. Technol.*, vol. 32, no. 21, pp. 3427–3439, Nov. 1, 2014.

- [34] M. M. Masnad, G. Zhang, D.-X. Xu, Y. Grinberg, and O. Liboiron-Ladouceur, "Fabrication error tolerant broadband mode converters and their working principles," *Opt. Exp.*, vol. 30, no. 14, pp. 25817–25829, 2022.
- [35] H. Xu, D. Dai, and Y. Shi, "Silicon integrated nanophotonic devices for on-chip multi-mode interconnects," *App. Sci.*, vol. 12, no. 10, p. 6365, 2020.
- [36] C. Li, D. Liu, and D. Dai, "Multimode silicon photonics," *Nanophotonics*, vol. 8, no. 2, pp. 227–247, 2019.
- [37] "Ansys lumerical." Accessed: Mar. 28, 2024. [Online]. Available: https://www.lumerical.com/products/
- [38] "Lumopt." Accessed: Mar. 8, 2024. [Online]. Available: https://github.com/chriskeraly/lumopt.git
- [39] F. Sunny, M. Nikdast, and S. Pasricha, "A silicon photonic accelerator for convolutional neural networks with heterogeneous quantization," in *Proc. ACM GLSVLSI*, 2022, pp. 367–371.
- [40] P. Dong, C. Xie, L. Chen, N. K. Fontaine, and Y.-K. Chen, "Experimental demonstration of microring quadrature phase-shift keying modulators," *Opt. Lett.*, vol. 37, no. 7, pp. 1178–1180, 2012.
- [41] F. Sunny, A. Mirza, M. Nikdast, and S. Pasricha, "CrossLight: A cross-layer optimized silicon photonic neural network accelerator," in *Proc. IEEE/ACM DAC*, 2021, pp. 1069–1074.
- [42] Z. Lu, D. Celo, P. Dumais, E. Bernier, and L. Chrostowski, "Comparison of photonic 2×2 3-dB couplers for 220 nm silicon-on-insulator platforms," in *Proc. IEEE GFP*, 2015, pp. 57–58.
- [43] M. R. Yahya, N. Wu, Z. Fang, F. Ge, and M. H. Shah, "A low insertion loss 5×5 optical router for mesh photonic network-on-chip topology," in *Proc. IEEE CSUDET*, 2019, pp. 164–169.
- [44] S. Pasricha and S. Bahirat, "OPAL: A multi-layer hybrid photonic NoC for 3D ICs," in *Proc. IEEE ASP-DAC*, 2011, pp. 345–350.
- [45] L. Zhang et al., "New-generation silicon photonics beyond the single mode regime," 2021, arXiv:2104.04239.
- [46] M. Bahadori, M. Nikdast, Q. Cheng, and K. Bergman, "Universal design of waveguide bends in silicon-on-insulator photonics platform," *J. Lightw. Technol.*, vol. 37, no. 10, pp. 3044–3054, Jul. 1, 2019.
- [47] A. W. Poon, X. Luo, F. Xu, and H. Chen, "Cascaded microresonator-based matrix switch for silicon on-chip optical interconnection," *Proc. IEEE*, vol. 97, no. 7, pp. 1216–1238, Jul. 2009.
- [48] Z. Fang et al., "Ultra-low-energy programmable non-volatile silicon photonics based on phase-change materials with graphene heaters," *Nat. Nanotechnol.*, vol. 17, no. 8, pp. 842–848, 2022.
- [49] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in *Proc. IEEE ISSCC*, vol. 57, 2014, pp. 10–14.
- [50] D. Li, X. Zhao, Y. Shen, S. Liu, and Z. Zhu, "A 7-bit 3.8-GS/s 2-way time-interleaved 4-bit/Cycle SAR ADC 16× time-domain interpolation in 28-nm CMOS," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 70, no. 9, pp. 3557–3566, 2023.
- [51] T. O. Dickson et al., "A 72-GS/s, 8-Bit DAC-based wireline transmitter in 4-nm FinFET CMOS for 200+ Gb/s serial links," *IEEE J. Solid-State Circuits*, vol. 58, no. 4, pp. 1074–1086, Apr. 2023.
- [52] "Cacti." Accessed: Nov. 6, 2024. [Online]. Available: https://github.com/HewlettPackard/cacti
- [53] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. CVPR*, 2016, pp. 770–778.
- [54] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," 2015, arXiv:1512.00567.
- [55] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
- [56] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," 2016, arXiv:1602.07360.
- [57] "CIFAR100 and CIFAR10 datasets." Accessed: Nov. 6, 2024. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
- [58] "SVHN dataset." Accessed: Nov. 6, 2024. [Online]. Available: http://ufldl.stanford.edu/housenumbers/
- [59] "STL10 dataset." Accessed: Nov. 6, 2024. [Online]. Available: https://cs.stanford.edu/~acoates/stl10/
- [60] "Imagenette dataset." Accessed: Nov. 6, 2024. [Online]. Available: https://github.com/fastai/imagenette
- [61] M. Poremba, T. Zhang, and Y. Xie, "NVMain 2.0: A user-friendly memory simulator to model (non-) volatile memory systems," *IEEE Comput. Archit. Lett.*, vol. 14, no. 2, pp. 140–143, Jul.–Dec. 2015.