# **ROI-HIT:** Region of Interest-Driven High-Dimensional Microarchitecture Design Space Exploration

Xuyang Zhao<sup>®</sup>, Tianning Gao<sup>®</sup>, Aidong Zhao<sup>®</sup>, Zhaori Bi<sup>®</sup>, Member, IEEE, Changhao Yan<sup>®</sup>, Member, IEEE, Fan Yang<sup>10</sup>, Member, IEEE, Sheng-Guo Wang<sup>10</sup>, Life Senior Member, IEEE, Dian Zhou, Senior Member, IEEE, and Xuan Zeng<sup>(D)</sup>, Senior Member, IEEE

Abstract-Exploring the design space of RISC-V processors 2 faces significant challenges due to the vastness of the high-3 dimensional design space and the associated expensive simulation 4 costs. This work proposes a region of interest (ROI)-driven 5 method, which focuses on the promising ROIs to reduce the 6 over-exploration on the huge design space and improve the 7 optimization efficiency. A tree structure based on self-organizing 8 map (SOM) networks is proposed to partition the design 9 space into ROIs. To reduce the high dimensionality of design 10 space, a variable selection technique based on a sensitivity 11 matrix is developed to prune unimportant design parameters 12 and efficiently hit the optimum inside the ROIs. Moreover, an 13 asynchronous parallel strategy is employed to further save the 14 time taken by simulations. Experimental results demonstrate the 15 superiority of our proposed method, achieving improvements of 16 up to 43.82% in performance, 33.20% in power consumption, 17 and 11.41% in area compared to state-of-the-art methods.

Index Terms-Asynchronous parallel optimization, high-18 19 dimensional design space exploration (DSE), region of interest 20 (ROI), RISC-V microarchitecture, variable selection (VS).

#### 21

#### I. INTRODUCTION

▼ USTOMIZED microprocessors in embedded systems 22 ✓ are in demand for a wide range of applications, 23 24 such as mobile devices, medical equipment, and automotive 25 systems [1]. Application-specific performance requirements 26 necessitate tailored microprocessor designs. For instance,

Manuscript received 4 August 2024; accepted 4 August 2024. This work was supported in part by the National Natural Science Foundation of China (NSFC) Research Projects under Grant 62141407, Grant 62304052, and Grant 92373207. This article was presented at the International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS) 2024 and appeared as part of the ESWEEK-TCAD Special Issue. This article was recommended by Associate Editor S. Dailey. (Corresponding authors: Zhaori Bi; Xuan Zeng.)

Xuyang Zhao, Aidong Zhao, Zhaori Bi, Changhao Yan, Fan Yang, and Xuan Zeng are with the School of Microelectronics, State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai 200433, China (e-mail: zhaori\_bi@fudan.edu.cn; xzeng@fudan.edu.cn).

Tianning Gao is with the Department of Electrical Engineering, University of Texas at Dallas, Dallas, TX 75080 USA.

Sheng-Guo Wang is with the School of Microelectronics, State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai 200433, China and also with the College of Engineering, The University of North Carolina at Charlotte, Charlotte, NC 28223 USA.

Dian Zhou is with the School of Microelectronics. State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai 200433, China, and also with the Department of Electrical Engineering, University of Texas at Dallas, Dallas, TX 75080 USA.

Digital Object Identifier 10.1109/TCAD.2024.3443006

battery-powered devices prioritize low-power consumption for 27 extended operation. Besides, the integration of microproces-28 sors into system-on-chips (SoCs) requires them to harmonize 29 with other system components and achieve target performance 30 within constrained resources. 31

However, a tradeoff between performance metrics exists: 32 enhancements in clock frequency often incur increased power 33 consumption or enlarged die areas. So, when application 34 requirements change, designs should be reoptimized. To 35 address this challenge, multiobjective optimization consid-36 ering performance, power, and area (PPA) is introduced. 37 Multiobjective optimization empowers designers to select the 38 most suitable solution based on their specific performance 39 requirements. When these requirements change, designers can 40 simply select a different design from the Pareto optimal set 41 without the need for reoptimization.

In industry, it is crucial to meet strict and predetermined 43 deadlines for product delivery. In recent years, RISC-V, an 44 emerging instruction set architecture (ISA), has gained lots of 45 attention in both academia and industry because of its open-46 source licenses and flexibility [2]. To achieve the agile design 47 of the RISC-V SoC, the Berkeley architecture research (BAR) Group develops the Chipyard [3] framework. It uses Chisel [4] 49 hardware construction language to generate the synthesizable 50 and parameterizable processor cores (such as BOOM [5], [6] 51 and Rocket [7]) in the SoC. This enables the generation 52 of corresponding register-transfer level (RTL) designs for 53 processor cores based on specified architecture parameters. 54 So, there is a growing emphasis on identifying the Pareto 55 optimal design parameters, namely, design space exploration 56 (DSE) [1], in the early stages of the design process. By 57 utilizing multiobjective optimization, DSE provides insights 58 for subsequent design iterations.

Usually, processor simulators, such as GEM5 [8], are used 60 to obtain the PPA metrics for microarchitectures. However, 61 there exists a disparity between simulations and actual results. 62 To achieve PPA results that closely resemble the actual hard-63 ware, the Chipyard framework leverages the HAMMER [9] 64 framework for automated VLSI flow. This framework utilizes 65 electronic design automation (EDA) tools for RTL simulation, 66 logic synthesis, power analysis, and other tasks, ensuring 67 the accuracy of PPA results. Chipyard supports automated 68 generation from Chisel to layout. The flow incorporates the

1937-4151 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

<sup>70</sup> impact of various parameters of EDA tools, resulting in a
 <sup>71</sup> parameter space exceeding 50 dimensions and a design space
 <sup>72</sup> of approximately 10<sup>30</sup>.

However, the high dimensionality poses challenges for
multiobjective optimization algorithms, as their computational
complexity typically increases linearly or exponentially with
the number of dimensions. This high dimensionality leads to
long modeling and optimization time. Moreover, given a flow,
including RTL netlist generation with Chipyard BOOM generator, RTL simulation with Verilator [10], and logic synthesis
with Cadence Genus, it typically requires 1-6 h of execution
time. Consequently, the DSE process with hundreds of evaluation would take months of time. Therefore, it is imperative
to develop high-dimensional microarchitecture DSE methods
that can achieve satisfactory results with minimal evaluation

Evolutionary-based approaches, such as NSGA-II [11] 86 <sup>87</sup> and MOEA/D [12], are commonly used multiobjective 88 optimization methods. These methods generate a population 89 of candidate solutions and iteratively refine them through 90 crossover, mutation, and selection operators. However, they <sup>91</sup> require a substantial number of simulation evaluations, which impractical for the DSE based on the VLSI flow. In 92 is 93 order to conduct the DSE within an acceptable time, <sup>94</sup> researchers [13], [14] have utilized Bayesian optimization 95 (BO). BO trains surrogate models to approximate the PPA 96 results. These models are then used to optimize an acquisition 97 function and identify promising candidate solutions for actual <sup>98</sup> simulation. This approach significantly reduces the number <sup>99</sup> of required simulations and improves sampling efficiency. 100 BOOM-Explorer [13] proposes an efficient initial sampling 101 method MicroAL via the characteristics of microarchitecture 102 and utilizes the Gaussian process with deep kernel learning 103 (DKL-GP) as surrogate models. BOOM-Explorer leverages the <sup>104</sup> Chipyard framework, making it a valuable tool for exploring 105 not only the BOOM core but also other RISC-V cores. GRL-106 DSE [14] introduces graph representation learning, which 107 leverages a graph neural network (GNN) to project the <sup>108</sup> microarchitecture design space into graph embedding space. While these methods effectively model the microarchitec-109 110 ture design space and reduce simulation requirements, both 111 MicroAL and the GNN require training on the entire design <sup>112</sup> space. Consequently, they struggle to handle the vast design <sup>113</sup> space encountered in high-dimensional DSE problems.

Recent researches [15], [16], [17], [18], [19] have been 114 115 proposed to address high-dimensional optimization problems. <sup>116</sup> Trust region-based methods [15], [16] aim to tackle the vast <sup>117</sup> design space. TuRBO [15] employs local models within trust <sup>118</sup> regions instead of a global model to mitigate overemphasized 119 global exploration. MORBO [16] leverages hypervolume con-120 tribution (HVC) to evaluate data point quality and extends 121 TuRBO from single-objective to multiobjective optimization. 122 Despite reducing the design space size, these local region-123 based algorithms still face high dimensionality. The modeling 124 and optimization time complexity increases linearly or faster 125 with the number of dimensions. Moreover, the existing data 126 point distribution is not considered when determining the trust 127 region size. This can result in trust regions with insufficient 128 data points, leading to inaccurate models.

Embedding-based methods [17], [18] have been proposed 129 to reduce the dimensionality of the search space. REMBO [17] 130 leverages a random embedding matrix to map the high- 131 dimensional space to a lower-dimensional one, resulting in 132 faster optimization. BODi [18] employs dictionary embedding 133 based on Hamming distance for dimensionality reduction and 134 improved optimization efficiency. However, these methods 135 require setting a fixed effective dimension for the embedding 136 space in advance, which can be challenging in practice. 137 Additionally, different microarchitecture design parameters 138 have varying influences on performance outcomes. The afore- 139 mentioned methods only consider the design space when 140 training the embedding matrix, neglecting the correlations 141 between design space and performance space and the dif- 142 ferent importance of different design variables. Recently, 143 REMOTune [19] combines MORBO [16] and REMBO [17] 144 to reduce both size and dimensionality of design space. 145 Nonetheless, it still focuses solely on the design space, 146 overlooking the distribution and performance of existing data 147 points. 148

Moreover, with the popularity of multicore devices and 149 cloud computing, parallel/batch simulation techniques can be 150 used to conduct more simulations within a limited time budget, 151 leading to better solutions. However, the simulation time 152 can vary significantly based on different microarchitecture 153 parameters. Existing batch optimization methods, such as 154 MORBO [16] and REMOTune [19], are synchronous. It means 155 these methods must wait for the slowest simulation in the batch 156 to return before starting the next optimization iteration, which 157 results in idle devices and wasted time. 158

This article proposes ROI-HIT, a region of interest (ROI)driven high-dimensional microarchitecture DSE algorithm. <sup>160</sup> By focusing on the promising ROIs, we reduce the overexploration on the vast design space and advance the Pareto front (PF) more quickly. Consequently, we can obtain superior results within a constrained time budget. To further shorten the optimization time, we prune unimportant variables via a sensitivity matrix and reduce the number of dimensions used for modeling and optimization. For time-consuming VLSI flow simulations, an asynchronous parallel strategy is employed. <sup>168</sup> Our contributions are summarized as follows. <sup>169</sup>

- We propose a novel space partitioning method based 170 on a self-organizing map (SOM) tree to address the 171 vast design space of high-dimensional microarchitec- 172 ture DSE. A SOM network maps the high-dimensional 173 design space into a 2-D feature space maintaining 174 neighbor relationships and density. We define the feature 175 regions covering the current Pareto set (PS) as ROIs, 176 which are more promising to advance the PF. By 177 exploiting the ROIs, we reduce the exploration for poor 178 designs on the vast high-dimensional design space and 179 improve the exploration efficiency. 180
- We propose a variable selection (VS) method based on a 181 sensitivity matrix to address the high dimensionality of 182 DSE. To reduce the number of dimensions for modeling 183 and optimization, a sensitivity matrix is calculated inside 184 an ROI. ROI-HIT identifies important design variables 185 according to the sensitivity matrix and merges unimportant variables into a 1-D auxiliary variable. Because 187

of the reduced dimensions, the runtime of ROI-HIT isshortened.

We propose an asynchronous parallel/batch framework 3) 190 to address the time-consuming simulations of DSE. 191 Considering the tradeoffs between the PPA results, the 192 expected hypervolume improvement (EHVI) acquisition 193 function is optimized to choose the points to be simu-194 lated. Since the simulations based on the VLSI flow are 195 independent and spend lots of time, we will simulate 196 the next design if there is an idle worker, regardless of 197 whether all the simulations have been completed. This 198 mechanism further enhances the optimization efficiency. 199 The experimental results demonstrate that ROI-HIT outper-200

<sup>201</sup> forms the state-of-the-art methods, achieving 3.47%-9.19%<sup>202</sup> improvement of hypervolume (HV) within the same time <sup>203</sup> budget and  $3.58\times-341.68\times$  speed-up of time in reaching <sup>204</sup> the same HV. In terms of PPA metrics, ROI-HIT achieves <sup>205</sup> improvements of up to 43.82% in performance, 33.20% in <sup>206</sup> power consumption, and 11.41% in area. Compared with the <sup>207</sup> official designs of RISC-V BOOM and Rocket microarchi-<sup>208</sup> tecture, ROI-HIT can obtain 18.13%-30.05% improvement of <sup>209</sup> power with higher performance and smaller area.

The remainder of this article is structured as follows: 211 Section II introduces the problem formulation and the basics 212 of BO. Section III details the ROI-HIT algorithm. Section IV 213 presents the experimental results and compares the per-214 formances of our proposed algorithm with state-of-the-art 215 algorithms. Section V concludes this article.

#### II. BACKGROUND

#### 217 A. Problem Formulation

<sup>218</sup> The design space of microarchitecture is explored for <sup>219</sup> better PPAs. So, the DSE problem can be formulated as a <sup>220</sup> multiobjective optimization problem in

221

216

minimize 
$$f^{(1)}(\mathbf{x}), \dots, f^{(M)}(\mathbf{x})$$
 (1)

where x represents input parameters and  $f^{(i)}(\cdot)$  represents pobjective functions. In this article, x involves the design parameters, such as *FetchWidth* and *DecodeWidth*, while  $f^{(i)}$ includes power consumption, area, and the average cycle per per instruction (CPI).

For DSE problems, it is not straightforward to determine the superiority of one design over another based solely on individual objective metrics, such as CPI, power consumption, or area. A design with a lower CPI may have higher-power consumption and a larger area, while a design with lowerpower consumption and a smaller area may have a higher power consumption and a smaller area may have a higher concept of *domination* is introduced. For arbitrary data points a and b,  $a \prec b$  (*a dominates b*) if

236 
$$\forall i \in \{1, \dots, M\} \quad f^{(i)}(\boldsymbol{a}) \leq f^{(i)}(\boldsymbol{b})$$
  
237 and  $\exists i \in \{1, \dots, M\} \quad f^{(i)}(\boldsymbol{a}) < f^{(i)}(\boldsymbol{b}).$   
238 (2)

<sup>239</sup> The multiobjective optimization yields a PS, as defined in

PS = {
$$\boldsymbol{x} \in X \mid \nexists \ \boldsymbol{x'} \in X, \ \boldsymbol{x'} \prec \boldsymbol{x}$$
} (3)



Fig. 1. Illustration of HV, HVI, and HVC.

where X denotes input space. The entire nondominated  $_{241}$  performance space is referred to as PF in  $_{242}$ 

$$PF = \left\{ f^{(i)}(\mathbf{x}) \mid \mathbf{x} \in PS, \ i = 1, \dots, M \right\}.$$
(4) 243

Namely, all the points in the PS are (Pareto) optimal for DSE 244 problems. 245

To evaluate the quality of PF in multiobjective optimization, <sup>246</sup> HV is a commonly used performance indicator. As Fig. 1 <sup>247</sup> illustrates, given a reference point *r* and a finite PF, the HV <sup>248</sup> is the *M*-dimensional Lebesgue measure  $\lambda_M$  of the space <sup>249</sup> dominated by PF, as defined in <sup>250</sup>

$$HV(PF, \mathbf{r}) = \lambda_M \left( \bigcup_{j=1}^{|PF|} \left[ \mathbf{r}, \mathbf{y}_j \right] \right)$$
(5) 25'

where  $[r, y_i]$  denotes the hyper-rectangle bounded by r and  $y_i$ , <sup>252</sup> with  $\{y_j\}_{j=1}^{|PF|}$  comprising the PF. A larger HV corresponds to a <sup>253</sup> better PF, namely a better optimization result. <sup>254</sup>

To compare the importance of various points in the PS, 255 HVC is defined as 256

$$HVC(\boldsymbol{x}_i|PF, \boldsymbol{r}) = HV(PF, \boldsymbol{r}) - HV(PF \setminus \boldsymbol{f}(\boldsymbol{x}_i), \boldsymbol{r}). \quad (6) \quad {}_{257}$$

Similarly, to select the best-candidate point, HV improvement (HVI) of a candidate point  $x^*$  is defined as follows: 259

$$HVI(\boldsymbol{x}^*|PF, \boldsymbol{r}) = HV(PF \cup \boldsymbol{f}(\boldsymbol{x}^*), \boldsymbol{r}) - HV(PF, \boldsymbol{r}).$$
(7) 260

Fig. 1 is an example of a bi-objective optimization. The purple area is the HVC of  $x_i$ , which is the exclusive contribution 262 of  $x_i$  to the PF, and the yellow area is the HVI of  $x^*$ , which 263 is the improvement brought by  $x^*$  to the PF. 264

## B. Bayesian Optimization 265

BO comprises two fundamental components: 1) the surrogate model and 2) the acquisition function. The surrogate 267 model integrates existing prior knowledge and newly acquired 268 data points to estimate a posterior distribution with predictive 269 mean and uncertainty. Subsequently, the acquisition function 270 is optimized with respect to the model to identify the most 271 promising candidate for simulation, striking a balance between 272 exploring unknown regions and exploiting local optima. 273

One commonly used surrogate model is Gaussian process 274 regression (GPR) model [20]. Given a *d*-dimensional input 275

| Algorithm 1: Multiobjective BO Framework                            |  |  |  |  |  |  |  |  |
|---------------------------------------------------------------------|--|--|--|--|--|--|--|--|
| <b>Input</b> : The size of the initial dataset $N_{init}$ , and the |  |  |  |  |  |  |  |  |
| maximum number of iterations $N_{iter}$                             |  |  |  |  |  |  |  |  |
| 1 Generate an initial dataset $D_0 = \{X, Y\}$ by randomly          |  |  |  |  |  |  |  |  |
| sampling, where $Y = \{y^{(i)}   i = 1,, M\};$                      |  |  |  |  |  |  |  |  |
| 2 for $t = 0 \rightarrow N_{iter} - 1$ do                           |  |  |  |  |  |  |  |  |
| 3 Construct and train $M$ GPR models with $D_t$ ;                   |  |  |  |  |  |  |  |  |
| 4 Select the next candidate point $x_t$ by optimizing the           |  |  |  |  |  |  |  |  |
| acquisition function $\alpha(\mathbf{x}; D_t)$ ;                    |  |  |  |  |  |  |  |  |
| 5 Simulate the candidate $x_t$ and get the result $y_t$ ;           |  |  |  |  |  |  |  |  |
| 6 Combine $\{D_t, \{x_t, y_t\}\}$ to form the dataset $D_{t+1}$ ;   |  |  |  |  |  |  |  |  |
| 7 end                                                               |  |  |  |  |  |  |  |  |



<sup>276</sup> vector  $\mathbf{x}$ , assume that  $y = f(\mathbf{x}) + \epsilon$ , where  $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$ <sup>277</sup> denotes the observation noise. The dataset  $D = \{X, y\}$  is <sup>278</sup> formed by N observations, where X is a set of input vectors <sup>279</sup>  $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ , and  $\mathbf{y}$  is the corresponding outputs  $\mathbf{y} =$ <sup>280</sup>  $\{y_1, \dots, y_N\}$ .

A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is fully characterized by its mean function  $m(\cdot)$  and its kernel or covariance function  $k(\cdot, \cdot)$ . If f follows a GP, the GPR model can provide posterior distribution for an arbitrary location  $x^*$  as follow [20]:

$$\begin{cases} \mu(\mathbf{x}^{*}) = m(\mathbf{x}) + k(\mathbf{x}^{*}, X) [K + \sigma_{n}^{2}I]^{-1}(\mathbf{y} - m(\mathbf{x})) \\ \sigma^{2}(\mathbf{x}^{*}) = k(\mathbf{x}^{*}, \mathbf{x}^{*}) - k(\mathbf{x}^{*}, X) [K + \sigma_{n}^{2}I]^{-1}k(X, \mathbf{x}^{*}) \end{cases}$$
(8)

where  $\mu(\mathbf{x}^*)$  is the predictive mean,  $\sigma(\mathbf{x}^*)$  represents the uncertainty estimation, i.e., the standard deviation,  $k(\mathbf{x}^*, X) = k(\mathbf{x}^*, \mathbf{x}_i)|i = 1, ..., N$ , and  $k(X, \mathbf{x}^*) = k(\mathbf{x}^*, X)^T$  and K = k(X, X). When no prior information about  $f(\mathbf{x})$  is available,  $m(\mathbf{x})$  is always set to zero. In this work, we set  $m(\mathbf{x}) = 0$  and the kernel function as the Matern52 kernel [20].

BO optimizes the acquisition function to acquire the next simulation point. Several commonly used single-objective acquisition functions include lower-confidence bound (LCB), probability of improvement (PI), and expected improvement (EI) [20]. EHVI [21] is an acquisition function that is specified for multiobjective optimization. The EHVI represents the expectation of HVI over the posterior of the GPR models, typically approximated using Monte Carlo integration [22]

$$\alpha_{\text{EHVI}}(X_{\text{cand}}|\text{PF}, \mathbf{r}) \approx \frac{1}{N} \sum_{t=1}^{N} \text{HVI}(\mathbf{f}_{t}(X_{\text{cand}})|\text{PF}, \mathbf{r})$$
(9)

where  $f_t(\cdot)$ , t = 1, ..., N are sampled from the GPR models. To sum up, a general multiobjective BO framework is shown and in Algorithm 1.

#### 306

## III. PROPOSED METHOD

## 307 A. Overview of ROI-HIT Framework

Our proposed ROI-HIT framework is presented in Fig. 2. The green part represents the stage where ROI-HIT acquires ROIs in the global design space. The red part represents the I local BO inside an ROI, and the blue part represents the



Fig. 2. Framework of ROI-HIT method.

asynchronous parallel strategy. In the global stage, a SOM 312 tree is constructed for space partitioning. At each level of 313 the SOM tree, a SOM network is trained on the current 314 dataset to map the high-dimensional design space into a 2-D 315 feature space. The SOM network ensures that neighboring 316 points in the design space remain adjacent in the feature space. 317 The regions associated with the PS in the feature space are 318 defined as ROIs. By focusing on ROIs, ROI-HIT can reduce 319 the over-exploration in the vast design space and advance the 320 PF faster. Inside each ROI, ROI-HIT calculates a sensitivity 321 matrix via the data points and uses it to prune the unimportant 322 variables. Because of the reduced dimensions for modeling 323 and optimization, the optimization time to choose candidates 324 is shortened. In order to save the simulation time, ROI-HIT 325 supports asynchronous parallel simulation. A new round of 326 optimization will begin when there are idle workers, regardless 327 of the status of the ongoing simulation. After reaching the 328 maximum runtime budget, ROI-HIT returns the optimal PS. 329 The combination of ROI exploitation, unimportant variable 330 pruning, and asynchronous parallel simulation significantly 331 enhances the exploration efficiency. 332

#### B. SOM-Tree-Based ROI Acquisition Method

For the high-dimensional DSE, the simulation points are <sup>334</sup> sparse relative to the vast design space. Besides, the commonly <sup>335</sup> used GPR models have the implicit homogeneity [15], while <sup>336</sup> the design space is heterogeneous. So, it is hard for the global <sup>337</sup> models on simulation results to give accurate estimates of <sup>338</sup> the entire space. At the same time, the space far from the <sup>339</sup> simulated points has a large uncertainty, which makes the <sup>340</sup> optimization method tend to over-explore. To address these <sup>341</sup> problems, ROI-HIT exploits the more promising ROIs to <sup>342</sup> reduce the over-exploration and advance the PF faster. <sup>343</sup>

333

RTL netlists of RISC-V SoCs can be generated by a <sup>344</sup> dedicated generator based on specific architecture parameters. <sup>345</sup> So, significant variations in these parameters can lead to <sup>346</sup> substantial differences in the generated netlists, resulting in <sup>347</sup> considerable changes to the corresponding PPA characteristics. <sup>348</sup> On the other hand, similar design parameters exhibit similar <sup>349</sup> performance, as Fig. 3(a) and (b) show. Meanwhile, models <sup>350</sup> can give more accurate estimates of the regions near the <sup>351</sup> sampled points than the others. As a result, we designate <sup>352</sup> the regions near the PS as ROIs and conduct local BO inside <sup>353</sup> the ROIs to generate the candidates to be simulated. <sup>354</sup>



Fig. 3. Illustration of (a) PPA space, (b) high-dimensional design space, (c) SOM network, and (d) ROIs in 2-D feature space. In (a), we prioritize the regions near the current PF, shown as orange and blue regions, because they are more promising to advance the PF. These regions correspond to the orange and blue irregular design spaces in (b). To acquire the ROIs, ROI-HIT employs a SOM network, as shown in (c), to map the original design space in (b) to a 2-D feature space in (d), while the neighbors in the former are still neighbors in the latter. In (d), the feature regions, including the points in the PS (PS regions, shown as the regions with stars) and their adjoining areas (Adj. Regions, shown as the regions with circles), are designated as ROIs. If different ROIs have overlapping feature regions, shown as the regions with triangles, these ROIs will be combined into a single ROI covering all regions. By exploiting the ROIs, ROI-HIT is able to reduce the over-exploration in the vast design space and advance the PF faster.

(10)

1) SOM Network: To acquire the ROIs, a SOM network [23], [24] is used to map the high-dimensional design space into a 2-D feature space. SOM is an artificial network commonly utilized for high-dimensional visualization. It can ensure that neighbors in the highdimensional space remain neighbors in the low-dimensional feature space [24], as shown in Fig. 3.

Moreover, the existing local BO methods based on trust 362 <sup>363</sup> regions [15], [16], [19] solely modify the range of the local region without considering the distribution of simulated data 364 365 points. This may result in an insufficient number of simulated points in the local region, leading to inaccurate modeling of 366 that area. Consequently, more iterations of optimization may 367 be required to adjust the range of the trust region, potentially 368 wasting time and computational resources. However, SOM 369 networks consider the distribution of the dataset and are 370 trained to have similar densities in each region [24]. So, 371 the ROI-HIT based on SOM networks can provide a better 372 guarantee of sufficient local dataset size and accurate modeling 373 compared to methods based on trust regions. 374

After the SOM network maps the original design space are to a 2-D space, we label the feature regions, including the current PS and their adjacent regions as an ROI, as presented in Fig. 3(d). If different ROIs have overlapping regions, they are merged into a single ROI covering all regions. Because the feature regions obtained by the SOM network maintain the neighbor relationships in the design space, ROI-HIT can exploit the promising local regions covering the PS.

Any new point in the high-dimensional input space can be categorized by the trained SOM network to a region in the 2-D feature space. Candidate points obtained through acquisition function optimization should be assessed by the SOM network, and only those inside the ROIs are retained for subsequent optimization.

Illustrated in Fig. 3(c), the neurons in a SOM are arranged in a 2-D lattice, with each neuron having full connections to all the source nodes in the input layer. Each neuron has a vector  $w_i$  of weights associated

 $w_i = \{w_{ij}, j = 1, \ldots, n\}$ 

where *n* is the number of source nodes. A SOM involves the  $_{394}$  concept of a winning neuron. For a input *x*, the winning neuron  $_{395}$  is defined as  $_{396}$ 

$$\dot{u}_{\min} = \arg\min_{i} \|\boldsymbol{x} - \boldsymbol{w}_i\|. \tag{11} \quad \text{397}$$

The weight vector of the winning neuron is more similar <sup>398</sup> to x than the others, and it will undergo weight adjustment <sup>399</sup> to enhance its responsiveness to the input when encountered <sup>400</sup> again. Meanwhile, the weights of the neurons close to the <sup>401</sup> winning one need to be adjusted to produce a stronger response <sup>402</sup> to x in order to preserve the topological order. So the weights <sup>403</sup> of each neuron are adjusted according to the following rule: <sup>404</sup>

$$w_i = w_i + \eta(t)h(i_{win})(x - w_i)$$
 (12) 405

where *t* is the number of iterations,  $\eta(t)$  is a learning rate, 406 which decreases as *t* increases, and  $h(i_{win})$  is a neighborhood 407 function which has high values for  $i_{win}$  and the neurons close 408 to  $i_{win}$  on the lattice. Usually, a Gaussian centered on  $i_{win}$  is 409 chosen as the neighborhood function. Algorithm 2 outlines the 410 training process of a SOM.

2) SOM Tree: To further improve optimization efficiency <sup>412</sup> and obtain the optimal solutions, we propose a SOM tree. <sup>413</sup> Through the SOM tree, we identify the most suitable ROI <sup>414</sup> for simulation. When optimization results cease to improve, <sup>415</sup> ROI-HIT will choose the most promising ROI for further <sup>416</sup> subdivision and explore inside the smaller ROIs. If ROI-HIT <sup>417</sup> converges to the local optimal, which means the SOM tree <sup>418</sup> reaches the maximum depth, the current ROI will return to <sup>419</sup> its parent node and perform a random restart. As a result, the <sup>420</sup> SOM tree improves optimization efficiency in the early stage <sup>421</sup> and yields a global optimization performance guarantee. <sup>422</sup>

As shown in Fig. 4(a), each node in the SOM tree represents 423 an ROI. As the overlapped ROIs should be merged into the 424 same ROI, the ROIs are independent of each other. At each 425 level, all ROIs are acquired by the same SOM network, which 426 is trained during the construction of nodes at that level. The 427 size of this SOM network is determined by the point number 428 and dimension of the dataset. In this work, the initial dataset 429 size is 100, and dimensionality reduction yields a design space 430

| Al  | Algorithm 2: Train SOM Network                                    |  |  |  |  |  |  |  |
|-----|-------------------------------------------------------------------|--|--|--|--|--|--|--|
| I   | <b>nput</b> : The dataset $X$ , the maximum iterations $N_{iter}$ |  |  |  |  |  |  |  |
| 1 I | nitialize the weights of each neuron at random;                   |  |  |  |  |  |  |  |
| 2 f | or $t = 0 \rightarrow N_{iter} - 1$ do                            |  |  |  |  |  |  |  |
| 3   | Randomly choose an input $x \in X$ ;                              |  |  |  |  |  |  |  |
| 4   | Determine the winning neuron by (11);                             |  |  |  |  |  |  |  |
| 5   | Adapt the weights of each neuron by (12);                         |  |  |  |  |  |  |  |
| 6 e | nd                                                                |  |  |  |  |  |  |  |

<sup>431</sup> of around 10 dimensions. Since an ROI occupies at least 4 <sup>432</sup> neighboring feature regions mapped by the SOM network, and <sup>433</sup> each feature region contains a similar number of data points, <sup>434</sup> we set the SOM network size as  $5 \times 5$ . This ensures that each <sup>435</sup> ROI contains at least 16 data points, facilitating subsequent <sup>436</sup> BO [25].

At each level, we select the ROI for optimization based on H38 its figure of merit (FOM)

FOM = 
$$N_{\text{Succ}} - N_{\text{Fail}} + \text{HVC/HVC}_{\text{max}}$$
 (13)

<sup>440</sup> where  $N_{\text{Succ}}$  is the number of times that the ROI finds a new <sup>441</sup> Pareto point and  $N_{\text{Fail}}$  is the number of times that the ROI <sup>442</sup> fails to find a new Pareto point. HVC is the largest HVC of <sup>443</sup> the Pareto points inside the ROI, HVC<sub>max</sub> is the largest HVC <sup>444</sup> among all Pareto points. The SOM tree updates  $N_{\text{Succ}}$  and <sup>445</sup>  $N_{\text{Fail}}$  as follows:

$$\begin{cases} N_{\text{Succ}} += 1, \ N_{\text{Fail}} = 0, \text{ if ROI gets a new PF} \\ N_{\text{Succ}} = 0, \ N_{\text{Fail}} += 1, \text{ otherwise.} \end{cases}$$
(14)

<sup>447</sup> By the FOM, ROI-HIT favors exploring regions that are more <sup>448</sup> likely to improve the PF.

If an ROI fails to find a new Pareto point for  $N_{\text{Fail}_{\text{max}}}$ consecutive times, namely,  $N_{\text{Fail}} \ge N_{\text{Fail}_{\text{max}}}$ , we define that the solution of the explored. As Fig. 4(b) shows, when all ROIs at the current level are well explored, we will split the best ROI, anamely, the ROI with HVC<sub>max</sub>. A new SOM network will be trained inside this ROI, and new ROIs will be acquired. These ROIs will be child nodes of the original ROI. In our approach, the new SOM network has the same size as the roriginal one. To ensure sufficient data points in the new ROI, the split is only performed when the number of data points in the ROI exceeds the number of data points in the initial dataset; otherwise, ROI-HIT continues to select ROIs based det on the FOM value.

<sup>462</sup> When the range of ROIs has become sufficiently small, i.e., <sup>463</sup> the depth of the SOM tree has reached *Depth*<sub>max</sub>, and all ROIs <sup>464</sup> are well explored, we consider these ROIs to have converged. <sup>465</sup> As is shown in Fig. 4(c), ROI-HIT will return to the parent <sup>466</sup> node and perform  $N_{RS}$  rounds of random search to explore <sup>467</sup> unexplored regions. If new Pareto points are found, the ROIs <sup>468</sup> will be updated and a new optimization iteration will begin. If <sup>469</sup> no new Pareto points are found, the search will proceed to an <sup>470</sup> upper-level node until a random search is performed globally.

#### 471 C. Variable Selection Based on Sensitivity Matrix

<sup>472</sup> For high-dimensional DSE, the time complexity for <sup>473</sup> modeling and optimization increases linearly or faster with the



Fig. 4. Illustration of SOM Tree. (a) Each node represents an ROI. Since the number of ROIs obtained from ROI-HIT is not fixed, the number of child nodes at each level varies. At the same level, all ROIs are acquired by the same SOM network. The next ROI to be optimized is selected based on its FOM value. (b) If all the ROIs at the current level are well explored, the best ROI will be split. A new SOM network will be trained inside the split ROI, and the smaller ROIs will be acquired via the new SOM for more detailed exploration. (c) If the ROIs are small enough and well explored, ROI-HIT will restart, which means the current ROI returns to its parent node and conducts a random search.

number of dimensions. The existing methods [17], [18], [19] 474 use embeddings to map the original high-dimensional space 475 to a low-dimensional intrinsic space, thus reducing the 476 dimensionality. However, these methods require specifying 477 the effective dimension number in advance, which can be 478 challenging since determining the accurate intrinsic dimension 479 of the problem is often difficult. 480

Meanwhile, different microarchitecture design parameters <sup>481</sup> have distinct influences on the PPA results. For instance, a <sup>482</sup> modification in *DecodeWidth* significantly affects the PPA <sup>483</sup> results, whereas the Boolean value of *EnableSFBOpt* makes <sup>484</sup> a minor impact. However, the embedding-based methods <sup>485</sup> ignore the different importance of design parameters on <sup>486</sup> performances. Moreover, the relative importance of different <sup>487</sup> design parameters varies across applications. Even seasoned <sup>488</sup> key parameters for a specific scenario due to the increasing <sup>490</sup> complexity of RISC-V SoCs and the opaque optimization <sup>491</sup> methods employed by EDA tools. <sup>492</sup>

Our proposed ROI-HIT obtains a sensitivity matrix to evaluate the importance of different variables according to the input variables and their performances. Unlike existing methods that require specifying the effective dimension number, ROI-HIT eliminates unimportant variables and automatically adjusts the number of dimensions to reduce optimization time.

A sensitivity matrix S [26] is calculated to measure the 499 importance of variables 500

$$S = \{s_{ij} \mid i = 1, \dots, n, \ j = 1, \dots, M\}$$
(15) 501

where *n* denotes the number of variables, namely, the dimension *d* of input vector  $\mathbf{x}$ , and *M* is the number of objectives. <sup>503</sup> The sensitivity coefficient  $s_{ij}$  is defined as <sup>504</sup>

$$s_{ij} = \frac{\operatorname{Var}\left[E\left(y_{j}|x_{i}\right)\right]}{\operatorname{Var}\left(y_{j}\right)} \approx \frac{\sum_{l=1}^{r_{i}} |A_{i}^{l}| \left[\left(\bar{y}_{j}\right)_{i}^{l} - \bar{y}_{j}\right]^{2}}{\sum_{\boldsymbol{x} \in X} \left[y_{j}(\boldsymbol{x}) - \bar{y}_{j}\right]^{2}}$$
 505

(16)



Fig. 5. Illustration of the VS based on the sensitivity matrix for RISV-V BOOM microarchitecture [6]. The maximum importance value ( $V_{max}$ ) represents the peak importance, and the threshold for VS is determined by the product of p and  $V_{max}$ .

506 
$$\bar{y}_j = \frac{1}{|X|} \sum_{\mathbf{x} \in X} y_j(\mathbf{x}), \quad (\bar{y}_j)_i^l = \frac{1}{|A_i^l|} \sum_{\mathbf{x} \in A_i^l} y_j(\mathbf{x})$$

<sup>507</sup> where *X* represents the dataset, and  $r_i$  represents the number <sup>508</sup> of value which  $x_i$  can be and  $A_i^l$  represents the set of points <sup>509</sup> for which the variable  $x_i$  is set to its *l*th value.

<sup>510</sup> By definition,  $s_{ij}$  is a value in the range of 0 to 1. The value <sup>511</sup> closer to 1 indicates that variable  $x_i$  has a more significant <sup>512</sup> effect on objective  $y_j$ , whereas the value closer to 0 indicates <sup>513</sup> that variable  $x_i$  has a less significant effect on variable  $y_j$ .

For multiobjective optimization, the importance of different 515 objectives should be integrated. In order to obtain a better 516 PF, we hope to improve the shortcomings of the current 517 optimal results. So, the weight to the *j*th objective  $w_j$  and the 518 importance value  $V_i$  of the *i*th variable are defined as

519 
$$w_j = \frac{y_{j_{\text{best}}} - y_{j_{\min}}}{y_{j_{\text{avg}}}}, \quad j = 1, \dots, M$$
 (17)

$$V_i = \sum_{j=1}^{m} (w_j \cdot s_{ij}) + C_p \sqrt{1/n_i}$$

520

$$= \sum_{j=1}^{m} (w_j \cdot s_{ij}) + C_p \sqrt{1/n_i}$$
(18)

<sup>521</sup> where  $y_{j_{\text{best}}}$  is the *j*th objective value of the Pareto point <sup>522</sup> with the largest HVC inside the ROI,  $y_{j_{\min}}$  is the minimum <sup>523</sup> *j*th objective value among all points,  $y_{j_{avg}}$  is the average *j*th <sup>524</sup> objective value of the whole dataset,  $n_i$  is the number that the <sup>525</sup> variable  $x_i$  has been selected as the important variable, and  $C_p$ <sup>526</sup> is a hyper-parameter to make a tradeoff between exploitation <sup>527</sup> and exploration. For microarchitecture DSE,  $C_p$  is set as 0.01. <sup>528</sup> A variable is considered important if its importance value  $V_i$ <sup>529</sup> surpasses the threshold  $V_{th}$  which is determined by the product <sup>530</sup> of a hyper-parameter *p* and the maximum importance value <sup>531</sup> observed  $V_{\max}$ , as shown in (19) and Fig. 5

532 
$$D_{ipt} = \{i \mid i \in D, \ V_i \ge V_{th}\}, \ V_{th} = p \cdot V_{max}$$
 (19)

<sup>533</sup> where  $D = \{i | i = 1, ..., n\}$ . Moreover, if the number of <sup>534</sup> important variables is less than a lower bound  $N_{ipt_L}$ , the largest <sup>535</sup>  $N_{ipt_L}$  variables are selected as important variables to ensure <sup>536</sup> the effectiveness of modeling. Conversely, if the number of <sup>537</sup> important variables exceeds an upper bound  $N_{ipt_U}$ , the  $N_{ipt_U}$ <sup>538</sup> variables with the highest-importance values are chosen as <sup>539</sup> important variables to avoid excessively long optimization <sup>540</sup> time. In this work, the number of important variables is con-<sup>541</sup> strained to be between 3 and 20. According to the theoretical <sup>542</sup> analysis in [27], the greater the importance of the unselected <sup>543</sup> variables, the higher the cumulative regret. Therefore, to enhance optimization efficiency, p in (19) is set to 0.5, thereby <sup>544</sup> limiting the importance of the unselected variables. <sup>545</sup>

Besides, unimportant variables are fixed to the values of the 546 Pareto points inside the current ROI. As there may be more 547 than one Pareto point, unimportant variables are transformed 548 into a discrete auxiliary variable  $var_{aux}$ , which is the index of 549 the closest Pareto point 550

$$var_{aux} = \arg\min_{\mathbf{x}} \|\mathbf{x} - \mathbf{x}_{PS_k}\|. \tag{20} \quad 55$$

We utilize both the important variables and the auxiliary <sup>552</sup> variable to construct models for subsequent optimization. <sup>553</sup>

#### D. Asynchronous Batch Multiobjective BO 554

To yield precise PPA results for RISC-V processors, a 555 VLSI flow is employed. However, a single VLSI flow can 566 take 1-6 h to complete. Since the simulations are independent, simulations can be run in parallel. However, various 558 design parameters result in different simulation time, so the 559 synchronous parallel/batch method must wait for the slowest 560 simulation in a batch to complete before starting a new 561 iteration. Consequently, an asynchronous batch strategy is 562 employed to reduce simulation time and further enhance 563 exploration efficiency. 564

The asynchronous parallelism means that a new 565 optimization iteration starts before all the simulation points 566 are returned. Thus, it is essential to carefully select candidate 567 points to avoid repeated simulations. The EHVI [22] 568 acquisition function is utilized to choose the candidate points. 569

When selecting candidates, if the simulation of  $x_i$  does not 570 return,  $x_i$  will be placed into the pending point set to create a 571 pseudo PS. The GPR models are left unchanged. At this time, 572 the EHVI value of  $x_i$  will become 0 according to the definition 573 of EHVI in (9). Then, the next candidate  $x_{i+1}$  will be obtained 574 by the EHVI acquisition function. So,  $x_{i+1}$  is distinct from 575 existing points. 576

Since the SOM network is unable to perform a reverse 577 transformation from the feature space to the design space, hillclimbing local search [18] is employed for acquisition function 579 optimization. When it comes to the auxiliary dimension, the 580 points in the PS are chosen as the corresponding discrete input 581 values to guarantee that the candidate points correspond to the 582 points in the original high-dimensional design space. 583

#### IV. EXPERIMENTAL RESULTS

584

585

#### A. Experimental Setup

We explore the design spaces for a RISC-V BOOM [6] 566 core and a Rocket core [7]. Both of them are RV64GC 567 configurations. To assess the PPA results, all circuits in the 568 experiments are SoCs featuring either the BOOM core or 569 Rocket core as their processor cores. The peripheral modules 590 of these SoCs are standard components provided by the 591 Chipyard [3] framework. A 7nm ASAP7 PDK [28] is utilized 592 in the VLSI flow. Since the designs based on ASAP7 are 593 not manufacturable, this article only achieves the front-end 594 design in the experiments. The RTL designs are generated 595 by the Chipyard framework, and the designs are simulated 596

TABLE I DESIGN PARAMETERS

| Name                  | Value                    | Name                                | Value                |  |  |
|-----------------------|--------------------------|-------------------------------------|----------------------|--|--|
|                       | OM (22)                  | Genus (31)                          |                      |  |  |
| BPD                   | Boom2, TAGEL, Alpha21264 | time_recovery_arcs                  | false,true           |  |  |
| fetchWidth            | 4,8                      | auto_partition                      | true,false           |  |  |
| numFetchBufferEntries | 8,16,24,32,35,40         | dp_analytical_opt                   | off,standard,extreme |  |  |
| numRasEntries         | 16,24,32                 | dp_area_mode                        | false,true           |  |  |
| maxBrCount            | 8,12,16,20               | dp_csa                              | basic,none           |  |  |
| decodeWidth           | 1,2,3,4,5                | dp_rewriting                        | basic,advanced,none  |  |  |
| numRobEntries         | 32,64,96,128,130         | dp_sharing                          | advanced,basic,none  |  |  |
| numIntPhysRegisters   | 48,64,80,96,112          | exact_match_seq_async_ctrls         | false,true           |  |  |
| memIssueWidth         | 1,2                      | exact_match_seq_sync_ctrls          | false,true           |  |  |
| intIssueWidth         | 1,2,3,4,5                | iopt_enable_floating_output_check   | false,true           |  |  |
| numLdqEntries         | 8,16,24,32               | iopt_force_constant_removal         | false,true           |  |  |
| enablePrefetching     | false,true               | iopt_lp_power_analysis_effort       | low,medium,high      |  |  |
| enableSFBOpt          | false,true               | lbr_seq_in_out_phase_opto           | false,true           |  |  |
| numRXQEntries         | 2,4,8                    | optimize_net_area                   | true,false           |  |  |
| numRCQEntries         | 8,16,32                  | retime_effort_level                 | low,medium,high      |  |  |
| nL2TLBEntries         | 128,256,512,1024         | retime_optimize_reset               | false,true           |  |  |
| nL2TLBWays            | 1,2,4,8                  | syn_generic_effort                  | low,medium,high      |  |  |
| nICacheWays           | 2,4,8                    | syn_map_effort                      | low,medium,high      |  |  |
| nICacheTLBWays        | 8,16,32                  | syn_opt_effort                      | low,medium,high      |  |  |
| nDCacheWays           | 2,4,8                    | leakage_power_effort                | low,medium,high      |  |  |
| nDCacheMSHRs          | 2,4,8                    | lp_power_analysis_effort            | low,medium,high      |  |  |
| nDCacheTLBWays        | 8,16,32                  | enable_latch_max_borrow             | false,true           |  |  |
|                       | ocket (9)                | latch_max_borrow                    | 1:500:1              |  |  |
| mulUnroll             | 1,2,4,8                  | enable_max_fanout                   | false,true           |  |  |
| divUnroll             | 1,2,4,8                  | max_fanout                          | 2:64:1               |  |  |
| mulEarlyOut           | false,true               | enable_lp_clock_gating_max_flops    | false,true           |  |  |
| divEarlyOut           | false,true               | lp_clock_gating_max_flops           | 8,16,32              |  |  |
| nDCacheWays           | 1,2,4                    | enable_lp_power_optimization_weight | false,true           |  |  |
| nDCacheTLBWays        | 4,8,16                   | lp_power_optimization_weight        | 0.1:0.9:0.1          |  |  |
| nDCacheMSHRs          | 1,2,4                    | max_dynamic_power                   | 3.0:150.0:0.1        |  |  |
| nICacheWays           | 1,2,4                    | max_leakage_power                   | 5.0:100.0:0.1        |  |  |
| nICacheTLBWays        | 4,8,16                   |                                     |                      |  |  |

<sup>597</sup> via an open-source simulator Verilator [10] to get CPI as the <sup>598</sup> performance metric. Cadence Genus Synthesis Solution 19.10 <sup>599</sup> is employed for logic synthesis. We set the clock frequency to 600 250 MHz for low-power embedded system applications. The 601 area and power are obtained by Genus. The power metric is 602 the total power. In this article, we refer to the time elapsed 603 from inputting design parameters to obtaining PPA results <sup>604</sup> as simulation time; the time spent running the algorithm is 605 referred to as optimization time. The sum of these two time 606 constitutes the total time.

To evaluate the performance of the designs, 13 benchmarks 607 608 are selected, including dhrystone, median, mm, mt-matmul, 609 mt-vvadd, multiply, qsort, rsort, spmv, towers, vvadd, fir2dim, 610 and whetstone. The first 11 benchmarks are from riscv-611 test,<sup>1</sup> a benchmark suite by RISC-V International designed to 612 evaluate RISC-V processor performance. These benchmarks 613 can well cover the RISC-V instructions and are employed in 614 BOOM-Explorer [13] and GRL-DSE [14]. BOOM-Explorer 615 also utilizes *fir2dim*<sup>2</sup> and *whetstone*<sup>3</sup> to evaluate digital signal 616 processing (DSP) and floating-point performance, respectively. 617 So, we adopt them for assessing our processor's capabilities. 618 The optimizations are to minimize the average CPI of bench-619 marks, power consumption, and area of the generated design. 620 All experiments take place on a Linux workstation equipped with 80 Intel Xeon Gold 6230 CPUs @2.10 GHz. 621

As shown in Table I, the design parameters encompass the 622 623 22 parameters of BOOM, the nine parameters of Rocket, and 624 the 31 parameters of Genus, all represented as discrete values. 625 Namely, the dimension numbers of design spaces for BOOM and Rocket are 53 and 40, respectively. The sizes of the design space for BOOM and Rocket are around  $6.96 \times 10^{32}$  and  $4.06 \times$ 628 10<sup>25</sup>, respectively.

We compare our ROI-HIT method with five state-of-the-art 629 630 methods: 1) BOOM-Explorer [13], a sequential BO method for

TABLE II ABLATION STUDY OF ROI-HIT BASED ON RISC-V BOOM MICROARCHITECTURE DSE

| Algorithm  | $ \begin{array}{c} \uparrow  \mathrm{HV} \\ (\times 10^5) \end{array} $ | Impro. | Opt.<br>Time | Speed<br>up    | # Sim | Multi.        |
|------------|-------------------------------------------------------------------------|--------|--------------|----------------|-------|---------------|
| Vanilla BO | 1.7240                                                                  | 0.00%  | 23493s       | $1.00 \times$  | 13    | $1.00 \times$ |
| BO+ROI     | 1.7291                                                                  | 0.30%  | 22990s       | $1.02 \times$  | 13    | $1.00 \times$ |
| BO+VS      | 1.7565                                                                  | 1.89%  | 324s         | 72.51 	imes    | 15    | $1.15 \times$ |
| ROI-HIT-1  | 1.7887                                                                  | 3.76%  | 876s         | $26.83 \times$ | 15    | $1.15 \times$ |
| ROI-HIT-4S | 1.8015                                                                  | 4.50%  | 2880s        | $8.16 \times$  | 44    | $3.38 \times$ |
| ROI-HIT-4  | 1.8379                                                                  | 6.61%  | /            | /              | 53    | 4.08 	imes    |

microarchitecture DSE based on an efficient initial sampling 631 method MicroAL; 2) GRL-DSE [14], another sequential BO 632 method for microarchitecture DSE that utilizes a GNN to map 633 the design space into graph embedding space; 3) BODi [18], 634 a BO method for high-dimensional discrete optimization 635 problem that employs dictionary embedding; 4) MORBO [16], 636 an efficient multiobjective BO method that is based on trust 637 regions; and 5) REMOTune [19], a recent BO method for high- 638 dimensional DSE that combines the trust regions and random 639 embedding. Since the MORBO and REMOTune support the 640 parallel simulation in a batch, these two methods are compared 641 in both sequential mode and batch mode. For clarity, we 642 label the batch size after the name of the algorithms. In these 643 experiments, the batch size is set as 4. All algorithms are 644 implemented in Python. 645

Due to the untraversable high-dimensional design space, 646 BOOM-Explorer and GRL-DSE randomly select 15 000 sam- 647 ple points from the entire design space to train MicroAL and 648 GNN. Additionally, we extend the original BODi from single- 649 objective optimization to multiobjective optimization by using 650 the EHVI acquisition function [22]. In order to gain an insight 651 into the high-dimensional design space, we set the initial 652 dataset size to 100. Since the points in the initial dataset are 653 independent, they can be simulated in parallel. Consequently, 654 through a 10-process parallel execution, the initial dataset can 655 be acquired in one day. For a fair comparison, all algorithms 656 are initiated with an identical set of designs sampled from 657 the design space utilizing the MicroAL algorithm in [13]. 658 The maximum time budget for the experiments is one day 659 (86400s). The reference point to calculate the HV is set as the 660 maximum CPI, power, and area values in the initial dataset. 661 The experiments are conducted five times, and the average 662 results are reported. 663

#### B. Ablation Study

To verify the effectiveness of the ROI acquisition method 665 and VS, we first conduct ablation experiments of our 666 ROI-HIT method on the RISC-V BOOM microarchitecture 667 DSE problem. We conduct a comparison between a vanilla 668 BO, a BO method with ROI (BO+ROI), a BO method based 669 on VS (BO+VS), a sequential ROI-HIT-1, a synchronous 670

The experimental results are presented in Table II. 1) Effectiveness of ROI and VS: ROIs offer a more promis- 673 ing local scope, while VS focuses on important variables, such 674 as DecodeWidth and IntIssueWidth, and mitigates interference 675

batch ROI-HIT-4S, and an asynchronous batch ROI-HIT-4. 671

664

672

<sup>&</sup>lt;sup>1</sup>https://github.com/riscv-software-src/riscv-tests

<sup>&</sup>lt;sup>2</sup>https://www.ice.rwth-aachen.de/research/tools-projects/closedprojects/dspstone

<sup>&</sup>lt;sup>3</sup>https://netlib.org/benchmark/whetstone.c



Fig. 6. Optimization results obtained by adjusting the EDA tool parameters while keeping the architecture parameters unchanged. The blue points denote the original official designs (*SmallBoom* and *MediumBoom* are from [6]. *BigRocket* and *MedRocket* are from [7].) The orange points denote the Pareto-optimal designs from optimizing the EDA tool (Genus) parameters by ROI-HIT-4. The original design and the optimized designs have the same architecture parameters, so their RTL netlists and CPIs are the same.

TABLE III Optimization Results and Corresponding Time for RISC-V BOOM Microarchitecture DSE

| Algorithm                          | BOOM-Explorer<br>[13] | GRL-DSE<br>[14]        | BODi<br>[18]           | MORBO-1<br>[16]        | REMOTune-1<br>[19]     | ROI-HIT-1                    | MORBO-4<br>[16]      | REMOTune-4<br>[19]     | ROI-HIT-4                    |
|------------------------------------|-----------------------|------------------------|------------------------|------------------------|------------------------|------------------------------|----------------------|------------------------|------------------------------|
| $\uparrow$ HV                      | 172391                | 172540                 | 172871                 | 172781                 | 172428                 | 178872                       | 175585               | 172551                 | 183795                       |
| Improvement                        | 0.00%                 | 0.09%                  | 0.28%                  | 0.23%                  | 0.02%                  | 3.76%                        | 1.85%                | 0.09%                  | 6.61%                        |
| $\downarrow \widehat{\text{ADRS}}$ | 35280                 | 55300                  | 47351                  | 54191                  | 50092                  | 29569                        | 37934                | 56007                  | 21538                        |
| Improvement                        | 0.00%                 | -56.75%                | -34.21%                | -53.60%                | -41.98%                | 16.19%                       | -7.52%               | -58.75%                | 38.95%                       |
| ↓ Min. CPI                         | 1.3919                | 0.9080                 | 0.9312                 | 0.7924                 | 1.0531                 | 0.8946                       | 0.8159               | 0.9501                 | 0.7820                       |
| Improvement                        | 0.00%                 | 34.77%                 | 33.10%                 | 43.07%                 | 24.34%                 | 35.73%                       | 41.38%               | 31.74%                 | 43.82%                       |
| ↓ Min. Power                       | 10.763mW              | 13.440mW               | 10.048mW               | 10.170mW               | 11.945mW               | 9.102mW                      | 9.341mW              | 12.274mW               | 8.978mW                      |
| Improvement                        | 0.00%                 | -24.88%                | 6.64%                  | 5.51%                  | -10.98%                | 15.43%                       | 13.21%               | -14.04%                | 16.58%                       |
| ↓ Min. Area                        | 0.0688mm <sup>2</sup> | 0.0772 mm <sup>2</sup> | 0.0710 mm <sup>2</sup> | 0.0709 mm <sup>2</sup> | 0.0734 mm <sup>2</sup> | <b>0.0688mm</b> <sup>2</sup> | $0.0692 \text{mm}^2$ | 0.0743 mm <sup>2</sup> | <b>0.0684mm</b> <sup>2</sup> |
| Improvement                        | 0.00%                 | -12.16%                | -3.20%                 | -2.93%                 | -6.59%                 | 0.08%                        | -0.58%               | -7.97%                 | 0.65%                        |
| Opt. Time                          | 33065s                | 4291s                  | 10456s                 | 1175s                  | 1227s                  | 876s                         | 1416s                | 1152s                  | /                            |
| Speed up                           | $1.00 \times$         | $7.71 \times$          | $3.16 \times$          | $28.13 \times$         | $26.94 \times$         | 37.76×                       | $23.35 \times$       | $28.70 \times$         | /                            |
| # Sim                              | 11                    | 8                      | 11                     | 14                     | 13                     | 15                           | 52                   | 42                     | 53                           |
| Multiple                           | $1.00 \times$         | $0.73 \times$          | $1.00 \times$          | $1.27 \times$          | $1.18 \times$          | 1.36 	imes                   | $4.73 \times$        | $3.82 \times$          | 4.82 	imes                   |

676 from noncritical dimensions during ROI-driven optimization. 677 ROIs and VS improve, respectively, 0.30% and 1.89% HV 678 compared with vanilla BO. The combination of ROIs and VS 679 thus greatly enhances the results of ROI-HIT. The sequential 680 ROI-HIT-1 achieves 3.76%, 3.45%, and 1.83% improvement 681 in HV compared with vanilla BO, BO with ROI, and BO 662 with VS.

2) Effectiveness of Batch Simulation: The synchronous and asynchronous batch simulation methods, through increased exploration points, further enhance the exploration results within the limited time budget. The asynchronous ROI-HIT-4 with a batch of 4 achieves, respectively, 6.61%, 2.75%, and 2.02% improvement in HV compared with vanilla BO, sequential ROI-HIT-1, and synchronous ROI-HIT-4S. By exploiting the idle time gap between different simulations in a batch, asynchronous ROI-HIT-4 explores  $1.20\times$  more points than synchronous ROI-HIT-4S.

3) Optimization Time: The VS notably reduces the 693 optimization time by significantly reducing the dimensions and 694 obtains a speedup of  $72.51 \times$  in optimization time compared to 695 vanilla BO. Benefiting from the VS, the sequential ROI-HIT-1 696 achieves optimization time speedups of 26.83× compared to 697 vanilla BO. However, due to asynchronous batch simulation, 698 where simulation time overlaps with optimization time, it is 699 <sup>700</sup> not feasible to quantify the optimization time for asynchronous 701 ROI-HIT-4.



Fig. 7. Optimization results for RISC-V BOOM microarchitecture DSE.

4) *Effect of EDA Tool Parameters:* Although the CPI of the 702 design is determined by the RTL netlist generated by the archi-703 tecture parameters, differing EDA tool configurations result 704 in variations in power and area. This is illustrated in Fig. 6, 705 where the blue points represent the power and area resulting 706 from logic synthesis using default parameters, while the orange 707 points represent the PF achieved by utilizing ROI-HIT-4 to 708 optimize EDA tool parameters. The experiment suggests that 709 optimizing EDA tool parameters achieves 21.81%–28.77% 710 improvement in power and 0.79%–1.68% improvement in 711 area.

## C. RISC-V BOOM Microarchitecture DSE 713

The optimization results are shown in Table III and Fig. 7. 714 Min. CPI, Min. Power, and Min. Area in the table denote 715



Fig. 8. Optimization results for rocket microarchitecture DSE.

<sup>716</sup> the minimum values of CPI, power, and area achieved by the
<sup>717</sup> optimization. In addition to HV, we also use average distance
<sup>718</sup> to reference set (ADRS) to provide a comprehensive evaluation
<sup>719</sup> of optimization results

ADRS(
$$\Gamma, \Omega$$
) =  $\frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} \min_{\omega \in \Omega} d(\gamma, \omega)$  (21)

<sub>721</sub> where d is the Euclidean distance function.  $\Gamma$  is the real PF  $_{722}$  and  $\Omega$  is the optimized PF. However, due to the vastness of 723 our design space, it is not feasible to obtain a real PF. To address this issue, we consider all the points evaluated in the 724 725 experiments as the total dataset, utilizing the PF derived from 726 it as the reference set to calculate the ADRS. Compared with 727 the state-of-the-art methods, ROI-HIT obtains the largest HV and the least ADRS in both sequential and batch mode. The 728 sequential ROI-HIT-1 achieves 3.47%-3.76% improvement in 729 730 HV compared with sequential baselines, while asynchronous batch ROI-HIT-4 achieves 4.67% and 6.51% improvement 731 732 in HV compared with MORBO-4 and REMOTune-4. ROI-733 HIT-1 improves ADRS by 16.19%-46.53% over sequential <sub>734</sub> baselines, while ROI-HIT-4 improves  $\widehat{ADRS}$  by 43.22% and 735 61.54% over MORBO-4 and REMOTune-4. In terms of PPA metrics, ROI-HIT-4 gains improvements of 4.16%-43.82% in 736 CPI, 3.88%–33.20% in power, and 0.65%–11.41% in area, 737 compared to other methods. Besides, ROI-HIT-1 takes the 738 739 least optimization time among all methods, and ROI-HITexplores the most points within the limited time budget. 4 740 As shown in Fig. 7, compared to state-of-the-art methods, 741 ROI-HIT-4 achieves a speed-up of  $3.58 \times -13.52 \times$  in reaching 742 743 the maximum HV that the baselines achieve. Fig. 9(a) shows 744 the PFs obtained from a sole run of MORBO-4, REMOTune-745 4, and ROI-HIT-4. ROI-HIT-4 gains the best PF compared to 746 MORBO-4 and REMOTune-4.

To verify the effectiveness of the optimization, we compare 747 748 the Pareto-optimal designs obtained from the experiment 749 with the official BOOM designs (Small BOOM and Medium 750 BOOM) from [6]. As Table IV shows, ROI-HIT-4 achieves 751 improvements of 19.21% in CPI, 25.52% in power, and 1.61% in area compared to Small BOOM, while ROI-HIT-4 752 753 achieves improvements of 0.30% in CPI, 30.05% in power, and 2.71% in area compared to Medium BOOM. Through the 754 VS, we identify the eight most important design parameters: 1) <sup>756</sup> decodeWidth<sup>B</sup>; 2) BPD<sup>B</sup>; 3) intIssueWidth<sup>B</sup>; 4) fetchWidth<sup>B</sup>; numIntPhysRegisters<sup>B</sup>; 6) memIssueWidth<sup>B</sup>; 757 5) 7)



Fig. 9. PFs obtained by, respectively, MORBO-4 [16], REMOTune-4 [19], and our proposed ROI-HIT-4. (a) BOOM. (b) Rocket.

numRasEntries<sup>B</sup>; and 8) leakage\_power\_effort<sup>G</sup>. Parameters <sup>758</sup> with a superscript B belong to BOOM parameters, while <sup>759</sup> the parameter with a superscript G belongs to Genus <sup>760</sup> parameters. Increasing the decodeWidth<sup>B</sup> can significantly <sup>761</sup> improve CPI but at the cost of higher-power consumption <sup>762</sup> and area. Similarly, the choice of prediction branch (BPD<sup>B</sup>) <sup>763</sup> affects CPI, power consumption, and area. In terms of <sup>764</sup> EDA tool parameters, logic synthesis with a higher- <sup>765</sup> leakage\_power\_effort<sup>G</sup> yields significantly better-power <sup>766</sup> optimization results compared to the official design at the <sup>767</sup> default setting. <sup>768</sup>

769

#### D. Rocket Microarchitecture DSE

The optimization results are shown in Table V and Fig. 8. 770 Compared with the state-of-the-art methods, ROI-HIT obtains 771 the largest HV and the least ADRS in both sequential and 772 batch mode. The ROI-HIT-1 improves HV by 5.49%-6.54% 773 over sequential baselines, while ROI-HIT-4 achieves HVIs of 774 8.24% and 7.72% compared to MORBO-4 and REMOTune-4. 775 ROI-HIT-1 improves ADRS by 39.11%-46.97% over sequen- 776 tial baselines, while ROI-HIT-4 improves ADRS by 63.95% 777 and 59.53% over MORBO-4 and REMOTune-4. In terms of 778 PPA metrics. ROI-HIT-4 achieves improvements of 0.00%- 779 16.27% in CPI, 3.87%-22.72% in power, and 0.16%-0.57% 780 in area, compared to other methods. Besides, ROI-HIT-1 takes 781 the least optimization time among all methods, and ROI-HIT- 782 4 explores the most points within the limited time budget. As 783 shown in Fig. 8, compared to state-of-the-art methods, ROI- 784 HIT-4 achieves a speed-up of  $16.13 \times -17.23 \times$  in reaching the 785 maximum HV that the baselines achieve. Fig. 9(b) shows the 786 PFs obtained from a sole run of MORBO-4, REMOTune-4, 787 and ROI-HIT-4. ROI-HIT-4 gains the best PF compared to 788 MORBO-4 and REMOTune-4. 789

We also compare the Pareto-optimal designs obtained from 790 the experiment with the official Rocket designs (*BigCore* 791 and *MedCore*) from [7]. As Table VI shows, ROI-HIT-4 792 improves CPI by 0.32%, power by 18.13%, and area by 0.81% 793 compared with BigCore, while ROI-HIT-4 improves CPI by 794 18.04%, power by 25.32%, and area by 0.01% compared with 795 MedCore. Through the VS, we identify the eight most important design parameters: 1) nDCacheWays<sup>R</sup>; 2) divUnroll<sup>R</sup>; 797 3) nICacheWays<sup>R</sup>; 4) mulUnroll<sup>R</sup>; 5) syn\_opt\_effort<sup>G</sup>; 796 6) leakage\_power\_effort<sup>G</sup>; 7) nICacheTLBWays<sup>R</sup>; and 8) 799 dp\_analytical\_opt<sup>G</sup>. Parameters with a superscript R belong 800 to Rocket parameters, while the parameters with a superscript 801

TABLE IV PARETO-OPTIMAL DESIGNS FOR RISC-V BOOM MICROARCHITECTURE

| Design           | Important Variables*     | $\downarrow$ CPI | Improvement | ↓ Power  | Improvement | ↓ Area                 | Improvement |
|------------------|--------------------------|------------------|-------------|----------|-------------|------------------------|-------------|
| Small BOOM [6]   | [0, 1, 0, 0, 0, 0, 2, 0] | 1.3725           | 0.00%       | 13.844mW | 0.00%       | 0.0722 mm <sup>2</sup> | 0.00%       |
| MORBO-4 [16]     | [1, 1, 1, 1, 1, 1, 1, 1] | 1.0719           | 21.90%      | 11.587mW | 16.30%      | $0.0735 \text{mm}^2$   | -1.74%      |
| REMOTune-4 [19]  | [1, 0, 1, 0, 1, 0, 1, 2] | 1.0792           | 21.37%      | 10.656mW | 23.03%      | $0.0718 { m mm}^2$     | 0.49%       |
| <b>ROI-HIT-4</b> | [1, 0, 1, 0, 1, 0, 0, 2] | 1.1088           | 19.21%      | 10.311mW | 25.52%      | 0.0710 mm <sup>2</sup> | 1.61%       |
| Medium BOOM [6]  | [1, 1, 1, 0, 2, 0, 2, 0] | 0.9906           | 0.00%       | 17.018mW | 0.00%       | $0.0776 \text{mm}^2$   | 0.00%       |
| MORBO-4 [16]     | [1, 1, 1, 0, 1, 0, 0, 2] | 0.9575           | 3.35%       | 13.579mW | 20.21%      | $0.0772 \text{mm}^2$   | 0.44%       |
| REMOTune-4 [19]  | [2, 1, 2, 0, 1, 0, 0, 2] | 0.9488           | 4.23%       | 15.374mW | 9.66%       | $0.0790 \text{mm}^2$   | -1.85%      |
| ROI-HIT-4        | [1, 0, 2, 1, 2, 0, 0, 2] | 0.9877           | 0.30%       | 11.905mW | 30.05%      | 0.0755 mm <sup>2</sup> | 2.71%       |

<sup>\*</sup> Important Variables are the 8 most important design parameters obtained by variable selection. They are decodeWidth<sup>B</sup>, BPD<sup>B</sup>, intIssueWidth<sup>B</sup>, fetchWidth<sup>B</sup>, numIntPhysRegisters<sup>B</sup>, memIssueWidth<sup>B</sup>, numRasEntries<sup>B</sup>, and leakage\_power\_effort<sup>G</sup>. Parameters with a superscript B belong to BOOM parameters, while the parameter with a superscript G belongs to Genus parameters.

TABLE V Optimization Results and Corresponding Time for Rocket Microarchitecture DSE

| Algorithm                            | BOOM-Explorer<br>[13]  | GRL-DSE<br>[14]        | BODi<br>[18]           | MORBO-1<br>[16]      | REMOTune-1<br>[19]     | ROI-HIT-1                    | MORBO-4<br>[16]       | REMOTune-4<br>[19]   | ROI-HIT-4                    |
|--------------------------------------|------------------------|------------------------|------------------------|----------------------|------------------------|------------------------------|-----------------------|----------------------|------------------------------|
| $\uparrow$ HV                        | 6451                   | 6475                   | 6497                   | 6452                 | 6515                   | 6872                         | 6507                  | 6539                 | 7043                         |
| Improvement                          | 0.00%                  | 0.38%                  | 0.72%                  | 0.03%                | 1.00%                  | 6.54%                        | 0.87%                 | 1.37%                | 9.19%                        |
| $\downarrow \widehat{\mathrm{ADRS}}$ | 2457                   | 2140                   | 2299                   | 2449                 | 2146                   | 1303                         | 2172                  | 1935                 | 783                          |
| Improvement                          | 0.00%                  | 12.90%                 | 6.43%                  | 0.33%                | 12.66%                 | 46.97%                       | 11.60%                | 21.25%               | 68.13%                       |
| ↓ Min. CPI                           | 1.9192                 | 1.6104                 | 1.6121                 | 1.6070               | 1.6828                 | 1.6163                       | 1.6070                | 1.6768               | 1.6070                       |
| Improvement                          | 0.00%                  | 16.09%                 | 16.00%                 | 16.27%               | 12.32%                 | 15.78%                       | 16.27%                | 12.63%               | 16.27%                       |
| ↓ Min. Power                         | 4.756mW                | 3.935mW                | 3.919mW                | 3.823mW              | 3.877mW                | 3.735mW                      | 3.823mW               | 4.108mW              | 3.675mW                      |
| Improvement                          | 0.00%                  | 17.26%                 | 17.60%                 | 19.61%               | 18.48%                 | 21.46%                       | 19.61%                | 13.63%               | 22.72%                       |
| ↓ Min. Area                          | 0.0579 mm <sup>2</sup> | 0.0580 mm <sup>2</sup> | 0.0580 mm <sup>2</sup> | $0.0578 \text{mm}^2$ | 0.0580 mm <sup>2</sup> | <b>0.0577mm</b> <sup>2</sup> | 0.0578mm <sup>2</sup> | $0.0578 \text{mm}^2$ | <b>0.0577mm</b> <sup>2</sup> |
| Improvement                          | 0.00%                  | -0.25%                 | -0.11%                 | 0.12%                | -0.26%                 | 0.32%                        | 0.16%                 | 0.07%                | 0.32%                        |
| Opt. Time                            | 50062s                 | 14495s                 | 20977s                 | 2424s                | 3508s                  | 435s                         | 3717s                 | 3064s                | /                            |
| Speed up                             | $1.00 \times$          | $3.45 \times$          | $2.39 \times$          | $20.65 \times$       | $14.27 \times$         | 115.02 	imes                 | 13.47×                | $16.34 \times$       | /                            |
| # Sim                                | 16                     | 22                     | 21                     | 27                   | 29                     | 28                           | 96                    | 90                   | 98                           |
| Multiple                             | $1.00 \times$          | $1.38 \times$          | $1.31 \times$          | $1.69 \times$        | 1.81	imes              | $1.75 \times$                | $6.00 \times$         | $5.63 \times$        | 6.13×                        |

TABLE VI Pareto-Optimal Designs for Rocket Microarchitecture

| Design          | Important Variables*        | ↓ CPI  | Improvement | ↓ Power | Improvement | ↓ Area                       | Improvement |
|-----------------|-----------------------------|--------|-------------|---------|-------------|------------------------------|-------------|
| BigCore [7]     | [2, 0, 0, 3, 0, 0, 2, 0]    | 1.6121 | 0.00%       | 5.309mW | 0.00%       | 0.0608 mm <sup>2</sup>       | 0.00%       |
| MORBO-4 [16]    | [2, 1, 0, 2, 2, 1, 2, 2]    | 1.6239 | -0.73%      | 4.914mW | 7.45%       | 0.0594 mm <sup>2</sup>       | 2.32%       |
| REMOTune-4 [19] | [1, 0, 1, 3, 1, 0, 2, 0]    | 1.6919 | -4.95%      | 4.530mW | 14.67%      | <b>0.0586mm</b> <sup>2</sup> | 3.57%       |
| ROI-HIT-4       | [2, 1, 1, 2, 2, 2, 0, 0]    | 1.6070 | 0.32%       | 4.347mW | 18.13%      | $0.0603 \text{mm}^2$         | 0.81%       |
| MedCore [7]     | [0, 0, 0, 0, 0, 0, 0, 0, 0] | 2.2670 | 0.00%       | 5.064mW | 0.00%       | 0.0579 mm <sup>2</sup>       | 0.00%       |
| MORBO-4 [16]    | [0, 1, 0, 2, 1, 2, 0, 0]    | 1.9615 | 13.48%      | 3.823mW | 24.51%      | <b>0.0579mm</b> <sup>2</sup> | 0.13%       |
| REMOTune-4 [19] | [0, 1, 0, 2, 2, 2, 0, 1]    | 1.8823 | 16.97%      | 3.870mW | 23.59%      | 0.0581 mm <sup>2</sup>       | -0.28%      |
| ROI-HIT-4       | [0, 0, 0, 3, 1, 2, 1, 1]    | 1.8581 | 18.04%      | 3.782mW | 25.32%      | 0.0579 mm <sup>2</sup>       | 0.01%       |

<sup>&</sup>lt;sup>\*</sup> Important Variables are the 8 most important design parameters obtained by variable selection. They are nDCacheWays<sup>R</sup>, divUnroll<sup>R</sup>, nICacheWays<sup>R</sup>, mulUnroll<sup>R</sup>, syn\_opt\_effort<sup>G</sup>, leakage\_power\_effort<sup>G</sup>, nICacheTLBWays<sup>R</sup>, and dp\_analytical\_opt<sup>G</sup>. Parameters with a superscript R belong to Rocket parameters, while the parameters with a superscript G belong to Genus parameters.

<sup>802</sup> G belong to Genus parameters. Increasing the number of <sup>803</sup> nWays<sup>R</sup> and Unroll<sup>R</sup> improves CPI, but also increases power <sup>804</sup> consumption and area. But, logic synthesis tool parameters can <sup>805</sup> be adjusted to mitigate these tradeoffs, resulting in a design <sup>806</sup> that achieves better-power consumption and area results.

#### 807 E. Discussion

<sup>808</sup> Due to the time-consuming optimization of DKL-GP in <sup>809</sup> high dimensions, BOOM-Explorer's optimization time signifi-<sup>810</sup> cantly lags behind other methods. GRL-DSE exhibits mediocre <sup>811</sup> HV because training an accurate GNN for the entire high-<sup>812</sup> dimensional design space is challenging. BODi outperforms <sup>813</sup> BOOM-Explorer and GRL-DSE by employing dictionary <sup>814</sup> embedding, which is more suitable for high-dimensional <sup>815</sup> problems. However, it disregards the varying importance of design parameters, resulting in inferior performance compared to ROI-HIT. BODi's hill-climbing local search requires more acquisition evaluations than sampling methods in GRL-DSE, MORBO, and REMOTune, leading to a longer optimization time. Although MORBO and REMOTune are designed for high-dimensional multiobjective optimization, they achieve worse results than ROI-HIT in both sequential and batch modes because they ignore the relationship between design parameters and PPA results, ignoring the different importance of variables. Additionally, these trust region-based methods exhibit limited robustness, as demonstrated in experiments for RISC-V BOOM and Rocket microarchitectures.

Our proposed ROI-HIT achieves better PPA results by 829 focusing on promising ROIs and important variables. The 830 asynchronous batch technique in ROI-HIT enables it to explore 831 837

more points within the limited time budget. Moreover, our
approach provides designers with valuable insights into RISCV SoCs by identifying the important variables (key design
parameters). This information is beneficial for microarchitecture DSE.

## V. CONCLUSION

In this article, we propose an ROI-driven microarchitec-838 839 ture DSE method called ROI-HIT. This method exploits 840 ROIs acquired by a SOM Tree and then prunes unimportant variables based on a sensitivity matrix. By optimizing 841 842 inside the more promising ROIs and reduced dimensions, <sup>843</sup> this approach greatly improves the exploration efficiency <sup>844</sup> in high-dimensional DSE. For time-consuming VLSI flow 845 simulation, an asynchronous parallel strategy is employed 846 to make ROI-HIT explore more points in the limited time <sup>847</sup> budget. Experimental results conducted on the RISC-V BOOM 848 and Rocket microarchitectures demonstrate that our proposed 849 method achieves a 3.47%-9.19% improvement of HV within  $_{850}$  the same time budget and  $3.58 \times -341.68 \times$  speed-up of time <sup>851</sup> in reaching the same HV compared to the state-of-the-852 art methods. In terms of PPA metrics, ROI-HIT achieves <sup>853</sup> improvements of up to 43.82% in performance, 33.20% 854 in power consumption, and 11.41% in area. By leveraging 855 SOMT-based space partitioning, sensitivity matrix-based VS, 856 and an asynchronous parallel strategy, ROI-HIT can effec-<sup>857</sup> tively handle black-box optimization problems with large 858 parameter space, high dimensionality, and expensive simula-859 tions. We believe that ROI-HIT can find broader applications 860 in future research. The code of ROI-HIT is available at <sup>861</sup> https://github.com/zxy6541/ROI-HIT.

#### 862

#### REFERENCES

- [1] A. D. Pimentel, "Exploring exploration: A tutorial introduction to
   embedded systems design space exploration," *IEEE Design Test*, vol. 34,
   no. 1, pp. 77–90, Feb. 2017.
- [2] S. Greengard, "Will RISC-V revolutionize computing?" *Commun. ACM*, vol. 63, no. 5, pp. 30–32, 2020.
- [3] A. Amid et al., "Chipyard: Integrated design, simulation, and implementation framework for custom SoCs," *IEEE Micro*, vol. 40, no. 4, pp. 10–21, Aug. 2020.
- [4] J. Bachrach et al., "Chisel: Constructing hardware in a scala
  embedded language," in *Proc. Design Autom. Conf. (DAC)*, 2012,
  pp. 1212–1221.
- [5] C. Celio, D. A. Patterson, and K. Asanovic, "The Berkeley outof-order machine (boom): An industry-competitive, synthesizable, parameterized RISC-V processor," Dept. Elect. Eng. Comput. Sci., Univ. California, Berkeley, CA, USA, Rep. UCB/EECS-2015-167, 2015.
- [6] J. Zhao, B. Korpan, A. Gonzalez, and K. Asanovic, "SonicBOOM: The
   3rd generation Berkeley out-of-order machine," in *Proc. 4th Workshop Comput. Archit. Res. RISC-V*, 2020, pp. 1–7.
- [7] K. Asanovic et al., "The rocket chip generator," Dept. Elect. Eng.
   Comput. Sci., Univ. California, Berkeley, CA, USA, Rep. UCB/EECS 2016-17, 2016.
- [8] N. Binkert et al., "The gem5 simulator," ACM SIGARCH Comput. Archit.
   News, vol. 39, no. 2, pp. 1–7, 2011.

- H. Liew et al., "Hammer: A modular and reusable physical design 887 flow tool," in *Proc. 59th ACM/IEEE Design Autom. Conf. (DAC)*, 2022, 888 pp. 1335–1338.
- [10] W. Snyder, "Verilator 4.0: Open simulation goes multithreaded," 890 in Proc. Open Source Digit. Design Conf. (ORConf), 2018, 891 pp. 1–31.
- K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist s93 multiobjective genetic algorithm: NSGA-II," *IEEE Trans. Evol. Comput.*, 894 vol. 6, no. 2, pp. 182–197, Apr. 2002.
- [12] Q. Zhang and H. Li, "MOEA/D: A multiobjective evolutionary algorithm see based on decomposition," *IEEE Trans. Evol. Comput.*, vol. 11, no. 6, 897 pp. 712–731, Dec. 2007.
- [13] C. Bai, Q. Sun, J. Zhai, Y. Ma, B. Yu, and M. D. Wong, "BOOMexplorer: RISC-V BOOM microarchitecture design space exploration framework," in *Proc. IEEE/ACM Int. Conf. Comput. Aided Design* 901 (*ICCAD*), 2021, pp. 1–9. 902
- [14] X. Yi, J. Lu, X. Xiong, D. Xu, L. Shang, and F. Yang, "Graph 903 representation learning for microarchitecture design space explo- 904 ration," in *Proc. 60th ACM/IEEE Design Autom. Conf. (DAC)*, 2023, 905 pp. 1–6. 906
- [15] D. Eriksson, M. Pearce, J. Gardner, R. D. Turner, and M. Poloczek, 907 "Scalable global optimization via local Bayesian optimization," in 908 Proc. 33rd Conf. Neural Inf. Process. Syst. (NeurIPS), 2019, 909 pp. 1–12. 910
- [16] S. Daulton, D. Eriksson, M. Balandat, and E. Bakshy, "Multiobjective Bayesian optimization over high-dimensional search 912 spaces," in *Proc. 38th Conf. Uncertainty Artif. Intell. (UAI)*, 2022, 913 pp. 507–517. 914
- Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. De Feitas, "Bayesian 915 optimization in a billion dimensions via random embeddings," *J. Artif.* 916 *Intell. Res.*, vol. 55, no. 1, pp. 361–387, 2016.
- [18] A. Deshwal, S. Ament, M. Balandat, E. Bakshy, J. R. Doppa, and 918 D. Eriksson, "Bayesian optimization over high-dimensional combinatorial spaces via dictionary-based embeddings," in *Proc. Int. Conf. Artif.* 920 *Intell. Statist. (AISTATS)*, 2023, pp. 7021–7039. 921
- [19] S. Zheng, H. Geng, C. Bai, B. Yu, and M. D. Wong, "Boosting 922 VLSI design flow parameter tuning with random embedding and 923 multi-objective trust-region Bayesian optimization," ACM Trans. Design 924 Autom. Electron. Syst., vol. 28, no. 5, pp. 1–23, 2023. 925
- [20] C. E. Rasmussen, "Gaussian processes in machine learning," in 926 Advanced Lectures on Machine Learning. Tübingen, Germany: Springer, 927 2003. 928
- [21] M. T. Emmerich, K. C. Giannakoglou, and B. Naujoks, "Single-and 929 multiobjective evolutionary optimization assisted by Gaussian random field metamodels," *IEEE Trans. Evol. Comput.*, vol. 10, no. 4, 931 pp. 421–439, Aug. 2006. 932
- [22] S. Daulton, M. Balandat, and E. Bakshy, "Parallel Bayesian optimization 933 of multiple noisy objectives with expected hypervolume improve-934 ment," in *Advances in Neural Information Processing Systems*, 935 vol. 34, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and 936 J. W. Vaughan, Eds. Red Hook, NY, USA: Curran Associates, Inc., 2021, 937 pp. 2187–2200.
- [23] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, 939 pp. 1464–1480, Sep. 1990. 940
- [24] T. Kohonen, "Essentials of the self-organizing map," *Neural Netw.*, 941 vol. 37, pp. 52–65, Jan. 2013.
- [25] A. Zhao et al., "D3PBO: Dynamic domain decomposition-based 943 parallel Bayesian optimization for large-scale analog circuit sizing," 944 ACM Trans. Design Autom. Electron. Syst., vol. 29, no. 3, pp. 1–25, 945 2024. 946
- [26] L. Adjengue, C. Audet, and I. Ben Yahia, "A variance-based method 947 to rank input variables of the mesh adaptive direct search algorithm," 948 *Optim. Lett.*, vol. 8, pp. 1599–1610, Jun. 2014. 949
- [27] L. Song, K. Xue, X. Huang, and C. Qian, "Monte carlo tree search 950 based variable selection for high dimensional Bayesian optimization," 951 in *Proc. 36th Conf. Neural Inf. Process. Syst. (NeurIPS)*, 2022, 952 pp. 28488–28501. 953
- [28] L. T. Clark et al., "ASAP7: A 7-nm finFET predictive process design 954 kit," *Microelectron. J.*, vol. 53, pp. 105–115, Jul. 2016. 955