# GEAR: Graph-Evolving Aware Data Arranger to Enhance the Performance of Traversing Evolving Graphs on SCM

Wen-Yi Wang, Chun-Feng Wu<sup>®</sup>, *Member, IEEE*, Yun-Chih Chen<sup>®</sup>, *Member, IEEE*, Tei-Wei Kuo<sup>®</sup>, *Fellow, IEEE*, and Yuan-Hao Chang<sup>®</sup>, *Fellow, IEEE* 

Abstract-In the era of big data, social network services 2 continuously modify social connections, leading to dynamic and <sup>3</sup> evolving graph data structures. These evolving graphs, vital 4 for representing social relationships, pose significant memory 5 challenges as they grow over time. To address this, storage-class-6 memory (SCM) emerges as a cost-effective solution alongside 7 DRAM. However, contemporary graph evolution processes often <sup>8</sup> scatter neighboring vertices across multiple pages, causing weak 9 graph spatial locality and high-TLB misses during traversals. 10 This article introduces SCM-Based graph-evolving aware data 11 arranger (GEAR), a joint management middleware optimizing 12 data arrangement on SCMs to enhance graph traversal effi-13 ciency. SCM-based GEAR comprises multilevel page allocation, 14 locality-aware data placement, and dual-granularity wear lev-15 eling techniques. Multilevel page allocation prevents scattering 16 of neighbor vertices relying on managing each page in a finer-17 granularity, while locality-aware data placement reserves space 18 for future updates, maintaining strong graph spatial locality. The <sup>19</sup> dual-granularity wear leveler evenly distributes updates across 20 SCM pages with considering graph traversing characteristics. 21 Evaluation results demonstrate SCM-based GEAR's superiority, 22 achieving 23% to 70% reduction in traversal time compared to 23 state-of-the-art frameworks.

Index Terms—Checkpointing, evolving graph, graph, HW/SW
 Co-design, memory management, middleware, non-volatile
 memory, system software.

Manuscript received 4 August 2024; accepted 4 August 2024. This work was supported in part by the National Science and Technology Council under Grant 113-2628-E-A49-021, Grant 112-2222-E-A49-002-MY2, Grant 113-2223-E-001-001, Grant 111-2221-E-001-013-MY3, Grant 113-2927-I-001-502, and Grant 111-2923-E-002-014-MY3; in part by the Academia Sinica under Grant AS-IA-111-M01; and in part by the Ministry of Education under Yushan Young Fellow Program. This article was presented in the International Conference on 2024 and appears as part of the ESWEEK-TCAD special issue. This article was recommended by Associate Editor S. Dailey. (*Corresponding authors: Chun-Feng Wu; Yuan-Hao Chang.*)

Wen-Yi Wang is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan (e-mail: r09922090@ntu.edu.tw).

Chun-Feng Wu is with the Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan (e-mail: cfwu417@cs.nycu.edu.tw).

Yun-Chih Chen is with the Department of Computer Science, Technische Universitat Dortmund, 44227 Dortmund, Germany (e-mail: yunchih.chen@ tu-dortmund.de).

Tei-Wei Kuo is with the Delta Electronics, Department of Computer Science and Information Engineering, High Performance and Scientific Computing Center, and the Center of Data Intelligence: Technologies, Applications, and Systems, National Taiwan University, Taipei 10617, Taiwan (e-mail: ktw@csie.ntu.edu.tw).

Yuan-Hao Chang is with the Institute of Information Science, Academia Sinica, Taipei 115, Taiwan (e-mail: johnson@iis.sinica.edu.tw).

Digital Object Identifier 10.1109/TCAD.2024.3447222

## I. INTRODUCTION

COCIAL network services utilize graph data structures to 28 manage the connections between users. The relationship is 29 highly dynamic, with connections added, updated, or removed 30 at all times [1], [2]. As a result, the underlying graph data <sup>31</sup> structures are dynamic and change with time. These dynamic 32 graphs are known as evolving graphs: the connection between 33 any two users may not be static all the time [3]. As evolving 34 graphs grow over time, so does the system's memory demand. 35 Storage-class memory (SCM) [4], [5], [6], [7] can augment 36 DRAM to provide larger memory space at a lower price, 37 alleviating the need to constantly add more DRAM to meet 38 memory demand. Recent works in graph processing propose to 39 buffer graph updates in RAM before flushing them to SCMs in 40 batch [8], [9]. Our investigation reveals that such batch updates 41 can be inefficient, with vertice updates spread across multiple 42 batches and neighboring vertices spread across multiple pages. 43 As a result, this characteristic leads to weak graph spatial 44 locality, which may result in high-translation lookaside buffer 45 (TLB) misses during subsequent graph traversals. To improve graph traversing performance, this work proposes a joint 47 management middleware that take graph spatial locality into 48 account in the data placement policy on SCMs. 49

Major social network providers, such as Google [10], 50 Meta [11], and JingDong (JD.com) [2], have adopted graph 51 processing algorithms, such as page rank and graph neural 52 networks, to extract information from Web pages and social 53 networks. A distributed system is one option for storing 54 all graph data in memory. However, building an efficient 55 distributed system remains a challenge, especially for small companies, due to high deployment and maintenance costs, 57 load balancing, and fault tolerance. Out-of-core systems are 58 alternative architectures that run graph processing on a sin-59 gle consumer-level machine, supplementing limited memory 60 capacity with storage devices. Graph processing system based 61 on out-of-core architecture have gained significant attention in 62 the community [12], [13]. GraphChi [14] proposed breaking 63 down large graphs into small parts and storing them in storage 64 devices. Several works (e.g., FlashGraph [15], Graphene [16], 65 and GraphSSD [17]) have proposed to carefully manage Solid-State Drives by adopting some I/O request merging 67 or sophisticated buffering approaches with considering graph 68 access behaviors. In contrast to processing static graphs, 69

1937-4151 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

<sup>70</sup> these systems need a new data structure to track the most
<sup>71</sup> recent version of each vertex and edge in evolving graphs.
<sup>72</sup> Section II will provide a comprehensive overview of cutting<sup>73</sup> edge solutions and challenges.

However, we observed that state-of-the-art evolving graph 74 75 frameworks have poor graph spatial locality, which makes 76 them inefficient in executing graph traversal algorithms. We 77 proposed a joint management middleware between graph-78 evolving processing and memory devices (including both 79 DRAM and SCMs), called the SCM-Based Graph-Evolving ware Data ArrangeR (GEAR). Our goal is to arrange and 80 A 81 write the evolving graph data into SCMs while achieving <sup>82</sup> strong graph spatial locality. The SCM-Based GEAR has three 83 major components: 1) multilevel page allocation; 2) locality-<sup>84</sup> aware data placement; and 3) a dual-granularity wear leveler. 1) The main idea behind multilevel page allocation is 85 to prevent graph-evolving processes from scattering 86 neighbor vertices across different pages. Technically, the 87 system maintains multilevel size subpages and assigns 88 a suitable-size subpage to accommodate all neighbors 89 associated with each vertex, taking into account their 90

2) The locality-aware data placement reserves an unused
 area in each subpage for future graph updates to the cor responding vertex, ensuring strong graph spatial locality
 even as the graph evolves over time.

3) The dual-granularity wear leveler, in conjunction with our page allocation, distributes graph updates evenly across all memory pages on SCM during graph evolution. The evaluation results show that, compared to the state-of-the-art frameworks, our SCM-based GEAR can save the total execution time by 23%-70% when traversing an evolving graph.

The remainder of this article is organized as follows. No4 Section II elaborates the graph evolving processes and shows to the impact of the weak graph spatial locality on the graph traversal time. Section III provides the design concept and no7 implementation of the SCM-based GEAR. Section IV evaluno8 ates the proposed strategy. Finally, Section V concludes this no9 article.

#### 110 II. BACKGROUND, OBSERVATION, AND MOTIVATION

#### 111 A. Background

number.

91

*1) Evolving Graphs:* Graphs are commonly used to repre- sent the relationship between data points. In general, each node in a graph represents a data point, and the edge that connects two data points (or nodes) records their relationship. A graph is considered a *evolving graph* if its layout or edge weights change over time. Social networks, for example, are constantly evolving [18], [19] as new users join and connections are established frequently. To analyze an evolving graph over time, evolving graph processing systems take snapshots on a regular basis [20]. However, systems storing multiple full snapshots<sup>1</sup> may waste huge memory space to accommodate redundant data. Modern graph processing frameworks, like



Fig. 1. Data structure for evolving graphs.

LLAMA [21], use *delta snapshot* [22] to save memory <sup>124</sup> space by storing only the updated nodes or edges in each <sup>125</sup> snapshot. In other words, each delta snapshot only contains <sup>126</sup> graph updates (e.g., insertion, modification, and deletion) that <sup>127</sup> occurred after the previous delta snapshot, so all snapshots <sup>128</sup> must be read to traverse the entire graph. With support for <sup>129</sup> delta snapshots, evolving graph processing systems can not <sup>130</sup> only provide version control but also efficiently analyze graphs <sup>131</sup> in the time domain. In the rest of this article, "delta snapshot" <sup>132</sup> and "snapshot" are used interchangeably. <sup>133</sup>

An evolving graph is typically stored in the format of an 134 adjacency list [23], which is also applied to static graphs. An 135 adjacency list maintains a linked list for each vertex to chain all 136 correlated neighbors, and all updates from the same snapshot 137 are grouped in an array [20]. Its structures enables efficient 138 traversal all neighbors of any vertex in the adjacency list. 139 Fig. 1 shows an evolving graph and its adjacency list format. 140 The graph evolves to its third snapshot. The first snapshot 141 includes four insertions (i.e., (1, 3), (0, 3), (0, 1), and (0, 2), 142 with (1, 3) representing the newly inserted edge connecting 143 vertex 1 and 3. The second snapshot contains three insertions. 144 while the third snapshot has two insertions and two deletions. 145 It is worth noting that, in the evolving graph framework, 146 deleting an edge is typically translated into out-place updates. 147 Rather than removing the deleted edge directly, we create a 148 new edge with a negative sign. Out-place update not only 149 lowers the cost of fine-grained memory modification, but it 150 also makes it simple to go back to a previous version of the 151 evolving graph. 152

2) Storage Class Memory: As the graph continues to 153 evolve over time, the sheer volume of data within the graph 154 structure increases proportionally, leading to more frequent 155 and intensive data movements within systems. This growth in 156 data size poses significant challenges for memory management 157 and access efficiency [24]. Fortunately, recent advancements 158 in manufacturing technologies, such as 3-D X-point [25], 159 [26], [27], [28] and ultralow-latency NAND Flash [29], [30], 160 [31], [32]), have paved the way for the emergence of 161 SCM [33], [34]. These innovative memory solutions offer a 162 hybrid approach, combining the speed and byte-addressable 163 access of DRAM with the nonvolatility and higher density 164 of traditional storage devices. Several products and proto- 165 types have emerged to capitalize on these advancements, 166 including Intel<sup>®</sup> Optane<sup>TM</sup> Persistent Memory [35] and HPE 167 NVDIMM [6], [36]. 168

SCM represents a new category of memory devices that 169 combine the desirable characteristics of both DRAM and 170 traditional storage devices. These memory devices offer 171

 $<sup>{}^{1}</sup>A$  full snapshot is a snapshot, which contains the entire graph layout in the moment of taking snapshot.

byte-addressable access granularity with 64-B cacheline
accesses, ensuring efficient data retrieval and manipulation.
Additionally, SCMs feature nonvolatility, allowing data to be
retained even when power is removed, akin to storage devices.
Moreover, SCMs boast lower-unit costs (price/GB) compared
to DRAM and higher-storage density, providing up to 512-GB
per DIMM, making them an attractive solution for memoryintensive applications [37].

Moreover, the integration of SCM into computing systems has been further facilitated by its diverse connectivity options. In addition to occupying traditional DIMM channels, SCM can also be connected via PCIe channels, leveraging the compute express link (CXL) interconnection<sup>2</sup> [38], [39], [40], [41]. This flexibility in connectivity enables SCM to be seamlessly integrated into existing architectures, offering greater scalabilty and adaptability to evolving memory requirements. With SCM's ability to bridge the gap between DRAM and storage, computing systems can achieve enhanced performance and efficiency in handling the growing demands of evolving graph structures and other data-intensive workloads.

However, due to their slower performance and shorter 192 193 lifespan relative to DRAM, SCMs are typically utilized 194 as extensions of DRAM rather than as direct replace-<sup>195</sup> ments [42], [43]. In this hierarchical memory architecture, 196 frequently accessed data (such as inner nodes in tree-data 197 structure) resides in DRAM to leverage its faster access times <sup>198</sup> and lower latency [42]. Conversely, less frequently accessed <sup>199</sup> or large-scale data (such as leaf nodes in tree-data structure) 200 that exceeds DRAM capacity is stored in SCMs, allowing 201 for efficient use of available memory resources. One notable 202 advantage of SCMs is their direct accessibility by CPUs, 203 enabling seamless data transfer from SCMs directly into the 204 CPU cache in cacheline-sized chunks. This direct access 205 capability, discussed extensively in prior research [4], [44], 206 allows for efficient utilization of SCMs alongside DRAM, 207 mitigating the performance impact of slower SCM access <sup>208</sup> times by leveraging CPU cache mechanisms. As a result, SCM <sup>209</sup> adoption offers promising opportunities for improving memory <sup>210</sup> performance and scalability in modern computing systems.

## 211 B. Observation

*1) Delta Snapshots Break Graph Spatial Locality:* Delta snapshots can generate multiple data versions for the same graph, significantly increasing memory usage and necessitating larger memory devices. Another major problem with delta snapshots is a loss of spatial locality. As more snapshots are generated, neighboring vertices are scattered across multiple memory pages, significantly degrading graph traversal performance. In many graph processing algorithms, when a vertex is accessed, all of its 1-hop neighbor vertices are also accessed. Because these neighbors are updated at different times, they are stored on separate memory pages. Many graph processing frameworks write snapshots to pages in chronological order (i.e., by creation time). As a result, vertices



Fig. 2. Storing evolving graph on memory.

physically stored together may be logically distant from one <sup>225</sup> another. Consequently, compared to ideal placement, more <sup>226</sup> pages have to be read in a graph traversal to access each <sup>227</sup> vertex's neighbors, resulting in extra page table walks and TLB <sup>228</sup> accesses. <sup>229</sup>

Fig. 2 shows an example to illustrate the impact of graph 230 spatial locality. Each rectangle represents a 4-kB page, where 231 different gray levels indicate different snapshots. For example, 232 the darkest part implies all graph updates belonging to the 233 third snapshot. Besides, adj stands for an adjacency list, where 234 V0's adj means the adjacency list of V0. Assuming that 235 neighbor vertices belonging to V3 evolves during different 236 time period (e.g., t1, t2, and t3), those updated neighbor 237 vertices are scattered across three snapshots. In this case, it 238 requires 5 page accesses (marked by red lines) to explore 239 V3's neighbors, where each page access might cause 1 TLB 240 access and at most 4 memory accesses for walking page 241 tables. Even worse, the performance degradation becomes 242 more serious where systems shall explore most of the vertices 243 for traversing a graph, instead of exploring only one vertex. 244 Although some frameworks, such as LLAMA, can alleviate 245 the performance impact by periodically merging multiple snap- 246 shots, the performance of graph traversing becomes unstable 247 and fluctuates seriously. The reason is that, the graph traversing 248 reaches best performance right after running snapshots merg- 249 ing, but it becomes worse until triggering the next merging. 250

Fig. 2 demonstrates the impact of graph spatial locality. 251 Each rectangle represents a 4-kB page, and the different 252 gray levels indicate different snapshots. For example, the 253 darkest part denotes all graph updates from the third snapshot. 254 Furthermore, "adj" stands for an adjacency list, and "V0's adj" 255 denotes V0's adjacency list. Assuming that neighbor vertices 256 in V3 evolve over different time periods (e.g., t1, t2, and 257 t3), the updated neighbor vertices are distributed across three 258snapshots. In this case, it takes 5 page accesses (marked by 259 red lines) to traverse V3's neighbors, with each page access 260 potentially resulting in 1 TLB access and up to 4 memory 261 accesses for walking page tables. Even worse, performance 262 degradation worsens when systems must traverse all of the 263 vertices in order to traverse a graph, rather than just one. 264 Although some frameworks, such as LLAMA, can mitigate the 265 performance impact by periodically merging multiple snap- 266 shots, graph traversal performance still fluctuates significantly: 267 traversal performs best immediately after a snapshot merge, 268 but gradually degrades thereafter. 269

<sup>&</sup>lt;sup>2</sup>CXL is a high-speed interconnect technology that facilitates efficient communication between CPUs and accelerators, including memory devices, to enable heterogeneous computing architectures.



Fig. 3. Weak graph spatial locality hurts performance (Dataset: Friendster). (a) Execution time. (b) TLB miss rate. (c) Page utilization.

2) Performance Impact Under Weak Graph Spatial 270 271 Locality: We conducted a series of experiments to validate 272 our findings. Fig. 3 shows the performance results. We use 273 LLAMA [21] as an example, which is a representative 274 evolving graph framework. Without loss of generality, the 275 LLAMA merge frequency is set to every 500 snapshots. To <sup>276</sup> simulate graph evolution, we divide a large graph, Friendster,<sup>3</sup> 277 into 10000 snapshots, and the graph will eventually evolve 278 (or update) 10000 times. We evaluated graph traversal performance by running the Dijkstra algorithm every 100 snap-279 shots using two approaches. The first approach is LLAMA, 280 which stores snapshots in SCMs. The second approach is 281 282 called "optimal." It merges all snapshots in DRAM and <sup>283</sup> immediately rewrite a new graph to SCMs. This approach has 284 the strongest spatial locality but can suffer from high-update 285 overhead.

Fig. 3(a) and (b) show evaluation results for overall exe-286 287 cution time and TLB miss rate, respectively. The x-axis in both figures represents the number of archived snapshots. 288 289 Fig. 3(a) shows that the placement issue may significantly affect the execution time, with the system adopting LLAMA 290 spending 5 times more execution time than the system running 291 <sup>292</sup> the optimal approach. Running LLAMA breaks graph spatial 293 locality, causing the CPU to read extra pages, resulting in high-TLB misses and frequent page table walks, as shown in 294 <sup>295</sup> Fig. 3(b). Furthermore, it is obvious that the performance of running graph traversal is unstable when using LLAMA. This 296 unstable performance will degrade the user experience. Even 297 worse, frequently merging snapshots may result in frequent 298 <sup>299</sup> access to SCMs, which consumes additional energy.

The above experiment shows that weak graph spatial locality can reduce page utilization. The page utilization of each vertex is defined as the ideal memory size occupied by the vertex's neighbors divided by the memory size occupied by add the vertex's neighbors. For example, the total size of V1's adjacency list (i.e., all of V1's neighbors) is less than the size of one page, requiring only one memory page to store it. In reality, V1's page utilization is less than 10% because its neighbors are scattered across 10 memory pages.

Fig. 3(c) shows page utilization for a system with varying snapshots. The *x*-axis shows the number of snapshots owned snapshots. The *x*-axis shows the number of snapshots owned snapshots owned snapshots owned snapshots. To better demonstrate the trend, we divide page snapshots of the categories: 1) 0%–40%; 2) 41%–90%; and 3) 91%–100%. As the system generates more snapshots, <sup>314</sup> the number of vertices with page utilization between 91% and <sup>315</sup> 100% decreases significantly. <sup>316</sup>

317

330

#### C. Motivation

This work is strongly motivated by the need to improve <sup>318</sup> the traversing performance for the SCM-based evolving graph <sup>319</sup> systems by keeping strong graph spatial locality for all <sup>320</sup> vertices. We propose a joint management middleware that <sup>321</sup> performs both memory allocations and data placements for <sup>322</sup> evolving data while taking into account graph spatial locality. <sup>323</sup> The major technical challenges are 1) how to maintain strong <sup>324</sup> graph spatial locality while the graph evolves, and 2) how to <sup>325</sup> intelligently place and rewrite data on SCMs without causing <sup>326</sup> excessive energy consumption. <sup>327</sup>

## III. SCM-BASED GRAPH-EVOLVING AWARE 328 DATA ARRANGER 329

#### A. Overview

This section introduces our SCM-Based GEAR, designed <sup>331</sup> to maintain strong graph spatial locality by consolidating <sup>332</sup> all neighbors of each vertex on the SCM while minimiz- <sup>333</sup> ing energy consumption. Technically, SCM-based GEAR <sup>334</sup> serves as middleware between the graph application and <sup>335</sup> the SCM device, bridging the information gap between <sup>336</sup> them. Implementing GEAR as middleware not only facilitates <sup>337</sup> information exchange but also ensures high compatibility, <sup>338</sup> avoiding the need to modify either the application or the <sup>339</sup> devices. Fig. 4 provides an overview of our design, which <sup>340</sup> comprises four key components: 1) multilevel page allocation; <sup>341</sup> 2) locality-aware data placement; 3) dual-granularity wear <sup>342</sup> leveler; and 4) graph updates accumulation. <sup>343</sup>

Our multilevel page allocation component partitions and <sup>344</sup> allocates SCM memory areas to store all neighbors associated <sup>345</sup> with each vertex. Each vertex's degree (the number of the <sup>346</sup> vertex's edges) determines the size of its allocated SCM <sup>347</sup> memory area. Furthermore, our locality-aware data placement <sup>348</sup> mechanism ensures that all evolved graph data (i.e., newly <sup>349</sup> updated edges) related to the same source vertex are stored <sup>350</sup> in the corresponding SCM memory area, thereby preserving <sup>351</sup> strong graph spatial locality. Additionally, our dual-granularity <sup>352</sup> evenly distribute graph updates across all memory pages in <sup>354</sup> SCM during graph evolution. <sup>355</sup>

<sup>&</sup>lt;sup>3</sup>The dataset is from Stanford network analysis project (SNAP) [45].



Fig. 4. System architecture.

| i i i i i i i i i i i i i i i i i i i | A 64B subp  | uge                                        |      |     |             |              |                |              |             |
|---------------------------------------|-------------|--------------------------------------------|------|-----|-------------|--------------|----------------|--------------|-------------|
| 64B page                              | 64B         | 64B                                        | 64B  | 64B | 64B         | 64B          | 64B            | 64B          | x 64        |
|                                       | A 128B sub  | page                                       |      |     |             |              |                |              |             |
| 128B page                             | 128B        |                                            | 128B |     | 128B        |              | 128B           |              | x 32        |
|                                       | A 256B sub  | page                                       |      |     |             |              |                |              |             |
| 256B page                             | 256B        |                                            |      |     | 256B        |              |                |              | x 16        |
| 512B, 1024E                           | 3, 2048B pa | ge :                                       |      | S   | trong grapl | h spatial lo | cality: Let al | II neighbors | of each ver |
|                                       | A 4KB page  | 4KB page stored in a subpage even after th |      |     |             |              |                | graph has e  | volved over |
| 4096B page                            |             |                                            |      |     | 4096B x 1   |              |                |              |             |

Fig. 5. Example of multilevel page allocation.

Finally, the graph updates accumulation policy buffers 356 <sup>357</sup> incoming graph updates in DRAM. It employs a data structure 358 called an edge log array to facilitate quick querying of these 359 new edges without traversing the entire graph. Because SCM <sup>360</sup> has a higher-write latency than DRAM, our design prioritizes staging new graph updates in DRAM for quick ingestion. 361 The buffered data are subsequently transferred to SCMs in 362 batches, referred to as snapshots. The edge log array maintains 363 incoming graph updates in a first-in-first-out (FIFO) manner, 364 with each update containing three fields: 1) the source vertex; 365 the destination vertex; and 3) the edge weight between them. 366 2) <sup>367</sup> It is worth noting that such stage-and-flush design is widely <sup>368</sup> used in many graph systems, so we will not go into specific 369 design details.

#### 370 B. Multilevel Page Allocation

SCM-based GEAR aims to maintain strong graph spatial sr2 locality by consolidating all neighbors belonging to each world graphs, hub vertices, which are those with extremely sr5 high degrees, have significantly more neighbors than nonhub vertices. Celebrities in social networks are an excellent examsr7 ple of a hub vertice: their graph neighbors can be hundreds, sr6 if not thousands, of times more than regular users (nonhub sr9 vertices).

Traditionally, most systems allocate memory areas (or pages) of 4 kB. To reduce maintenance costs of graphs vevolution, it is common to allocate a 4-kB page for each hub or nonhub vertex. However, this allocation results in low-page utilization. For example, assume that storing one neighbor edge requires approximately 8 bytes (including the index of <sup>385</sup> the neighbor vertex and the edge weight). Then, storing a <sup>386</sup> nonhub vertex with 100 neighbor edges would only require <sup>387</sup> 800 bytes, well below the 4-kB capacity. Even if the combined <sup>388</sup> neighbors of some hub vertices can fill a 4-kB page, the <sup>389</sup> memory requirement might expand over time and no longer fit <sup>390</sup> within the 4-kB memory area as the graph evolves. A simple <sup>391</sup> solution would be to divide a 4-kB page into smaller sizes, <sup>392</sup> but this would require significant maintenance overhead and <sup>393</sup> result in severe space fragmentation. <sup>394</sup>

The multilevel page allocation strategy in SCM-based 395 GEAR relies on two fundamental principles. First, it aims 396 to minimize maintenance overhead by organizing memory 397 areas into sizes aligned with seven predefined levels (each a 398 power of two in size): 64, 128, 256, 512, 1024, 2048, and 399 4096 B. To provide a clearer understanding of this concept, 400 Fig. 5 visually illustrates the relationship between a 4-kB page 401 and its potential partitioned levels. For instance, a 4-kB page 402 can be partitioned into 64 64-B subpages, with each subpage 403 dedicated to storing neighbors from the same vertex. Second, 404 the allocation process chooses an appropriate memory area 405 size from among the available options based on the vertex's 406 degree. This adaptive approach ensures that memory allocation 407 is tailored to the each vertex's specific characteristics, resulting 408 in improved performance and resource utilization. 409

GEAR uses the mmap system call to obtain multiple 4-kB <sup>410</sup> pages from the operating system (OS). The multilevel page <sup>411</sup> allocation partitions each 4-kB page into identically sized <sup>412</sup> subpages that fall into one of seven predefined levels. The <sup>413</sup> required size for storing all neighbors associated with a vertex <sup>414</sup> is estimated using the vertex's degree, and a subpage that <sup>415</sup> meets or exceeds this requirement is allocated. This design has <sup>416</sup> a low overhead for multilevel page allocation, requiring only a <sup>417</sup> few extra bits per subpage to find a subpage's location within <sup>418</sup> a 4-kB page. Furthermore, it mitigates the fragmentation issue <sup>419</sup> of fixed-size memory areas. Additionally, the alignment of <sup>420</sup> subpage sizes with the CPU cacheline<sup>4</sup> (64 B) ensures that <sup>421</sup> unused data is not transferred from SCMs to the CPU cache, <sup>422</sup> thereby optimizing data transfer efficiency. <sup>423</sup>

Fig. 6(a) and (b) show the data structures used by GEAR 424 to manage the mapping between each 4-kB page and its 425 subpages. The page metadata table [Fig. 6(a)] stores all rele-426 vant information about each 4-kB page. The granularity flag 427 indicates the size of the corresponding subpage, represented by 428 a 3-bit binary number (for example, a subpage size of 2048 B 429 is denoted as 110). The empty flag indicates whether or not 430 the subpage has been allocated.

The available page lists [Fig. 6(b)] consist of seven arrays, <sup>432</sup> which include a free page list and six size-specific available <sup>433</sup> page lists. The free page list contains all 4-kB pages that <sup>434</sup> have not yet been divided into subpages. Each size-specific <sup>435</sup> available page list corresponds to one of the six subpage levels <sup>436</sup> (64 to 2048 B), maintaining all associated 4-kB pages with <sup>437</sup> unallocated subpages. This design requires only a 4-byte page <sup>438</sup> index to track each page. For example, given a 1-GB SCM, <sup>439</sup>

<sup>&</sup>lt;sup>4</sup>A cacheline is the smallest unit of data that can be transferred between main memory and the CPU cache.



Fig. 6. Data structures for maintaining evolving graph data on SCMs. (a) Page metadata table. (b) Available page list. (c) Vertex-to-page table.

<sup>440</sup> there are 262, 144 4-kB pages, and the overall seven arrays <sup>441</sup> consume 1 MB (i.e., 262 and 144 pages  $\times$  4 B).

Lastly, GEAR features a vertex-to-page table [Fig. 6(c)] to track the relationship between each vertex and its associated page information. This includes the 1-byte vertex's subpage index, 4-byte page index, and 2-byte vertex degree. Based on calculations, the combined space overhead of all three data our calculations, the combined space overhead of all three data tructures (i.e., vertex-to-page table, available page list, and page metadata table) accounts for less than 5% of the total graph size.

Let us use an example to demonstrate the page allocation process. To assign a 2048-B subpage to a vertex [e.g., V0 in Fig. 6], the process starts by checking the 2048-B available page list. If it is empty, the system chooses a page from the free page list. The metadata table is then updated, with the size as 2048 B. The vertex-to-page table is then updated to associate V0 with the allocated page index, with the subpage index set to 0 and the degree recorded as 220. This allows for efficient management and retrieval of graph data during evolution and traversal.

#### 461 C. Locality-Aware Data Placement

The locality-aware data placement strategy aims to maintain strong graph spatial locality while transferring accumulated graph updates from DRAM to SCMs to generate a snapshot. As part of this strategy, the multilevel page allocation ensures that each vertex's subpage is sufficiently sized<sup>5</sup> to accommodate all its neighbors. Consequently, each subpage typically contains unused space, known as the reserved area. This reserved area serves as a designated space for future prograph updates to different vertices remain segregated, thus preserving transferred area serves all vertices.

When incorporating a new graph update into a subpage's reserved area, two scenarios may occur: 1) the reserved area for the targeted subpage is either sufficient (i.e., not full) or market for the targeted subpage is either sufficient (i.e., not full) or market for the targeted subpage is either sufficient (i.e., not full) or market for the targeted subpage is either sufficient (i.e., not full) or market for the targeted subpage is either sufficient (i.e., not full) or market for the targeted subpage is either sufficient (i.e., not full) or market for the targeted subpage is either sufficient, the market for the targeted subpage is either subficient, our market for a contrast, if the reserved area is insufficient, our market for approach requires rewriting all previous data within the submarket for a page, including the entire adjacency list, to a larger subpage. This ensures that the most recent updates are accommodated while preserving strong graph spatial locality for each vertex. Even for node deletion, the graph system generates a new graph update, as explained in Section II-A1. That is, whenever



Fig. 7. Two scenarios for the reserved area. (a) Reserved area is not full. (b) Reserved area is full, rewrite data to a larger subpage.

a neighbor is removed from a vertex, a new edge with a 485 negative value is appended to the adjacency list.

For instance, Fig. 7 shows how locality-aware data placement works when writing all graph updates associated with 488 source vertex V0 to the SCM. The notation " $(0, 1, W_{0,1})$ " 489 means the edge value between source vertex V0 and its neighbor vertex V1 is updated to  $W_{0,1}$ . There are two cases: when 491 the corresponding reserved area in the SCM is insufficient or 492 sufficient. In both cases, all graph updates are buffered in the edge log array in DRAM. In the case where the reserved area is sufficient, as depicted in Fig. 7(a), our policy directs the 495 writing of all graph updates associated with vertex V0 to the corresponding subpage, which belongs to the 128-B level, in 497 the SCM.

On the other hand, in Fig. 7(b), the reserved area of the <sup>499</sup> subpage associated with vertex V0 lacks enough free space to <sup>500</sup> accommodate graph updates associated with vertex V0. Given <sup>501</sup> that the adjacency list of vertex V0 was originally stored in <sup>502</sup> a 128-B subpage, the data placement mechanism collaborates <sup>503</sup> with the multilevel page allocation to obtain an empty 256-B <sup>504</sup> subpage capable of storing both the old adjacency list and all <sup>505</sup> new updates for vertex V0. Subsequently, the old adjacency <sup>506</sup> list of vertex V0, along with its latest updates from DRAM, is <sup>507</sup> transferred and rewritten to the newly allocated 256-B subpage in the SCM. <sup>509</sup>

It is important to point out that our strategy only rewrites 510 subpages with insufficient reserved area, rather than rewriting 511 4096-B subpages equivalent to a normal page. Consequently, 512 compared to the merging strategy employed by state-of-the-art 513 frameworks, our strategy achieves strong graph spatial locality 514 for each vertex with fewer writes. 515

## D. Dual-Granularity Wear Leveler

Data updates on real-world graphs exhibit a high degree of 517 skew, a phenomenon well-documented in [46], [47], and [48]. 518

<sup>&</sup>lt;sup>5</sup>The size of the subpage must be greater than or equal to the space currently occupied by all neighbors belonging to the vertex.



Fig. 8. Interpage wear-leveling mechanism.

519 This skew is primarily attributed to hub vertices that 520 are densely connected to numerous neighboring vertices. Consequently, these hub vertices undergo more frequent 521 <sup>522</sup> updates compared to other vertices. Such skewed updates pose significant challenge in the context of SCM, which has 523 a limited lifetime. Moreover, our design's manipulation of 524 a <sup>525</sup> subpage allocation introduces a further layer of complexity, potentially resulting in disparate write counts among subpages 526 within the same 4-kB memory page. This disparity exacerbates the wear leveling issue, necessitating a comprehensive 528 approach to address wear leveling not only across all 4-kB 529 530 pages but also within each 4-kB page. To tackle this challenge 531 comprehensively, we propose a dual-granularity wear leveler 532 comprising both an interpage wear-leveling mechanism and an intrapage wear-leveling mechanism. 533

We design the interpage wear-leveling mechanism to ensure 534 uniform distribution of write counts across all memory 535 a 536 pages. The main idea is to consistently select the healthier 537 page during memory allocation. To facilitate this process, we 538 maintain a per-page write count for each memory page in <sup>539</sup> the page metadata table, as depicted in Fig. 6(a). Technically, e use a multilevel page list, where each page list bounds 540 w the minimum remaining write counts for each page within 541 542 it. The minimum remaining write count associated with the 543 highest level is determined based on the ideal lifetime of the 544 SCM device. To reduce maintenance overhead, we categorize 545 the minimum remaining write count for each level in an 546 exponential manner. For example, if an SCM device can <sup>547</sup> endure at most 10<sup>8</sup> write accesses per cell, our mechanism 548 configures the page list into six levels: 1)  $10^7$ ; 2)  $10^6$ ; 3)  $10^5$ ;  $_{549}$  4) 10<sup>4</sup>; 5) 10<sup>3</sup>; and 6) 10<sup>2</sup>, as illustrated in Fig. 8. The number 550 of each level denotes the minimum remaining write count. The remaining write count for each page is calculated as  $10^8$ 551 <sup>552</sup> minuses the page write count. For example, 10<sup>7</sup> indicates 553 that the page belonging to this level can withstand at least  $_{554}$  10<sup>7</sup> more write operations. This meticulous categorization 555 ensures an even distribution of write operations across memory 556 pages, thereby effectively mitigating wear-leveling issues at 557 the interpage level.

As detailed in Section III-C, insufficient space in a subpage designated for storing graph updates the rewriting of all data from the original subpage to a larger subpage. Such movement for of vertices between subpages within the same 4-kB page can lead to wear-unleveling issues. To address this concern, we introduce an intrapage wear-leveling mechanism, which



Fig. 9. Intrapage wear-leveling mechanism. (a) Available page: Write flag == 0 and empty flag == 0. (b) No available page. (c) Reset write flag and per page write count + 1.

is implemented by maintaining a 1-bit write flag and a 1-bit 564 empty flag in the page metadata table for each subpage. The 565 write flag records whether the subpage has been written in the 566 current round, with 1 indicating that it has been written and 567 0 indicating otherwise. Additionally, the empty flag denotes 568 whether the subpage is currently used by a vertex's adjacency 569 list, with 1 indicating used and 0 indicating availability. 570

As shown in Fig. 9(a), we only allocate a subpage when <sup>571</sup> both the write flag and empty flag are 0, to ensure balanced <sup>572</sup> write counts across all subpages within the same 4-kB page. <sup>573</sup> When none of the pages in the available page list have <sup>574</sup> available subpages, it indicates most of the subpages in these <sup>575</sup> pages were written during this round. Thus, we reset all <sup>576</sup> the write flags of the available page list to 0 and increment <sup>577</sup> the per-page write count by 1. Subsequently, all subpages <sup>578</sup> become available again, as depicted in Fig. 9(b) and (c). <sup>579</sup> This approach not only maximizes the utilization of available <sup>580</sup> subpages but also ensures the amortization of write counts <sup>581</sup> across all subpages within the same 4-kB page, effectively <sup>582</sup> mitigating wear leveling issues at the intrapage level. <sup>583</sup>

#### IV. PERFORMANCE EVALUATION

584

585

#### A. Evaluation Setup and Performance Metrics

This section evaluates the efficacy of GEAR in enhancing 586 the performance of both graph traversal and graph evolution. 587 We thoroughly compared SCM-based GEAR to three baseline 588 approaches, checking their performance in a number of areas, 589 such as execution time, TLB miss rate, CPU cache miss 590 rate, energy use, and the number of writes to the SCM. 591 The three baseline approaches consist of two state-of-the-art 592 evolving graph processing frameworks: 1) LLAMA [21], configured with merge frequencies set to 100 and 500 snapshots 594 and 2) GraphOne [23], which incorporates cache-line-sized 595 memory allocation and hub vertex handling. GraphOne's 596 memory allocation strategy provides a cache-line-sized (i.e., 597 64 bytes) area for storing nonhub vertices, while its hub vertex 598



Fig. 10. Normalized execution time of running graph algorithms on evolving graphs. (a) Dataset: Orkut. (b) Dataset: Twitter-2010. (c) Dataset: Friendster.

<sup>599</sup> handling allocates a 4-kB page to accommodate all edges <sup>600</sup> belonging to a hub vertex. It is noteworthy that, according to <sup>601</sup> the literature, GraphOne includes delta checkpointing similar <sup>602</sup> to LLAMA, but GraphOne does not enable checkpointing by <sup>603</sup> default. Without loss of generality, all of the three baseline <sup>604</sup> approaches place frequently accessed data in DRAM and less <sup>605</sup> frequently accessed data in SCM.

To ensure a comprehensive evaluation, we selected three 606 <sub>607</sub> representative datasets from the SNAP [45]: 1) Orkut: 2) Twitter-2010; and 3) Friendster. The Orkut dataset encom-608 passes 3 million vertices and 0.23 billion edges. The Twitter 609 2010 dataset comprises 40 million vertices and 1.5 billion 610 edges. Lastly, the Friendster dataset includes 56 million 611 vertices and 2.6 billion edges. We selected these datasets to 612 offer a wide variety of graph sizes and complexities, enabling 614 a thorough assessment of the performance of SCM-based 615 GEAR.

We segment each graph dataset into 10 000 snapshots to simulate the graph evolution process. We execute three widely used graph traversal algorithms—breadth-first search (BFS), Dijkstra (single source shortest path algorithm), and Random Walk algorithms—on the evolving graph at intervals of 100 snapshots. We capture memory traces during the traversal and subsequently replay them on our trace-based simulator. Our simulator simulates an Intel Skylake architecture with a fully associative TLB comprising 1536 entries and an 8 624 MB, 16-way associative L3 cache [49], [50]. The read/write 625 latency for DRAM and SCMs is set to 50/50 and 120/150 626 ns, respectively, based on previous studies [6]. Additionally, it 627 accounts for the energy consumption associated with writing 628 a bit to the SCM, estimated at 16.82 pJ per bit [51]. The 629 simulation environment is hosted on a server featuring an Intel 630 Xeon Gold 6252n CPU, 768 GB of DRAM, and running Linux 631 kernel version 5.4. This setup ensures a realistic emulation 632 evolving graph scenarios, enabling a thorough evaluation of 634 SCM-based GEAR and baseline strategies.

## B. Evaluation Results

1) Performance Evaluation of Graph Traversal: In this 637 section, we focus on demonstrating the performance of 638 traversing an evolved graph. Unlike graph evolution, graph 639 traversal does not require additional data writes and therefore 640 does not impact the system's lifetime. Fig. 10(c) presents the 641 results of the total execution time when running SCM-based 642 GEAR against the three baseline approaches. Specifically, 643 Fig. 10(a) and (b) depict the results obtained from executing 644 traversal algorithms on the Twitter-2010 and Friendster datasets, 645 respectively. The *x*-axis of each figure represents the number 646



Fig. 11. TLB and CPU cache miss rate of running graph traversal algorithms (DataSet: Friendster). (a) TLB miss rate. (b) CPU cache miss rate.

<sup>647</sup> of archived snapshots, while the *y*-axis displays the execution <sup>648</sup> time normalized to that of GraphOne. Due to the similarity in <sup>649</sup> performance trends across all 10 000 archived snapshots, we <sup>650</sup> only present results for the interval between 2000 and 2500 <sup>651</sup> archived snapshots.

The results reveal that SCM-based GEAR achieves exe-652 653 cution time savings ranging from 23% to 70% compared to 654 GraphOne, 0% to 74% compared to LLAMA (merge 500), and 0% to 27% compared to LLAMA (merge 100). These savings 655 656 are attributed to SCM-based GEAR's ability to maintain strong graph spatial locality, leading to fewer TLB misses when 657 658 accessing pages during graph traversal. Notably, SCM-based GEAR achieves an execution time reduction comparable to 659 660 LLAMA, especially when the graph algorithm executes immediately after LLAMA triggers snapshot merging. However, 661 LLAMA's performance may exhibit instability, and frequent 662 snapshot merging, such as every 100 snapshots, can lead to 663 664 excessive energy consumption (further details are provided in 665 Section IV-B2).

To provide a detailed breakdown evaluation, Fig. 11 show-666 667 cases the TLB miss rate and CPU cache miss rate results obtained when running SCM-based GEAR against the three 668 baseline approaches on the Friendster dataset. In each figure, 669 the x-axis represents the number of archived snapshots, while 670 the y-axis depicts the TLB miss rate in Fig. 11(a) and the CPU 671  $_{672}$  cache miss rate in Fig. 11(b). It is evident from the figures that SCM-based GEAR consistently maintains a relatively low-673 TLB miss rate and CPU cache miss rate across all numbers 674 675 of snapshots. Conversely, the TLB miss rate and CPU cache miss rate observed in systems running GraphOne and LLAMA 676 677 exhibit fluctuations, occasionally exceeding 90% (except the 678 CPU cache miss rate caused by running a random walk), 679 depending on the number of snapshots created. GraphOne 680 experiences exceptionally high-TLB miss rates and CPU cache miss rates due to the absence of a snapshot merging strategy to preserve graph spatial locality during graph evolution. In 662 contrast, LLAMA (merge 500) achieves relatively low-TLB 663 miss rates and CPU cache miss rates every 500 snapshots 684 when the snapshot merging strategy is executed, but the rate 665 steadily increases to around 90%. Employing LLAMA with 6666 frequent snapshot merging, such as LLAMA (merge 100), can 667 mitigate the occurrence of excessively high-TLB miss rates. 6689 However, the frequent merging strategy significantly prolongs 6699 the graph evolution process and, even worse, adversely affects 6690 the SCM's lifespan. More detailed evaluations of evolving time 691 and memory endurance will be presented in the subsequent 692 subsections. 693

2) Performance and Lifetime Evaluation of Graph Evolution: 694 Fig. 12 presents a comprehensive evaluation of both the 695 performance and lifetime aspects of graph evolution. In 696 Fig. 12(a), the time taken to evolve the graph to a specific 697 number of snapshots using different approaches is depicted. 698 The x-axis ranges from 1000 to 6000, representing the number 699 of snapshots, while the y-axis indicates the time for graph 700 evolution normalized to GraphOne. The evaluation shows 701 that systems running SCM-based GEAR exhibit superior 702 evolving performance compared to LLAMA due to GEAR's 703 lower-time complexity. Conversely, LLAMA incurs greater 704 time consumption due to the periodic merging of snapshots, 705 necessitating the rewriting of all snapshots. For example, 706 LLAMA (merge 100) causes a longer graph evolution time 707 than LLAMA (merge 500) because snapshot merging is 708 triggered more frequently. Furthermore, SCM's high-write 709 latency contributes to the extended time required for graph 710 evolution. 711

For a more detailed analysis of the graph evolution 712 performance, Fig. 12(b) illustrates the total edge write counts 713 to SCM for each system at intervals of 50 snapshots between 714 2000 and 2500 snapshots. The *x*-axis represents the number 715 of archived snapshots, while the *y*-axis indicates the total edge 716



Fig. 12. Performance and lifetime evaluation on evolving a graph. (a) Time spent evolving the graph (DataSet: twitter). (b) Total number of edge writes. (c) Maximum write counts among all pages.

717 writes normalized to GraphOne. GraphOne, which does not 718 incorporate snapshot merging in our experiments, does not 719 generate extra writes. In contrast, SCM-based GEAR may 720 produce more edge writes than GraphOne if the reserved 721 space of a vertex is insufficient, leading to the rewriting 722 of the original adjacency list to a newly allocated larger <sub>723</sub> subpage along with the latest updates. However, because the CM-based GEAR only rewrites vertices with inadequate 724 S reserved space and limits the maximum rewriting size to 2048 725 bytes, its total edge writes are only about 2.1 times higher 726 than GraphOne's. Importantly, GEAR's total edge writes are 727 significantly lower than those of LLAMA, which merges 728 729 snapshots on a regular basis.

While LLAMA may achieve faster graph traversal by 730 731 merging snapshots more frequently, this has a significant impact on SCM's lifespan. Fig. 12(c) illustrates the normalized 732 733 maximum page write count as the system generates snapshots <sub>734</sub> ranging from 1000 to 10000. The x-axis denotes the number 735 of snapshots, while the y-axis represents the maximum page rite count normalized to GraphOne. The results demonstrate 736 737 that our dual-granularity wear leveler is effective in mitigating 738 the increase in maximum page write count. This effectiveness 739 stems from the distribution of writes to a finer granularity, 740 specifically at the subpage level. Conversely, when LLAMA 741 frequently merges snapshots, the maximum page write count 742 experiences a sharp escalation, as observed in the case of LLAMA (merge 100). 743

3) Evaluation on Energy Consumption: We further evalu-744 745 ate the energy consumption associated with different graph <sup>746</sup> evolution approaches, as depicted in Fig. 13. The x-axis rep-747 resents the number of snapshots ranging from 1000 to 10000, while the y-axis indicates the energy consumption. Each plot 748 749 in the figure illustrates the cumulative energy consumption 750 required to execute the total number of snapshots indicated on <sup>751</sup> the x-axis. Fig. 13 highlights that SCM-based GEAR exhibits 752 relatively low-energy consumption compared to LLAMA. This is primarily because the rewrite operations triggered by 753 SCM-based GEAR result in fewer write accesses on SCMs 754 755 compared to the merging operations triggered by LLAMA. 756 Notably, GraphOne, which does not perform any snapshot merging operations, consumes the least energy among all 757 758 solutions. Although SCM-based GEAR consumes more energy than GraphOne, its scalability remains intact. This is evidenced 759 760 by the consistent energy consumption gap between GraphOne <sup>761</sup> and SCM-based GEAR, even as the graph evolves over time.

#### **Energy Consumption**



Number of Snapshots

Fig. 13. Energy consumption (Dataset: Friendster).

In contrast, LLAMA's energy consumption exhibits a linear 762 increase as the graph evolves, making it nonscalable. For 763 instance, LLAMA (merge 100) consumes 3 and 21 times more 764 energy than SCM-based GEAR when there are 1000 snapshots 765 and 10 000 snapshots, respectively. 766

Our research addresses the challenge of weak graph spatial 768 locality in evolving graph frameworks, which hinders efficient 769 execution of graph traversal algorithms. To mitigate this 770 issue, we introduce SCM-Based GEAR, a joint management 771 middleware that optimizes the arrangement and storage of 772 evolving graph data in both DRAM and SCMs. GEAR 773 comprises multilevel page allocation, locality-aware data 774 placement, and dual-granularity wear leveling components. 775 GEAR improves graph traversal performance while maintaining 776 strong graph spatial locality as the graph changes. It does this by 777 allocating subpages based on vertex-neighboring relationships, 778 keeping unused areas for future updates, and evenly spreading 779 write operations. Our evaluation demonstrates the effectiveness 780 of SCM-based GEAR, showing significant improvements in 781 execution time savings ranging from 23% to 70% compared to 782 state-of-the-art frameworks. Through meticulous management 783 of evolving graph data across memory devices, GEAR achieves 784 superior performance in traversing evolving graphs, addressing 785 critical challenges posed by weak graph spatial locality. 786

#### References

C.-F. Wu, C.-J. Wu, G.-Y. Wei, and D. Brooks, "A joint management 788 middleware to improve training performance of deep recommendation 789 systems with SSDs," in *Proc. 59th ACM/IEEE Design Autom. Conf.*, 790 2022, pp. 157–162.

- [2] W. Fan et al., "Graph neural networks for social recommendation," in
   *Proc. World Wide Web Conf.*, 2019, pp. 417–426.
- [3] J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graph evolution:
  Densification and shrinking diameters," *ACM Trans. Knowl. Discov. Data*, vol. 1, no. 1, p. 2, 2007.
- [4] G. Dhiman, R. Ayoub, and T. Rosing, "PDRAM: A hybrid PRAM and DRAM main memory system," in *Proc. 46th Annu. Design Autom. Conf.*,
- 2009, pp. 664–469.
  [5] Y.-C. Chen, C.-F. Wu, Y.-H. Chang, and T.-W. Kuo, "Exploring
- synchronous page fault handling," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 41, no. 11, pp. 3791–3802, Nov. 2022.
- [6] C.-F. Wu, Y.-H. Chang, M.-C. Yang, and T.-W. Kuo, "Joint management
   of CPU and NVDIMM for breaking down the great memory wall," *IEEE Trans. Comput.*, vol. 69, no. 5, pp. 722–733, May 2020.
- [7] C. J. Xue, G. Sun, Y. Zhang, J. J. Yang, Y. Chen, and H. Li, "Emerging non-volatile memories: Opportunities and challenges," in *Proc. 7th IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codesign Syst. Synth.*, 2011, pp. 325–334.
- [8] B. Van Essen, H. Hsieh, S. Ames, R. Pearce, and M. Gokhale, "DI-MMAP—A scalable memory-map runtime for out-of-core data-intensive applications," *Clust. Comput.*, vol. 18, pp. 15–28, Mar. 2015.
- [9] J. Zhang et al., "Revamping storage class memory with hardware automated memory-over-storage solution," in *Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA)*, 2021, pp. 762–775.
- 816 [10] S. Brin and L. Page, "The anatomy of a large-scale hypertextual
  Web search engine," *Comput. Netw. ISDN Syst.*, vol. 30, nos. 1–7,
  pp. 107–117, 1998.
- 819 [11] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan,
- "One trillion edges: Graph processing at Facebook-scale," *Proc. VLDB Endowment*, vol. 8, no. 12, pp. 1804–1815, 2015.
- M. Zhang, Y. Wu, Y. Zhuo, X. Qian, C. Huan, and K. Chen,
  "Wonderland: A novel abstraction-based out-of-core graph processing
  system," *ACM SIGPLAN Notices*, vol. 53, no. 2, pp. 608–621, 2018.
- [13] Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, "Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/O," in *Proc. USENIX Annu. Tech. Conf. (USENIX ATC)*, 2017, pp. 125–137.
- [14] A. Kyrola, G. Blelloch, and C. Guestrin, "GraphChi: Large-scale graph
   computation on just a PC," in *Proc. 10th USENIX Symp. Oper. Syst. Design Implement. (OSDI)*, 2012, pp. 31–46.
- [15] D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and
   A. S. Szalay, "FlashGraph: Processing billion-node graphs on an array
   of commodity SSDs," in *Proc. 13th USENIX Conf. File Storage Technol.*
- (FAST), 2015, pp. 45–58.
  [16] H. Liu and H. H. Huang, "Graphene: Fine-grainedIO management for
- graph computing," in *Proc. 15th USENIX Conf. File Storage Technol.* (*FAST*), 2017, pp. 285–300.
- K. K. Matam, G. Koo, H. Zha, H.-W. Tseng, and M. Annavaram,
  "GraphSSD: Graph semantics aware SSD," in *Proc. 46th Int. Symp. Comput. Archit.*, 2019, pp. 116–128.
- R. Kumar, J. Novak, and A. Tomkins, "Structure and evolution of online social networks," in *Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min.*, 2006, pp. 611–617.
- T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, "Detecting communities and their evolutions in dynamic social networks—A Bayesian approach," *Mach. Learn.*, vol. 82, pp. 157–189, Feb. 2011.
- 848 [20] A. P. Iyer, L. E. Li, T. Das, and I. Stoica, "Time-evolving graph
  processing at scale," in *Proc. 4th Int. Workshop Graph Data Manag. Exp. Syst.*, 2016, pp. 1–6.
- P. Macko, V. J. Marathe, D. W. Margo, and M. I. Seltzer, "LLAMA:
  Efficient graph analytics using large multiversioned arrays," in *Proc. IEEE 31st Int. Conf. Data Eng.*, 2015, pp. 363–374.
- <sup>854</sup> [22] U. Khurana and A. Deshpande, "Efficient snapshot retrieval over historical graph data," in *Proc. IEEE 29th Int. Conf. Data Eng. (ICDE)*, 2013, pp. 997–1008.
- P. Kumar and H. H. Huang, "GraphOne: A data store for real-time analytics on evolving graphs," *ACM Trans. Storage (TOS)*, vol. 15, no. 4, pp. 1–40, 2020.
- <sup>860</sup> [24] Y. Jin, C.-F. Wu, D. Brooks, and G.-Y. Wei, "S<sup>3</sup>: Increasing GPU utilization during generative inference for higher throughput," in *Proc. Adv. Neural Inf. Process. Syst.*, 2023, pp. 18015–18027.
- F. T. Hady, A. Foong, B. Veal, and D. Williams, "Platform storage performance with 3D XPoint technology," *Proc. IEEE*, vol. 105, no. 9, pp. 1822–1833, Sep. 2017.
- 866 [26] O. Krestinskaya, A. P. James, and L. O. Chua, "Neuromemristive circuits
   867 for edge computing: A review," *IEEE Trans. Neural Netw. Learn. Syst.*,
   868 vol. 31, no. 1, pp. 4–23, Jan. 2020.

- [27] K. Wu, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Towards and unwritten contract of Intel Optane SSD," in *Proc. 11th USENIX 870 Workshop Hot Topics Storage File Syst. (HotStorage)*, 2019, pp. 1–8.
- [28] C.-F. Wu, M.-C. Yang, Y.-H. Chang, and T.-W. Kuo, "Hot-spot suppression for resource-constrained image recognition devices with nonvolatile memory," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, 874 vol. 37, no. 11, pp. 2567–2577, Nov. 2018.
- [29] W. Cheong et al., "A flash memory controller for 15μs ultra-low-latency 876 SSD using high-speed 3D NAND flash with 3μs read time," in *Proc.* 877 *IEEE Int. Solid-State Circuits Conf. (ISSCC)*, 2018, pp. 338–340.
- [30] J. Zhang et al., "FlashShare: Punching through server storage stack from 879 kernel to firmware for ultra-low latency SSDs," in *Proc. 13th USENIX 880 Symp. Oper. Syst. Design Implement. (OSDI 18)*, 2018, pp. 477–492.
   [31] C.-F. Wu, Y.-H. Chang, M.-C. Yang, and T.-W. Kuo, "When storage 882
- [31] C.-F. Wu, Y.-H. Chang, M.-C. Yang, and T.-W. Kuo, "When storage response time catches up with overall context switch overhead, what is next?" *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 39, no. 11, pp. 4266–4277, Nov. 2020.
- [32] C.-F. Wu, Y.-H. Chang, M.-C. Yang, and T.-W. Kuo, "How to steal CPU idle time when synchronous I/O mode becomes promising," in *Proc.* 61th ACM/IEEE Design Autom. Conf. (DAC), 2024.
- [33] R. F. Freitas and W. W. Wilcke, "Storage-class memory: The next storage system technology," *IBM J. Res. Develop.*, vol. 52, no. 4.5, pp. 439–447, 890 Jul. 2008.
- [34] G. W. Burr, B. N. Kurdi, J. C. Scott, C. H. Lam, K. Gopalakrishnan, and
   R. S. Shenoy, "Overview of candidate device technologies for storageclass memory," *IBM J. Res. Develop.*, vol. 52, no. 4.5, pp. 449–464, 894
   Jul. 2008. 895
- [35] (Intel, Santa Clara, CA, USA). Product: Intel Optane Persistent Memory 696 (DCPMM). (2020). [Online]. Available: https://www.intel.com/content/ 897 www/us/en/products/docs/memory-storage/optane-persistent-memory/ 698 overview.html 899
- [36] A. Sainio et al. "NVDIMM: Changes are here so what's next: Memory 900 computer summit." 2016. [Online]. Available: https://www.snia.org/sites/ 901 default/files/SSSI/NVDIMM 902
- [37] M. Saxena and M. M. Swift, "FlashVM: Virtual memory management 903 on flash," in *Proc. USENIX Annu. Tech. Conf. (USENIX ATC)*, 2010, 904 pp. 1–14. 905
- [38] S. Van Doren, "HOTI 2019: Compute express link," in *Proc. IEEE Symp.* 906 *High-Perform. Interconnects (HOTI)*, 2019, p. 18. 907
- [39] D. D. Sharma, "Compute express link (CXL): Enabling heterogeneous goad data-centric computing with heterogeneous memory hierarchy," *IEEE 909 Micro*, vol. 43, no. 2, pp. 99–109, Mar./Apr. 2023.
- [40] M. Jung, "Hello bytes, bye blocks: PCIe storage meets compute express 911 link for memory expansion (CXL-SSD)," in *Proc. 14th ACM Workshop* 912 *Hot Topics Storage File Syst.*, 2022, pp. 45–51.
- [41] H. Li et al., "Pond: CXL-based memory pooling systems for cloud 914 platforms," in *Proc. 28th ACM Int. Conf. Archit. Support Program. Lang.* 915 *Oper. Syst.*, 2023, pp. 574–587.
- [42] I. Oukid, J. Lasperas, A. Nica, T. Willhalm, and W. Lehner, "FPTree: A 917 hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory," in *Proc. Int. Conf. Manag. Data*, 2016, pp. 371–386. 919
- [43] H.-Y. Cheng et al., "Future computing platform design: A cross-layer 920 design approach," in *Proc. Design, Autom. Test Eur. Conf. Exhib.* 921 (DATE), 2021, pp. 312–317. 922
- [44] I. Oukid, D. Booss, A. Lespinasse, W. Lehner, T. Willhalm, and 923 G. Gomes, "Memory management techniques for large-scale persistentmain-memory systems," *Proc. VLDB Endowment*, vol. 10, no. 11, 925 pp. 1166–1177, 2017. 926
- [45] J. Leskovec et al., "Stanford network analysis project." 2010. [Online]. 927 Available: https://snap.stanford.edu/ 928
- [46] T.-S. Lo, C.-F. Wu, Y.-H. Chang, T.-W. Kuo, and W.-C. Wang, "Space-*generation of the state of*
- [47] R. Chen, J. Shi, Y. Chen, B. Zang, H. Guan, and H. Chen, 933
   "PowerLyra: Differentiated graph computation and partitioning on 934 skewed graphs," ACM Trans. Parallel Comput., vol. 5, no. 3, pp. 1–39, 935 2019.
- [48] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, 937
   "PowerGraph: Distributed graph-parallel computation on natural 938
   graphs," in *Proc. 10th USENIX Symp. Oper. Syst. Design Implement.* 939
   (OSDI), 2012, pp. 17–30. 940
- [49] P. Vila, B. Köpf, and J. F. Morales, "Theory and practice of finding 941 eviction sets," in *Proc. IEEE Symp. Security Privacy (SP)*, 2019, 942 pp. 39–54.
- J. H. Ryoo, N. Gulur, S. Song, and L. K. John, "Rethinking TLB designs 944 in virtualized environments: A very large part-of-memory TLB," ACM 945 SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 469–480, 2017. 946
- [51] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change 947 memory as a scalable dram alternative," in *Proc. 36th Annu. Int. Symp.* 948 *Comput. Archit.*, 2009, pp. 2–13. 949