# iFKVS: Lightweight Key–Value Store for Flash-Based Intermittently Computing Devices

Yen-Hsun Chen<sup>10</sup>, Ting-En Liao<sup>10</sup>, and Li-Pin Chang<sup>10</sup>, Senior Member, IEEE

Abstract—Energy harvesting enables long-running sensing 2 applications on tiny Internet of Things (IoT) devices without 3 a battery installed. To overcome the intermittency of ambient 4 energy sources, system software creates intermittent computation 5 using checkpoints. While the scope of intermittent computation <sup>6</sup> is quickly expanding, there is a strong demand for data storage 7 and local data processing in such IoT devices. When considering 8 data storage options, flash memory is more compelling than 9 other types of nonvolatile memory due to its affordability and 10 availability. We introduce iFKVS, a flash-based key-value store 11 for multisensor IoT devices. In this study, we aim at supporting 12 efficient key-value operations while guaranteeing the correctness 13 of program execution across power interruptions. For indexing 14 of multidimensional sensor data, we propose a quadtree-based 15 structure for the minimization of extra writes from splitting 16 and rebalancing; for checkpointing in flash storage, we propose rollback-based algorithm that exploits the capabilities of 17 **a** 18 byte-level writing and one-way bit flipping of flash memory. 19 Experimental results based on a real energy-driven testbed 20 demonstrate that with the same index structure design, our 21 rollback-based approach obtains a significant reduction of 45% <sup>22</sup> and 84% in the total execution time compared with checkpointing 23 using write-ahead logging (WAL) and copying on write (COW), 24 respectively.

Index Terms—Checkpoint, flash storage, intermittent
 computation, key-value store.

#### 27

#### I. INTRODUCTION

<sup>28</sup> **S** MART Internet of Things (IoT) applications, such <sup>29</sup> **S** as building automation, transportation, healthcare, and <sup>30</sup> surveillance, involve a large number of distributed sensing <sup>31</sup> devices to monitor the environment and take action when nec-<sup>32</sup> essary. Typically, such tiny IoT devices are distributed over a <sup>33</sup> large-scale wireless network and deployed in remote locations <sup>34</sup> for long-lasting operation without human maintenance. To <sup>35</sup> meet the deploy-and-forget requirement, a promising direction <sup>36</sup> for the development of IoT devices is toward batteryless <sup>37</sup> design. Instead of draining energy from batteries, these devices <sup>38</sup> operate with ambient energy, which can be harvested from

Manuscript received 10 August 2024; accepted 10 August 2024. This work was supported in part by the National Science and Technology Council, Taiwan, under Grant MOST 110-2221-E-A49-029-MY3 and Grant NSTC 113-2221-E-A49-188-MY3. This article was presented at the International Conference on Embedded Software (EMSOFT) 2024 and appeared as part of the ESWEEK-TCAD special issue. This article was recommended by Associate Editor S. Dailey. (*Corresponding author: Li-Pin Chang.*)

The authors are with the Department of Computer Science, College of Computer Science, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: eric246871@gmail.com; teliao0116@gmail.com; lpchang@cs.nycu.edu.tw).

Digital Object Identifier 10.1109/TCAD.2024.3443698

solar energy [1], radio signal energy, kinetic energy, and <sup>39</sup> thermal energy [2].

Energy-harvesting devices use capacitors as an energy 41 buffer. Because ambient energy is highly unstable, devices 42 risk losing all computation progress when the stored energy 43 is depleted. Checkpoints are therefore introduced to manage 44 the loss across power interruptions: at a proper timing, a checkpoint is committed to save the program context to 46 nonvolatile memory, and on power recovery, the latest check-47 point is restored to resume program execution. In particular, 48 continuous checkpoints are periodically committed [3], while 49 just-in-time (JIT) checkpoints are committed right before 50 the device halts due to insufficient energy [4]. In contrast, 51 atomic tasks commit changes to the global memory on their 52 completion [5]. With the proposed checkpoints and atomic 53 tasks, program execution is thus intermittent, i.e., applications 54 can be long-running across unexpected power interruptions. 55

IoT devices constantly sample readings from multiple 56 microsensors. Recent studies have shown that, instead of 57 uploading sensor data to the cloud for processing, IoT devices 58 can benefit from local data storage and processing: sensor data 59 can be digested and compressed locally to reduce the cost of 60 wireless data transmission [6], queries can be handled locally 61 in sensors for improved responsiveness [7], and in healthcare 62 applications, processing data locally in sensor devices enables 63 anonymization and access control of personally identifiable 64 data [8]. A multisensor IoT device acquires multidimensional 65 data, where each data record typically comprises a timestamp 66 and multiple sensor readings. Although a file system can easily 67 store and query time-series data, it cannot, however, handle 68 event-based queries efficiently. Such queries often involve 69 different conditions on multiple dimensions [9], e.g., "List all 70 events in the last week where the air temperature is above 71 40 °C and the relative humidity is above 60%." In contrast, 72 these queries are better handled by a key-value store through 73 multidimensional data indexing. 74

The inclusion of a key-value store in energy-harvesting 75 devices cannot succeed without considering the memory cost. 76 This is because, compared to working memory, data storage 77 demands a much larger space. We consider flash memory 78 because it offers a superior capacity per unit cost and is readily 79 available on many platforms. To the best of our knowledge, 80 this study is the first on the support of intermittency for flash-81 based key-value store. Now, in addition to fast key-value 82 operations, new design challenges arise regarding efficient 83 checkpoint operations: first, rewriting in flash is not possible 84 without erasure. With this constraint, to restore a checkpoint,

1937-4151 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information. <sup>86</sup> every change that occurs after the most recent checkpoint
<sup>87</sup> must be explicitly undone from flash memory. Second, in prior
<sup>88</sup> work, checkpointing involves only the program context but not
<sup>89</sup> the storage space. Restoring the program state and undoing
<sup>90</sup> flash changes separately without global coordination creates
<sup>91</sup> inconsistency between the program and storage.

This study presents iFKVS, a lightweight key-value store 92 93 for energy-harvesting devices. In addition to functional cor-94 rectness, iFKVS aims for a reduction in write cost because, 95 compared to reading, writing flash memory is more expensive terms of time and energy. iFKVS uses two techniques to 96 in 97 write flash: 1) byte logging and 2) one-way bit flipping (see <sup>98</sup> Section II-B). For indexing of multidimensional sensor data, <sup>99</sup> iFKVS employs a flash-efficient design of quadtrees [10], for which each tree node is managed as a tiny log space. This way, 100 while efficient insertions are possible through fine-grained, 101 102 out-of-place logging, existing data of a node remain intact for <sup>103</sup> subsequent checkpoint restoration.

Checkpointing for iFKVS is based on rollback, a backup-104 105 before-modify approach. iFKVS maintains a global undo log 106 to collect backups of tree nodes and program contexts in 107 chronological order. Thanks to our log-structured node design, when modifying a tree node, rather than making a full backup 108 of the node, iFKVS simply marks the flash address of the 109 110 most recent write to the node. On checkpoint restoration, 111 all data written after this mark will be undone from flash <sup>112</sup> memory. Now, when recovering from a power interruption, 113 iFKVS examines the global undo log, undoes node changes, 114 and restores the program context. The process of undoing 115 node changes is also optimized for flash memory: rather than 116 erasing changes made after the previous checkpoint, iFKVS <sup>117</sup> neutralizes these changes by zeroing them out using one-way 118 bit flipping. On the other hand, to commit a checkpoint, iFKVS 119 simply makes a backup of the program context and clears 120 up the global undo log. In summary, this work makes the 121 following contribution:

- 122 1) proposing a flash-efficient key–value index design for 123 multidimensional sensor data;
- 124 2) introducing a flash-efficient checkpoint algorithm to
   125 enable intermittent computation on a flash-based key–
   126 value store;
- 3) presenting a mechanism for global coordination betweenthe program state and flash state across checkpoints.

We successfully implemented our iFKVS design on Texas Instruments MSP430F5529 SoC. For performance evaluation and comparison, we also implemented the proposed tree structure on top of two additional checkpoint techniques, write-ahead logging (WAL) and copying on write (COW). Our results show that iFKVS achieved an overhead reduction by us to 84% in terms of the total execution time required to complete a workload under realistic power fail events.

### 137 II. BACKGROUND AND MOTIVATION

### 138 A. Intermittent Computation

<sup>139</sup> Batteryless IoT devices harvest ambient energy, store it <sup>140</sup> in a capacitor, and operate with this stored energy until <sup>141</sup> it is depleted. However, when running out of the stored



Fig. 1. Concept and design of intermittent computation. (a) Program progresses with checkpoints. (b) Memory organization of SRAM (working memory) and flash (storage).

energy, the devices lose all volatile program context, including 142 contents in the CPU registers and volatile memory. To avoid 143 re-execution from scratch, energy-harvesting devices timely 144 commit a checkpoint to preserve the program context. A piece 145 of nonvolatile memory is thus adopted to store the program 146 state for later restoration. Fig. 1(a) shows program execution 147 with checkpoints: The program commits a checkpoint at time 148  $t_1$  and continues to execute. Later at time  $t_2$ , the device ceases 149 to operate due to a power interruption. At time  $t_3$ , the capacitor 150 is sufficiently charged and the device restarts. By restoring 151 the prior checkpoint, the execution progress is reverted to that 152 at  $t_2$ , i.e., only the progress between  $t_1$  and  $t_2$  is lost. Here, 153 we refer to everything that happens before the checkpoint to 154 be pre-checkpoint, and all the others to be post-checkpoint. 155 For example, restoring the checkpoint effectively discards the 156 post-checkpoint progress between  $t_1$  and  $t_2$ . 157

In this study, we consider a typical memory organization, <sup>158</sup> which is readily available on many embedded platforms, as <sup>159</sup> shown in Fig. 1(b). The upper half is a piece of volatile SRAM <sup>160</sup> that serves as the working memory. The SRAM holds read-<sup>161</sup> write memory sections, including global variables and task <sup>162</sup> stacks. The lower half is a piece of large, nonvolatile flash <sup>163</sup> memory, commonly referred to as NOR flash. Because the flash <sup>164</sup> is byte-addressable and capable of execute-in-place (XIP), it <sup>165</sup> serves as a unified memory space for code storage (through <sup>166</sup> XIP) and sensor data storage (through key–value store). *A* <sup>167</sup> small portion of the flash memory is reserved for the program <sup>168</sup> context backup and the flash undo log. <sup>169</sup>

With continuous checkpoints [3], programs continue to 170 execute after committing a checkpoint. On power recovery, 171 post-checkpoint writes must be undone from flash memory. 172 In contrast, with JIT checkpoints [4], programs suspend right 173 after committing a checkpoint. However, energy estimation is 174 not always correct [11], so JIT checkpoints are not free from 175 post-checkpoint progress and a flash undo algorithm is always 176 necessary. 177

#### B. Flash Memory Characteristics

*Performance:* In prior studies, both byte-addressable flash <sup>179</sup> memory (specifically, NOR flash) and ferroelectric RAM <sup>180</sup> (FRAM) are often considered in the design of energy- <sup>181</sup> harvesting devices. Table I shows a comparison between <sup>182</sup> the two. Both flash and FRAM can store executable code <sup>183</sup> because they are byte-addressable and pin-compatible with <sup>184</sup>

TABLE I Comparison of Flash and FRAM

|                        |        | (NOR)                                   | ) Flash     | FRAM              |             |  |
|------------------------|--------|-----------------------------------------|-------------|-------------------|-------------|--|
| Read write size        |        | Byte,                                   | word        | Byte, word        |             |  |
| eXecute-In-Place       |        | Yes                                     |             | Yes               |             |  |
| Erase size             |        | 512B s                                  | egment      | _                 |             |  |
| Price (256KB)          |        | USD 0.38 (WinBond)                      |             | USD 7.4 (Fujitsu) |             |  |
|                        |        | Latency                                 | Energy      | Latency           | Energy      |  |
| Read                   |        | 600 ns/byte                             | 6 nJ/byte   | 250 ns/byte       | 1.4 nJ/byte |  |
| Write                  |        | 18 $\mu$ s/byte                         | 209 nJ/byte | 250 ns/byte       | 1.7 nJ/byte |  |
| Erase                  |        | 50 $\mu$ s/byte                         | 420 nJ/byte |                   |             |  |
| 300<br>200<br>100<br>0 | $\int$ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | M           | $\sim \sim$       | M           |  |
| (                      | ) 10   | 20 30                                   | 40 50       | 60 70 8           | 0 Time (s   |  |

Fig. 2. Outdoor solar power collected by a 5.4-cm<sup>2</sup> solar panel.

<sup>185</sup> the processor. Flash supports byte-level writes and is thus <sup>186</sup> highly friendly for managing key–value index structure. While <sup>187</sup> FRAM is erase-free, flash involves erasure of segments to <sup>188</sup> reclaim writable memory space. We measured the latency and <sup>189</sup> energy performance of flash memory and FRAM using the <sup>190</sup> EnergyTrace feature from Texas Instruments MSP430F5529 <sup>191</sup> and MSP430FR5994, respectively. The lower half of Table I <sup>192</sup> shows that the read latencies of flash and FRAM are within <sup>193</sup> the same order of magnitude, although flash is slower than <sup>194</sup> FRAM. In contrast, flash writes are slower and consume more <sup>195</sup> energy and require subsequent erasure.

*Cost:* Table I shows that flash holds a significant advantage right in terms of storage design, as it costs only about 5% of the price of FRAM for the same 256-kB size. As of mid-2024, a 10-mF capacitor costs U.S. \$3, and a 128-kB flash-based MSP430 SoC costs U.S. \$9, making FRAM less attractive for storage-demanding sensing applications.

Applicability: The use of flash memory is subject to the 202 <sup>203</sup> scenario and specification of the target application [12]. Consider an outdoor sensing application requiring at least 512 204 <sup>205</sup> kB of data storage. The first issue is the ambient power. Fig. 2 206 depicts that outdoor solar power from a small panel often <sup>207</sup> exceeds 100 mW.<sup>1</sup> In our experiments, a flash-based MSP430 SoC operates at a duty cycle of 82% under 3.3 V×7 mA≈23 208 W with a 10-mF capacitor, so this ambient power is more m 209 than sufficient to drive the flash-based SoC. The second issue 210 on capacity. As of 2024, within the lineup of MSP430is 211 sed SoCs, the maximum embedded FRAM size is 256 kB, 212 ba whereas embedded flash can reach 512 kB. External memory 213 considered for size expansion [13], [14]. While the largest 214 is 215 external (serial) FRAM from major distributors is 16 Mb, 216 the largest external NOR flash reaches 1 Gb. Nevertheless, <sup>217</sup> when the data writing rate is extremely high [15] or when the ambient power is weak, flash is unlikely a viable option. 218

*Operations:* Index management heavily involves finegrained writes, e.g., adding a pointer or changing a flag bit. In flash, a bit of 1 can be changed to 0, but once a bit is set to 0, it cannot be changed. Bits in flash can only be reset to 1 through segment erasure. Byte writing in flash is feasible if the target



Fig. 3. Timeline of intermittent execution and instantiation of inconsistency between program and storage.

address is unwritten. The system software can individually set <sup>224</sup> a bit to 0 using a bitmask. For example, overwriting 0x55AA <sup>225</sup> with 0xFF77 results in 0x5522. This operation, called *one-* <sup>226</sup> *way bit flipping*, will be used in our approach to neutralize <sup>227</sup> post-checkpoint data and to change flag bits. <sup>228</sup>

#### C. Motivational Example

A storage-enabled energy-harvesting device operates on not 230 only the program context but also the storage state. However, 231 to support intermittent computation, existing checkpoint algo- 232 rithms are concerned with the program context only, and little 233 effort has been made toward how to checkpoint on both. 234 Without proper coordination between the two, after restoring 235 a checkpoint on power recovery, an energy-harvesting device 236 risks state inconsistencies between program and storage. 237

Fig. 3 demonstrates an inconsistency resulted from 238 program-only checkpointing. The lower half depicts the flash 239 state, in which a segment, a unit for flash erasure, has four 240 words at addresses from 0F0h to 0F6Ch. The program context, 241 depicted in the upper half, is in SRAM. Let the program 242 maintains a write pointer that refers to the next available flash 243 address for writing. Now, at time  $t_1$ , words X and A have 244 been written to flash and the write pointer refers to 0F4h. The 245 program commits a checkpoint at time  $t_2$  and then at time 246 t<sub>3</sub>, it samples word B from a sensor, writes B to 0F4h, and <sup>247</sup> advances the write pointer to 0F6h. After time  $t_3$ , the device 248 undergoes a power outage. At time t4, the device restores 249 the prior checkpoint and reverts the program context to the 250 state at time  $t_2$ . Notice that the write pointer, part of the 251 program context, is returned to 0F4h, and this address has 252 been occupied by word B. At time  $t_5$ , the program samples 253 a new word C from the sensor and attempts to write C to  $_{254}$ the already-written address 0F4h. As flash memory prohibits 255 in-place updating without prior erasure, the write results in 256 erroneous contents at address 0F4h. 257

Power events can also cause storage metadata inconsistencies, as noted in [16]. Such errors can be avoided by extending the semantics of checkpoint to the storage. Checkpointing the semantics of checkpoint to the storage. Checkpointing the erase-free term flash memory is challenging because unlike the erase-free term nonvolatile memory such as FRAM, flash memory cannot the rewritten without prior erasure. Restoring a checkpoint to the rewritten without prior erasure. Restoring a checkpoint term term term term term term terms of segments, and in this example, the term term term term terms of segments, and in this example, term terms of terms of terms terms

<sup>&</sup>lt;sup>1</sup>The power traces are used for evaluation in Section V-F.

<sup>268</sup> the checkpoint algorithm must first create a backup of words <sup>269</sup> X and A, erase the segment (0F0h to 0F7), and then copy <sup>270</sup> words X and A back. In this study, our design goal is <sup>271</sup> to maintain synchronization between program and storage <sup>272</sup> for checkpoint operations, while also exploring innovative <sup>273</sup> methods to efficiently neutralize post-checkpoint writes from <sup>274</sup> flash memory.

## 275 D. Related Work

<sup>276</sup> With energy harvesting, long-term computation employs <sup>277</sup> checkpoints to survive power interruptions with managed <sup>278</sup> progress loss. There are different types of checkpoint methods.

Continuous checkpoints, illustrated by Choi et al. [3]
 and Maeng and Lucia [17], are committed by the system
 software in a regular time interval and restored on power
 recovery.

20 Compiler-directed checkpoints, including the technique proposed by Liu et al. [18], rely on compile-time analysis of write-after-read (WaR) dependencies among variables. Checkpoints are committed at the boundaries of idempotent program blocks.

Atomic tasks, pioneered by Maeng et al. [5], commit
 local changes to the global memory upon their comple tion.

4) JIT checkpoints, demonstrated by Maeng and Lucia [4],
 involve additional hardware to monitor the voltage of
 the capacitor. A checkpoint is committed only when the
 voltage level is critically low. Our key-value store is
 designed to be independent of the checkpointing method.

While energy is a sacred resource in energy-harvesting 296 <sup>297</sup> devices, the introduction of advanced energy buffers [11], [19] <sup>298</sup> has transformed them from dumb, wimpy devices into capable 299 platforms. For example, Fraternali et al. [20] demonstrated wireless communication with Bluetooth low energy (BLE) 300 battervless devices. Gobieski et al. [21] showed the on 301 feasibility of handwriting recognition and keyword spotting 302 based on compact CNN models powered by energy harvesting. 303 Montanari et al. [22] further optimized the tradeoff between 304 305 energy usage and inference accuracy using multiresolution and multiexit CNN models. Mendis et al. [23] proposed 306 307 intermittency-aware architecture search for neural networks to boost the chance of successful interference. Smart applications 308 309 need extensive computation and data access, making local 310 storage and retrieval of sensor data increasingly critical.

The use of flash memory and other nonvolatile memory options enables the local storage of sensor data. Dai et al. [24] demonstrated a flash-based, log-structured file system for and sensor devices. Exploiting the append-only property of sensing applications, the file system preallocates a dedicated flash space for each file to grow. Tsiftes et al. [25] presented Coffee arr to enhance this approach using micro-logs to optimize small updates. Mazumder and Hallstrom [15] proposed LoggerFS to are high-speed write buffer and NAND flash is the final storage high-speed write buffer and NAND flash is the final storage better handle event- or value-based queries. Lin et al. [6] presented flash-efficient implementation of a hash table and



Fig. 4. Index structures for multidimensional data: (a) grid file, (b) R-tree, and (c) quadtree.

a grid file for sensors. Fevgas and Bozanis [26] proposed <sup>324</sup> buffering random writes for efficient insertion into flash-based <sup>325</sup> grid files. Chang and Hsu [27] introduced soft lists, a flashnative implementation of skip lists with efficient flash garbage <sup>327</sup> collection. However, in these studies, checkpointing is either <sup>328</sup> not discussed at all or considered too expensive. <sup>329</sup>

iNVMFS, proposed by Wu et al. [16], is a lightweight, <sup>330</sup> checkpoint-enabled file system. However, iNVMFS is heavily <sup>331</sup> based on the in-place updating capability of erase-free FRAM, <sup>332</sup> which makes it incompatible with flash memory. Our work <sup>333</sup> deviates from iNVMFS by addressing the unique constraints <sup>334</sup> of flash operations. In addition, our approach handles queries <sup>335</sup> on multidimensional data while iNVMFS does not. <sup>336</sup>

#### III. INDEXING AND CHECKPOINTING

337

341

This section introduces multidimensional data indexing and <sup>338</sup> storage checkpointing, laying the foundation for our approach. <sup>339</sup> Experienced readers can proceed to Section IV. <sup>340</sup>

#### A. Indexing of Multisensor Data

For long-term sensor data storage, a file system conveniently <sup>342</sup> creates day directories for hour files, in which sensor data are <sup>343</sup> logged. This permits simple retrieval of data within a specified <sup>344</sup> time interval. An IoT device collects multidimensional data <sup>345</sup> from multiple sensors. However, queries on multisensor data <sup>346</sup> involve distinct conditions on different dimensions, and to <sup>347</sup> process such queries, a file system has no choice but inspects <sup>348</sup> every record in all corresponding hour files. <sup>349</sup>

There have been excellent index designs for 350 multidimensional data, and for the ease of illustration, we 351 consider a dimension size of two. Fig. 4(a) shows a grid 352 file [6], in which data items are distributed among fixed-sized 353 cells of the grid on the XY plane. As cells occupy the same 354 size, address calculation and data transfer of cell data are 355 highly efficient. A drawback of grid files is the poor space 356 utilization under a highly uneven data distribution, and in 357 addition, cell overflows may trigger expensive rehashing of 358 the grid. Fig. 4(b) depicts an R-tree [28], which manages 359 the bounding boxes in a tree-based hierarchical manner. Like 360 B-trees, R-trees split and merge for height balancing and 361 space compacting. However, these self-balancing operations 362 introduce a large amount of writes. In addition, an R-tree may 363



Fig. 5. Various techniques for checkpointing in storage. (a) Rolling back. b) WAL. (c) COW. (d) Log-structuring.

<sup>364</sup> search multiple paths on query because it permits a partial <sup>365</sup> overlap between boxes.

Fig. 4(c) shows a quadtree [10], in which a node represents are a region. When a node is full, the region it represents is partitioned into four quadrants, each of which is associated with a mew child node. In this study, our key–value store is designed based on quadtrees because a few of their properties well match our purpose: unlike grid files, in which all cell spaces have been preallocated, quadtrees creates new nodes only when necessary. Unlike R-trees, which permit overlapping between bounding boxes and involve extra writes for selfbalancing, quadtrees efficiently search in disjoint quadrants and do not use extra writes for rebalancing.

## 377 B. Checkpointing in Storage

The checkpointing in working memory is typically based on variables, whereas storage checkpointing involves larger objects (such as tree nodes or data blocks) to save metadata sat space. Existing techniques, as discussed below, share a prinsaz ciple that avoids to modify pre-checkpoint data objects.

Fig. 5(a) illustrates the operation of rolling-back, a backup-383  $_{384}$  before-modify approach. Suppose that we modify node C in 385 the tree structure. In the first step, the pre-checkpoint node  $_{386}$  C is copied to a backup space and is ready for in-place  $_{387}$  updating. In the second step, the original node C is updated in <sup>388</sup> place and becomes node C'. Committing a checkpoint involves 389 discarding the backup node, whereas restoring the previous 390 checkpoint necessitates replacing node C' with node C, effec-<sup>391</sup> tively reverting to the previous state. In contrast, Fig. 5(b) shows WAL and how it handles the same update. In the first 392 393 step, the up-to-date node C' is added to a log space (also known as the redo log) without altering the pre-checkpoint 394 <sup>395</sup> node C. Committing a checkpoint involves overwriting node Cwith node C' (step 2) followed by clearing the log. Restoring 396 397 the prior checkpoint can be done by discarding all post-398 checkpoint data, i.e., clearing the log. Both rolling back and WAL involve in-place node updating in their second steps, 399 which seems prohibited in flash. However, this operation can 400 401 be implemented by node logging, as will be shown in later 402 sections.

Fig. 5(c) depicts COW. The update to node *C* is handled in an out-of-place manner to avoid modifying pre-checkpoint data. After this, the parent node *A* must also be updated to refer to node C', and the update is again handled in an dor out-of-place manner. The copying of nodes, operated by the wandering tree algorithm, propagates upstream until reaching the root or a post-checkpoint node. *A* checkpoint is committed



Fig. 6. Node structure. Each node is a tiny log space. (a) Node structure. (b) Leaf node. (c) Internal node.

by discarding nodes *A* and C, which are unreachable from the <sup>410</sup> new root A'. The prior checkpoint is restored by discarding all <sup>411</sup> post-checkpoint nodes, i.e., nodes A' and C'. Fig. 5(d) shows <sup>412</sup> log-structuring, which performs out-of-place writing always. <sup>413</sup> *A* signature of log-structuring is the use of a mapping table <sup>414</sup> in the working memory, which eliminates the necessity for <sup>415</sup> path copying, as seen in COW. Here, updating to node *C* is <sup>416</sup> accomplished by appending a new node C' and then updating <sup>417</sup> the mapping table accordingly. The mapping table must be <sup>418</sup> backed up and restored on checkpoint operations. <sup>419</sup>

We implemented rolling back, WAL, and COW in our experimental study. Notice that log structuring is not considered 421 because the tiny SRAM of the IoT platform that we use cannot 422 afford the space overhead of the node mapping table. 423

# IV. LIGHTWEIGHT KEY–VALUE STORE WITH EFFICIENT 424 CHECKPOINT SUPPORT 425

#### A. Node and Tree Structure

To achieve efficient query processing with a reduced write <sup>427</sup> frequency, we propose a flash-efficient implementation of <sup>428</sup> quadtrees. *A* quadtree is composed by nodes, and in our design, <sup>429</sup> nodes contain pointers only, as shown in Fig. 6(a). *A* node is <sup>430</sup> an array of pointer slots, and because flash memory permits <sup>431</sup> byte writing, a pointer slot can be either available, in-use, or <sup>432</sup> obsolete (neutralized). An in-use pointer can be neutralized <sup>433</sup> by writing zeros through one-way bit flipping, as will be <sup>434</sup> discussed later. Our design treats a node as a tiny logging space <sup>435</sup> of pointers. This design greatly aids checkpoint operations <sup>436</sup> because new pointers are inserted in the order of time, so it <sup>437</sup> will be straightforward to distinguish pre-checkpoint pointers <sup>438</sup> from post-checkpoint ones.

Quadtrees always insert new data to leaf nodes. Fig. 6(b) 440 shows that two new records are inserted to a leaf node, and 441 the first two slots are allocated to pointers referring to the 442 data records. The pointer slots of a leaf node are sequentially 443 allocated to refer to new records, and when all pointer slots 444 have been used, a node-split procedure is taken. Let the 445 dimension size be two for the purpose of illustration. As 446 Fig. 6(c) shows, after the split, the original node becomes the 447 parent of four new leaf nodes (NE to NW). Here, we take a 448 few measures to simplify the node structure for saving flash 449 writes. First, to save pointers, we propose allocating the four 450 child nodes in a contiguous flash space so that the parent 451 node uses only one pointer to refer to all the child nodes. 452 Second, on node split, the node space is partitioned into equal 453 subspaces without referring to a pivot, and this saves another 454 pointer. Note that if a high data skewness is anticipated, a 455 pivot pointer can be added for pivot-based splitting. Third, 456



Fig. 7. Our log-and-mark approach. The write mark indicates the current write address of the log at checkpoint committing.



Fig. 8. Data layout in flash (upper half) and contents in node *A* and the undo log (lower half). Gray portions are post-checkpoint data.

<sup>457</sup> we introduce *lazy split*, which retains all data pointers of a <sup>458</sup> node during splitting to avoid unnecessary flash writes. In <sup>459</sup> contrast, traditional quadtrees require migrating data items <sup>460</sup> from a parent node to its child nodes during a split.

#### 461 B. Global Undo Log

He2 Because key-value operations involve writing to flash, He3 we detail the design of our global undo log and explain He4 how it operates. Fig. 7 depicts the core idea of our design, He5 the log-and-mark approach: flash writing is always handled He6 through logging. Checkpoint committing marks the current He7 write address of the log, and all subsequent writes that occur He8 after this mark produce post-checkpoint data. Writing to flash He9 is subject to two simple rules: 1) pre-checkpoint data cannot He7 be modified and 2) post-checkpoint data must be undone from He7 flash on checkpoint restoration.

Fig. 8 shows an example of the flash-space layout. *A* reserved area appears at the beginning of the flash, and this area contains the backup space for the program state in ar5 addition to the global undo log. In our design, the main flash ar6 space is divided into two areas, one for tree nodes and the ar7 other for data records. The two areas are both log spaces. Node ar8 allocation starts at the lowest address, and record allocation begins at the highest address. The two disjoint areas grow the toward each other. By separating nodes from records, pointers the can greatly reduce the number of bits needed to refer to area objects, as they only need to encode object offsets instead of the addresses.

Writing to the node area and the record area follows the log-and-mark approach. The lower half of Fig. 8 shows the global undo log, in which all write marks are collected. In particular, the node area is associated with write mark w<sub>n</sub>, which indicates the current write address of the node area when the most recent checkpoint is committed. In other words, everything that appears before w<sub>n</sub> is considered pretice the pretice the technic t



Fig. 9. Write decision flows.

 $w_n$  is classified as post-checkpoint, as represented by node B. <sup>492</sup> Accordingly, the record area is associated with a write mark <sup>493</sup>  $w_r$  as the boundary between pre-checkpoint records and postcheckpoint ones. Notice that because tree nodes themselves are <sup>495</sup> tiny logging spaces, a node may contain both pre-checkpoint <sup>496</sup> data and post-checkpoint ones. For example, although node <sup>497</sup> *A* is in the pre-checkpoint node area, it accepts new pointers <sup>498</sup> after the latest checkpoint. To reflect this, a write mark  $w_A$  is <sup>499</sup> created for node A, marking that everything appears after  $w_A$  <sup>500</sup> a piece of post-checkpoint data. <sup>501</sup>

Fig. 9 details the write control flow. For example, the fourth 502 decision flow shows the procedure to add a new pointer to 503 node A in Fig. 8. Basically, when writing to an area, our design 504 examines whether this is the first write to the area after the 505 most recent checkpoint. A write mark is created for the area 506 only if it is the first write to the area, and the write mark is 507 added to the global undo log. No write mark will be created 508 on subsequent writes. Writing to an existing tree node follows 509 the same logic. Consequently, a node has a write mark in the 510 global undo log only if it has been modified since the most 511 recent checkpoint. Our design uses a tiny hash table in the 512 working memory to efficiently check for the presence of a 513 write mark in the undo log. As will be detailed later, during 514 checkpoint restoration, the write marks aid the identification 515 of post-checkpoint data, e.g., the gray portions in Fig. 8. 516

# C. Key-Value Operations

A query on a key–value store is either a *get* or a *scan*. A get 518 operation looks for an exact match; a scan operation, or a range 519 query, returns all data records that satisfy a set of conditions 520 across different dimensions. Scans are more useful to eventbased queries. Our tree index follows the original quadtree 522 algorithm for query processing with the following exceptions: 523 because our nodes are log spaces, records associated with a 524 node are unsorted, requiring examination of all records during 525 query processing. In addition, with our lazy split, internal 526 nodes are not empty, so their contents must also be inspected 527 for matches. 528

517

Write-oriented operations include *put*, *update*, and *delete*. A <sup>529</sup> put operation inserts a new record. To handle a put request, <sup>530</sup> our tree index first writes a new data record and then modifies <sup>531</sup> the tree index. The index modification includes locating the <sup>532</sup> leaf node to insert, conducting node split if necessary, and <sup>533</sup> adding a new record pointer to the target node. Here, to support <sup>534</sup>



Fig. 10. Global undo log (a) right after checkpoint commit and (b) after multiple post-checkpoint writes to the node area.

535 checkpoint operations, writing to flash must comply with the <sup>536</sup> principle of our log-and-mark approach. As Fig. 9 shows, the 537 first and the second decision flows show how to write a new <sup>538</sup> node or a new record, while adding a new pointer to an existing node goes through the third or the fourth decision flow. As 539 will be explained in later sections, our checkpoint method is 540 based on rollback. The original rolling-back method requires 541 create a backup of a node on the first post-checkpoint 542 to  $_{543}$  modification to the node [see Fig. 5(a)]. Interestingly, because 544 our node structure is a log, post-checkpoint writes do not affect 545 pre-checkpoint data in nodes. In other words, the node backup 546 step of rolling back is implemented simply by creating a node write mark and adding it to the global undo log. 547

In this study, random deletion and update of existing data records are not considered, as they are less useful to sensing plications [6], [15], [24]. Instead, our flash cleaning policy removes expired data from flash, as will be shown later.

#### 552 D. Checkpoint Operations

Commit: Checkpoints are committed periodically or on 553 <sup>554</sup> low-power events. To commit a checkpoint, a backup of the 555 program context, including the CPU registers, data/bss section, 556 and stack section, are written to a reserved flash space (see Fig. 8). The commit procedure is then finished by writing 557 commit record to the global undo log. A commit record а 558 559 contains a pointer referring to the program backup, a commit 560 signature, and a checksum of the record. Now, because our method for checkpointing in flash is based on rollback, when 561 562 a commit record has been written, all write marks in the <sup>563</sup> global undo log are obsolete (since nothing needs to be rolled 564 back). In our design, the global undo log is a circular buffer <sup>565</sup> composed by a set of flash segments. Fig. 10(a) shows a global <sup>566</sup> undo log of two segments, in which a new checkpoint has <sup>567</sup> been committed by writing commit record  $c_6$ . In this example, 568 all write marks before the final commit record  $c_6$  can be discarded. Here, segment 1 can be erased but not segment 2, 569 570 because commit record  $c_6$  must remain valid.

*Restore: A* checkpoint is restored when the device recovers from a power interruption. The restoration procedure first scans the global undo log for the last commit record. *A* write mark appearing after the last commit record is a pointer to the start of post-checkpoint data. These data, located in the node area, the record area, or in a tree node, must be undone from flash. For example, in Fig. 10(b), write mark  $w_7$  is added to the global undo log to indicate that new nodes have 578 been written to the node area after the latest checkpoint. To 579 restore the flash state, the undo procedure erases all the postcheckpoint nodes from the node area. Because segment x + 1 581 contains both pre-checkpoint nodes and post-checkpoint ones, 582 it undergoes *erase-based undo*: The undo procedure copies the pre-checkpoint nodes to a backup space, erases the segment, 584 and copies the nodes back. Segment x + 2 is erased directly 585 because it contains post-checkpoint nodes only. This procedure then restores the program context to resume execution. 587

The erase-based undo procedure resets the flash space 588 occupied by post-checkpoint data, ensuring that the flash 589 state is precisely reverted to its state at the latest checkpoint. 590 This involves an extra overhead of data copying. To avoid 591 this overhead, we propose relaxing the definition of global 592 consistency to ensure that the program never mistakenly writes 593 to a flash space that has already been written. To achieve 594 this, we introduce *discard undo* to neutralize post-checkpoint 595 data. Recall that flash is capable of one-way bit flipping. 596 i.e., individual bits can be zeroed out and this operation is 597 idempotent. Now, on checkpoint restoration, for each write 598 mark found in the global undo log, our approach scans flash 599 space after the mark and writes zero until an erased byte 600 (whose value is 0xff) is encountered. The first free addresses 601 of the node area and the record area are updated accordingly. 602 This procedure applies to tree nodes as well, and in subse- 603 quent key-value operations, zero (neutralized) pointers will be 604 ignored. 605

The undo log plays an essential role to the checkpoint 606 correctness. We safeguard write marks and commit records 607 using checksums. During a reboot scan, if the last log object 608 is a valid commit record, then both the program and storage 609 have been checkpointed; otherwise, both will be restored. For 610 the former case, recall that a commit record has a pointer to 611 the program backup, so the device simply restores the backup 612 and resumes execution. For the latter case, only the last log 613 object might fail the integrity check due to interrupted writing. 614 A corrupted commit record is discarded, and a broken write 615 mark is also discarded as its corresponding post-checkpoint 616 data are not yet written. After this, valid write marks and the 617 prior commit record in the log [e.g.,  $w_7$  and  $c_6$  in Fig. 10(b)] 618 are used for flash undo and program restoration, respectively. 619 For flash undo, while the discard undo (our main proposal) 620 is idempotent, the erase-based undo requires a reserved flash 621 space with a checksum-protected header to make the copy- 622 erase-copyback procedure fail-safe. 623

# E. Flash Space Cleaning

As flash prohibits in-place updating, new data are written to free space. Over time, the amount of free space becomes insufficient and garbage collection must be involved. Conventional copy-based garbage collection strategically selects a segment for erasure, and before erasing the segment, all valid data must be migrated to another free space. In spite of the extra overhead of data movement, this procedure requires an extra layer of indirection because data movement silently changes the flash addresses of data objects.



Fig. 11. Our testbed for energy-driven experiments. The target board runs our key-value store.

Random deletion and update of sensor data are not useful 634 635 in sensing applications [6], [15], [24]. In contrast, queries are 636 more concerned with recent events [9], so old data can be 637 expired and deleted from local storage. We propose a copy-638 free, partition-based flash cleaning method. The entire flash 639 memory is divided into a few equal-sized partitions, each of 640 which is managed by an independent instance of our tree 641 index. With this design, query processing involves all tree 642 instances. Partitions are used in a first-in-first-out manner for writing: new data are written to the current partition, and when 643 644 it is full, the oldest partition is erased. Notice that because 645 the oldest partition contains a pre-checkpoint tree instance, the 646 partition is marked but cannot be erased until a new checkpoint 647 has been committed. This partition-based method not only 648 eliminates the need for data copying but also ensures wear 649 leveling in flash memory.

Both the performance of insertion and query scale with beformation the quadtree algorithm. The overhead of commit is bounded by the log size due to the recycling of obsolete log segments, and the overhead of restore depends on the amount of postbeformation the checkpoint data for undo, i.e., the checkpoint interval length.

# V. EXPERIMENTAL RESULTS

#### 656 A. Experimental Setup

655

1) Parameters and Metrics: We implemented our 657 658 approach, called iFKVS, based on a Texas Instruments 659 LaunchPad. The platform involves an MSP430F5229 SoC, which is equipped with a 16-bit MSP430 embedded processor. 660 The processor is rated at 8 MHz and is integrated with 8 661 662 kB of SRAM and 128 kB of flash. The flash segment size 512 bytes. Our iFKVS implementation uses less than 200 663 is 664 bytes of SRAM for the volatile program context, including the 665 read-write sections for data, bss, and stack. Furthermore, its 666 executable code in flash is less than 7 kB. In addition to the 667 executable binary, the embedded flash is shared by the global 668 undo log and our key-value store. In particular, the latter 669 part uses 80 kB, which is partitioned into four equal-sized 670 partitions. The node size and the record size are 64 and 8 671 bytes, respectively.

We conducted experiments on an energy-driven testbed, as depicted in Fig. 11. The target board runs our key–value store, and its power source involves a power supply and a 10-mF capacitor. The power supply, a Keysight E36312A, delivers 7 mA at 3.3 V to the system. In this configuration, the target board draws more power than the power supply can provide, resulting in the target board remaining nonfunctional until the

 TABLE II

 CHECKPOINT DESIGNS OF NVRAM-BASED FILE SYSTEMS

|             | Metadata | Data blocks               |
|-------------|----------|---------------------------|
| iNVMFS [16] | WAL      | COW with in-place writing |
| BPFS [30]   | COW      | COW with atomic writes    |
| PMFS [31]   | RB       | COW on large updates      |

capacitor has charged sufficiently. *A* separate controller board <sup>679</sup> (powered by an adaptor) monitors the voltage of the capacitor <sup>680</sup> through its ADC. The capacitor discharges when the target <sup>681</sup> board is in operation. When the capacitor voltage drops below <sup>682</sup> 2.3 V, the controller turns off the transistor switch to detach <sup>683</sup> the capacitor from the target board for charging. This also <sup>684</sup> causes a blackout to the target board. If the capacitor voltage <sup>685</sup> raises to 3.3 V during charging, the controller turns on the <sup>686</sup> transistor switch to attach the capacitor to the target board, <sup>687</sup> and the target board is powered on. <sup>688</sup>

The experimental dataset is constructed based on the 689 database of real weather stations [29]. Each data record in 690 our dataset consists of a timestamp, a relative humidity, and 691 a temperature level. According to the database, we set the 692 relative humidity between 0% and 100% at a resolution of 693 0.1% and the temperature level between -20 °F and 100 °F at 694 a resolution of 0.1 °F. We use fixed-point numbers to represent 695 the humidity and temperature values. Based on the value 696 intervals, we utilize a uniform distribution to generate twenty 697 thousand records for one dataset and a normal distribution for 698 the same number of records in the other dataset. The quadtree 699 does not index sequential timestamps to avoid skewness. For 700 multidimensional queries, the timestamp of matched records is 701 inspected separately. During experiments, a dataset is entirely 702 written to the key-value store, and a checkpoint is committed 703 every 100 insertions. Checkpoints are restored when the target 704 recovers from asynchronous power events. 705

Our primary performance metric is the total execution time, <sup>706</sup> which includes not only the overhead of inserting the entire <sup>707</sup> dataset into the key–value store but also the latency contributed <sup>708</sup> by periodic checkpoint committing and restoring a checkpoint <sup>709</sup> in the event of power recovery. The shorter the total execution <sup>710</sup> time is, the lower the overheads of key–value operations and <sup>711</sup> checkpoint operations are. We also report the latency of each <sup>712</sup> type of operation for analysis. <sup>713</sup>

2) *Methods Under Evaluation:* While we are not aware 714 of directly comparable prior studies, we select the NVRAM-715 based file systems in Table II as evaluation baselines, with 716 necessary modifications for flash compatibility. 717

iNVMFS [16] employs WAL on file system metadata and 718 COW on file data blocks. As previously shown in Fig. 5(b), 719 post-checkpoint writes are appended to a redo log and later 720 written back to their destination addresses. The writing-back 721 procedure is idempotent on flash, allowing it to be safely 722 repeated if interrupted. We revise iNVMFS by 1) disabling 723 in-place overwriting of post-checkpoint data and 2) dropping 724 the COW feature as sensor data records are never modified. 725

BPFS [30] is based on COW, as shown in Fig. 5(c). BPFS 726 exploits NVRAM atomic writes to avoid path copying on 727 small updates. As atomic writing is not available in flash, 728



Fig. 12. Overall performance under different data distributions. (a) Total execution times. (b) Power event counts.

729 we revise BPFS by 1) logging small updates in a post-730 checkpoint node and 2) copying a pre-checkpoint node on its 731 first update. Path copying consumes free space at a high rate, 732 leaving many obsolete nodes in flash memory. After filling 733 up the current flash partition, the revised BPFS initiates copy-734 based compaction if the partition's space utilization is lower 735 than a threshold. The revised BPFS achieve a level of space 736 utilization comparable with other methods.

PMFS [31] uses rollback on metadata and COW on data
blocks. PMFS performs rollback through in-place overwriting
because NVRAM is erase-free, but it degrades into erase-based
undo for flash compatibility. Instead of modifying PMFS, we
compare iFKVS with erase-based undo against iFKVS with
discard undo (our main proposal) in Section V-E.

<sup>743</sup> We refer to our iFKVS as RB for its rollback-based design.
<sup>744</sup> Similarly, we denote the revised iNVMFS as WAL, and the
<sup>745</sup> revised BPFS as COW. All these methods share the quadtree
<sup>746</sup> structure and the partition-based space cleaning policy.

#### 747 B. Total Execution Time

Fig. 12(a) shows the total execution time of RB (equiv-748 749 alently, our iFKVS), WAL, and COW. The total execution 750 time involves the time to insert all the 20000 data records and the time to commit a checkpoint every 100 insertion 751 752 operations. In addition, the time also includes the overhead to restore a checkpoint on power recovery. Results show 753 754 that RB completes the workload much earlier than WAL and COW. Specifically, under the uniform data distribution, RB 755 finishes inserting all the records in 39 s, achieving significant 756 reductions of 45% and 84% compared with WAL and COW, 757 respectively. Results also show that all the methods have 758 longer execution times under the normal data distribution. This 759 760 is because with the normal distribution, keys in each dimension (humidity or temperature) are more concentrated. Popular tree 761 762 paths grow deeper than others, leading to key-value operations <sup>763</sup> along these paths taking more time to complete.

Among the factors that contribute to the total execution time, the insertion overhead is the primary factor, followed by the overhead of checkpoint commit. This is because insertion and commit are the two most frequent operations during the workload. Now, RB has the lightest overhead to insert a record, because it creates a write mark in the global undo log only on the first post-checkpoint modification to a node, to the node rate, or to the record area. Subsequent post-checkpoint writes proceed without writing extra information to the undo log. To



Fig. 13. Capacitor voltage with respect to workload progress with the uniform dataset. Notice that the *x*-axis denotes the *progress* rather than the wall-clock time.



Fig. 14. Time overhead of insertion operations.

commit a checkpoint, RB simply discards the contents in the 773 global undo log. In contrast, WAL must add a log record to 774 the redo log on every write. In addition, when committing a 775 checkpoint, for every log record, WAL must copy the written 776 data from the log record to the destination flash address. As for 777 COW, when an insertion operation goes to a pre-checkpoint 778 node, COW has to copy the node and all pre-checkpoint 779 nodes along the path toward the root node. In addition, when 780 the current partition is full, COW compacts the partition (in 781 an out-of-place manner) and commits a checkpoint. In other 782 words, the highest runtime overhead of COW results from 783 the wandering tree algorithm and the partition compaction 784 procedure.

Another factor that contributes to the total execution time 786 is checkpoint restoration. Fig. 12(b) shows the power event 787 counts. While RB experiences 6 times of power interruption, 788 WAL and COW encounter 12 and 40 times of power outage, 789 respectively, under the uniform data distribution. Furthermore, 790 Fig. 13 depicts the voltage of the capacitor throughout the 791 progress of the entire workload. RB pushes the progress much 792 charging cycles throughout the workload. In contrast, WAL 794 and COW progress slowly and the capacitor is recharged more 795 often. The frequent checkpoint restoration operations upon 796 device restarts further increase their total execution times. 797

# C. Key-Value Operation Overhead

In this section, we report the average, maximum, and 799 minimal latencies of key–value operations, including insertion 800 and query. Fig. 14 shows the insertion latency. The trend 801 in the insertion latencies appears highly consistent with the 802



Fig. 15. Time overhead of range query operations.

<sup>803</sup> trend in the total execution times, as insertion is the most <sup>804</sup> frequent operations in the workload. Insertion with a normal 805 distribution is slower than with a uniform distribution because 806 popular tree paths are relatively deeper and insertion on <sup>807</sup> these paths are slower. RB is the fastest on insertion thanks 808 to the log-and-mark design: it adds a write mark to the <sup>809</sup> global undo log to distinguish pre-checkpoint data and post-810 checkpoint data. WAL takes about three times as long as 811 RB to insert a record on average (1.1 ms versus 0.3 ms). This is because WAL amplifies the write cost by including 812 target flash address in each log record in addition to the 813 a 814 written data. Furthermore, when searching the tree index 815 during insertion, WAL examines the redo log for updates to a 816 node. Searching for updates takes extra time, even though we <sup>817</sup> have implemented a tiny hash table in the working memory aid the search. COW is the slowest on insertion, about nine 818 to 819 times slower than RB on average (2.9 ms versus 0.3 ms). 820 Notably, the maximum latency of COW is extremely high. 821 This is attributed to the high cost of copying a full path for 822 the wandering tree algorithm.

We evaluate the query performance upon the completion 823 824 of the workload because at this time, the key-value store has 825 been fully stressed and contains sufficiently many records for 826 query. Specifically, RB, WAL, and COW have about 4500 valid records in flash before our query test. We present results 827 828 of range queries (scan) rather than point queries (get) because 829 get is hardly useful to sensing applications. In addition, get 830 is a subset of scan. In this test, we specify a random range on each dimension and retrieve the first 256 records from 831 832 the key-value store. Fig. 15 shows that RB and COW are 833 comparable in terms of the query latency. In contrast, WAL <sup>834</sup> performs noticeably worse than the other two, because every 835 time when reading a tree node, WAL must examine the redo <sup>836</sup> log for any recent updates to the node. The queries are faster <sup>837</sup> under a normal distribution as they cover unpopular (shorter) 838 paths.

## 839 D. Checkpoint Operation Overhead

Fig. 16(a) shows the time overhead of checkpoint commit for RB, WAL, and COW. Notably, RB commits a checkpoint much faster than WAL and COW. The overhead to commit a checkpoint with RB includes 1) making a backup of the rogram context; 2) writing a commit record to the global undo log; and 3) erasing unused segments of the global undo log set (see Fig. 10). For the first item, based on DMA, the process set of backing up the 200 bytes of SRAM program context to



Fig. 16. Time overhead (a) to commit a checkpoint and (b) to restore a checkpoint. Data distribution is uniform.

flash memory accounts for less than 10% of the total commit 848 latency. For the third item, RB merely erases one segment 849 after three times of commit on average. This is because RB 850 slowly consumes the global undo log space by writing tiny 851 write marks to the log *only* on the first post-checkpoint write 852 to an area or node (see Fig. 9). 853

The commit overhead of WAL is much higher than that of RB. On every write, WAL adds a log record consisting of a target flash address and the written data, and therefore it must erase segments from its redo log more often. In addition, WAL must also write back all the pending log records in its redo log to complete checkpoint commit. Here, COW shows an extremely high overhead. COW basically switches the root to commit a checkpoint. However, due to its quick consumption of flash space through path copying, if the current partition is full and its space utilization for valid records is low, COW compacts the partition to free up space. We measured that COW improves the valid record count by about 1.7 times through compaction, making it comparable to RB and WAL.

Fig. 16(b) shows the time overhead to restore a checkpoint. 867 While commit is a periodic event, restore is conducted only 868 upon power recovery. To restore a checkpoint, RB loads the 869 backup of program context and undoes post-checkpoint flash 870 writes from flash. As detailed in Section IV-D, instead of 871 employing the expensive erase-based undo, RB uses discard 872 undo, which efficiently neutralizes all post-checkpoint data 873 through one-way bit flipping. In contrast, for WAL, all post- 874 checkpoint data is contained within its redo log, and therefore 875 on checkpoint restoration, WAL erases the redo log to discard 876 all post-the checkpoint writes. COW suffers from the highest 877 overhead to restore a checkpoint. This is because COW 878 consumes flash space at a high rate and, on checkpoint 879 restoration, a large amount of post-checkpoint data must be 880 erased from flash. 881

Fig. 17 shows the CDF of iFKVS (RB) operation latencies. <sup>882</sup> Insert and commit show a stable latency, with exceptions of <sup>883</sup> slow insertion due to node splitting and slow commit for <sup>884</sup> erasing log segments. The restore latency is proportional to the <sup>885</sup> amount of post-checkpoint data undone, and the scan latency <sup>886</sup> depends on the tree heights of the scanned records. <sup>887</sup>

Based on the technique in [32], we test the checkpoint <sup>888</sup> correctness by replaying the workload on iFKVS with and <sup>889</sup> without power events. Besides random blackouts, we inject <sup>890</sup> asynchronous power fails to checkpoint commit, checkpoint <sup>891</sup> restore, and record insertion. When done replaying the work-<sup>892</sup> load, we retrieve all records through the quadtree and hash <sup>893</sup>



Fig. 17. CDF of iFKVS (RB) op. latencies under uniform. (a) Insert. (b) Commit. (c) Restore. (d) Scan.



Fig. 18. Evaluating iFKVS (RB) with (a) lazy or regular node split and (b) discard undo or erase undo. Data distribution is uniform.

their contents. The quadtree and the undo log are not hashed
as they are affected by power events. The correctness of our
checkpoint operation is verified by the consistent hash values
observed both with and without power events.

## 898 E. Differential Evaluation

We evaluate our iFKVS design by turning on and off 899  $_{900}$  individual features. Fig. 18(a) shows the insertion latency with and without our lazy split method. With lazy split, when a node 901 <sup>902</sup> splits, the node retains all its data (pointers actually) without <sup>903</sup> migrating them to its new child nodes. Lazy split reduces the average insertion latency by about 20% and significantly 904  $_{905}$  lowers the maximum time length. Fig. 18(b) depicts the time 906 overheads to restore a checkpoint with the proposed discard <sup>907</sup> undo method and the erase-based undo method. While discard <sup>908</sup> undo neutralizes post-checkpoint data through bit flipping, <sup>909</sup> erase undo must undergo a copy-erase-copyback procedure to 910 erase post-checkpoint data from flash. Our results indicate that 911 restoring a checkpoint with discard undo requires only 26% 912 of the time compared with using erase undo.

## 913 F. Irregular Power Traces

We also evaluate our iFKVS using irregular power traces. To 914 <sup>915</sup> collect power traces, we place a 5.4-cm<sup>2</sup> solar panel alongside 916 A road, where passing pedestrians can obstruct the sunlight. We record the power output of the panel, and the power traces 917 918 are stored in the power supply for replay. A fragment of the 919 traces can be found in Fig. 2. Due to the ample solar power, we scale down the amplitude of the traces by  $0.1 \times$  to emulate 920 the power from a smaller solar panel. Fig. 19(a) shows the 921 <sup>922</sup> total execution times of RB and WAL, and RB completes the workload much earlier than WAL. COW is not included here 923 <sup>924</sup> because it suffers from stagnation during checkpoint recovery <sup>925</sup> and does not complete the workload. Fig. 19(b) depicts the



Fig. 19. (a) Total execution times and (b) capacitor voltage under irregular power traces. Data distribution is uniform.

TABLE III Energy Analysis Summary

|               | 7mA  | 6mA | 5mA | Flash-op only | iFKVS(FRAM) |
|---------------|------|-----|-----|---------------|-------------|
| Time (s)      | 38.9 | 51  | 97  | 19.9          | 11.7        |
| Charge cycles | 6    | 11  | 23  | 3             | 1           |

capacitor voltage, showing 1) unpredictable fluctuation in the 926 voltage level due to an unstable power input and 2) more 927 charging cycles for WAL throughout the workload due to its 928 high operational overheads. 929

## G. Flash Lifetime Analysis

Two major factors determine the flash lifespan: 1) the data 931 sampling (writing) rate and 2) the flash size. In Section V-B, 932 iFKVS completes 20000 record insertions in about 39 s and 933 performs 914 segment erases. This reflects a sampling rate of 934  $(20\,000/39 \text{ s})\approx 513 \text{ Hz}$ . As ambient conditions like temperature 935 and humidity do not change abruptly, we assume a practical 936 sampling rate of 10 Hz. Therefore, 20000 record insertions 937 require  $(20000/10) \times 1$  s = 2000 s. iFKVS rotates partitions <sub>938</sub> to ensure even erasure (see Section IV-E). Consider a 256-kB 939 flash memory, which has 512 segments. At a 10-Hz sampling 940 rate, evenly erasing all the 512 segments once requires 2000 s  $\times$  941  $(512/914) \approx 1120$  s. According to [33], flash endures  $10^5_{942}$ erases, so it takes 1120 s  $\times 10^5 \approx 3.6$  years to retire the flash 943 memory. This analysis does not involve the undo log; however, 944 its wear can be managed by adjusting the log length. 945

## H. Energy Analysis

We evaluate iFKVS using three power levels, 7 mA (the 947 default), 6 mA, and 5 mA, at 3.3 V. Table III reports the 948 total execution times and charge cycle counts for each power 949 level, and the observed duty cycles (active periods) for these 950 levels are 82%, 66%, and 49%, respectively. The reduction 951 in the power strength significantly amplifies the execution 952 overhead, highlighting the loss of post-checkpoint progress, 953 slow charging, and cost of checkpoint restoration. 954

Both the microprocessor and flash memory contribute to <sup>955</sup> energy consumption. We monitor how many flash reads, <sup>956</sup> writes, and erases are used for indexing and checkpointing, <sup>957</sup> and we replay these amounts of flash operations without the <sup>958</sup> iFKVS stack. The flash-only replay uses 19.9 s in 3 charge <sup>959</sup> cycles, while the full iFKVS stack uses 38.9 s in 6 cycles. <sup>960</sup> Therefore, storage-related flash operations contribute to about <sup>961</sup> half of the total energy usage. <sup>962</sup>

930

970

998

We also modify our flash-based iFKVS to run on an FRAMbased SoC MSP430FR5994. We observe that the FRAM-based iFKVS completes the workload in 11.7 s with 1 charge cycle. Here, FRAM demonstrates its power efficiency for both program execution and storage management. However, as discussed in Section II-B, flash is more cost-efficient, with its applicability depending on sustainable power consumption.

## VI. CONCLUSION

In the foreseeable future, near-data processing will be 971 972 crucial for energy-harvesting sensing applications. This work 973 aims to close the gap between intermittent computation and 974 flash-based key-value storage. We target on two issues. First, 975 checkpointing must involve not only the program context but 976 also the storage state to achieve global consistency. Second, writes after the most recent checkpoint must be undone from 977 978 flash memory for fast power recovery. We propose a global 979 undo log that guarantees the synchrony between the states 980 of the program and the storage. For checkpointing in flash, we propose a novel log-and-mark approach, which treats 981 982 everything as a log space, including the node area, the record 983 area, and each individual node. With this, post-checkpoint 984 data can be easily identified and undone from flash. Based 985 on a unique property of flash memory, one-way bit flipping, we propose replacing the erase-based flash undo procedure 986 with idempotent zero-filling writes. Our experiments show 987 988 that, compared to a WAL-based approach and a COW-based <sup>989</sup> method, our design achieves a significant reduction in the total <sup>990</sup> execution time by 45% and 84%, respectively.

As part of a smart building project, our system prototype is deployed by meeting room windows to optimize room reservation and air-conditioning plans using event-based queries on per illumination and temperature readings. Furthermore, because page-based NAND flash offers a higher storage density, we are thrusting toward an FRAM-NAND hybrid approach for a per cheaper, larger key-value store.

#### References

- 999 [1] D. Brunelli, C. Moser, L. Thiele, and L. Benini, "Design of a solar-harvesting circuit for batteryless embedded systems," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 11, pp. 2519–2528, Nov. 2009.
- M. Magno and D. Boyle, "Wearable energy harvesting: From body to battery," in *Proc. 12th Int. Conf. Design Technol. Integr. Syst. Nanoscale Era (DTIS)*, 2017, pp. 1–6.
- [3] J. Choi, H. Joe, Y. Kim, and C. Jung, "Achieving stagnation-free intermittent computation with boundary-free adaptive execution," in *Proc. IEEE Real-Time Embedd. Technol. Appl. Symp.*, 2019, pp. 331–344.
- K. Maeng and B. Lucia, "Supporting peripherals in intermittent systems with just-in-time checkpoints," in *Proc. 40th ACM SIGPLAN Conf. Program. Lang. Design Implement.*, 2019, pp. 1101–1116. [Online].
  Available: https://doi.org/10.1145/3314221.3314613
- K. Maeng, A. Colin, and B. Lucia, "Alpaca: Intermittent execution without checkpoints," *Proc. ACM Program. Lang.*, vol. 1, pp. 1–30, Oct. 2017.
- [6] S. Lin, D. Zeinalipour-Yazti, V. Kalogeraki, D. Gunopulos, and
  W. A. Najjar, "Efficient indexing data structures for flash-based sensor devices," *ACM Trans. Storage*, vol. 2, no. 4, pp. 468–503, 2006.
  [Online]. Available: https://doi.org/10.1145/1210596.1210601
- [7] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," *IEEE Internet Things J.*, vol. 3, no. 5, pp. 637–646, Oct. 2016.
- 1022 [8] M. A. Sahi et al., "Privacy preservation in e-Healthcare environments:
  1023 State of the art and future directions," *IEEE Access*, vol. 6, pp. 464–478, 2018.

- [9] Y. Diao, D. Ganesan, G. Mathur, and P. J. Shenoy, "Rethinking data 1025 management for storage-centric sensor networks," in *Proc. CIDR*, 2007, 1026 pp. 22–31. 1027
- [10] R. A. Finkel and J. L. Bentley, "Quad trees a data structure for retrieval 1028 on composite keys," *Acta Informatica*, vol. 4, pp. 1–9, Mar. 1974. 1029
- [11] J. Choi, H. Joe, and C. Jung, "CapOS: Capacitor error resilience for 1030 energy harvesting systems," *IEEE Trans. Comput.-Aided Design Integr.* 1031 *Circuits Syst.*, vol. 41, no. 11, pp. 4539–4550, Nov. 2022. 1032
- B. Ransford, J. Sorber, and K. Fu, "Mementos: System support for long- 1033 running computation on RFID-scale devices," in *Proc. 16th Int. Conf.* 1034 *Archit. Support Program. Lang. Oper. Syst.*, 2011, pp. 159–170.
- [13] K. Akhunov, E. Yildiz, and K. S. Yildirim, "Enabling efficient 1036 intermittent computing on brand new microcontrollers via tracking 1037 programmable voltage thresholds," in *Proc. 11th Int. Workshop Energy* 1038 *Harvest. Energy-Neutral Sens. Syst.*, 2023, pp. 16–22. 1039
- [14] M. Nardello, L. Caronti, and D. Brunelli, "Intermittent intelligent camera 1040 with LEO sensor-to-satellite connectivity," in *Proc. 11th Int. Workshop* 1041 *Energy Harvest. Energy-Neutral Sens. Syst.*, 2023, pp. 79–85. 1042
- [15] B. Mazumder and J. O. Hallstrom, "A fast, lightweight, and reli- 1043 able file system for wireless sensor networks," in *Proc. 13th* 1044 *Int. Conf. Embedd. Softw.*, 2016, pp. 1–10. [Online]. Available: 1045 https://doi.org/10.1145/2968478.2968486
- [16] Y.-J. Wu, C.-Y. Kuo, and L.-P. Chang, "iNVMFS: An efficient file 1047 system for NVRAM-based intermittent computing devices," *IEEE* 1048 *Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 41, no. 11, 1049 pp. 3638–3649, Nov. 2022.
- [17] K. Maeng and B. Lucia, "Adaptive dynamic checkpointing for safe 1051 efficient intermittent computing," in *Proc. 13th USENIX Symp. Oper.* 1052 *Syst. Design Implement. (OSDI)*, 2018, pp. 129–144. 1053
- [18] S. Liu, W. Zhang, M. Lv, Q. Chen, and N. Guan, "LATICS: A low- 1054 overhead adaptive task-based intermittent computing system," *IEEE* 1055 *Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 39, no. 11, 1056 pp. 3711–3723, Nov. 2020. 1057
- [19] A. Colin, E. Ruppel, and B. Lucia, "A reconfigurable energy stor- 1058 age architecture for energy-harvesting devices," in *Proc. 23rd Int.* 1059 *Conf. Archit. Support Program. Lang. Oper. Syst.*, 2018, pp. 767–781. 1060 [Online]. Available: https://doi.org/10.1145/3173162.3173210 1061
- [20] F. Fraternali, B. Balaji, Y. Agarwal, L. Benini, and R. Gupta, "Pible: 1062 Battery-free mote for perpetual indoor BLE applications," in *Proc. 5th* 1063 *Conf. Syst. Built Environ.*, 2018, pp. 168–171. 1064
- [21] G. Gobieski, B. Lucia, and N. Beckmann, "Intelligence beyond the 1065 edge: Inference on intermittent embedded systems," in *Proc. 24th Int.* 1066 *Conf. Archit. Support Program. Lang. Oper. Syst.*, 2019, pp. 199–213. 1067 [Online]. Available: https://doi.org/10.1145/3297858.3304011 1068
- [22] A. Montanari, M. Sharma, D. Jenkus, M. Alloulah, L. Qendro, 1069 and F. Kawsar, "ePerceptive: Energy reactive embedded intelligence 1070 for batteryless sensors," in *Proc. 18th Conf. Embedd. Netw. Sens.* 1071 *Syst.*, 2020, pp. 382–394. [Online]. Available: https://doi.org/10.1145/ 1072 3384419.3430782 1073
- [23] H. R. Mendis, C.-K. Kang, and P.-C. Hsiu, "Intermittent-aware neural 1074 architecture search," ACM Trans. Embedd. Comput. Syst., vol. 20, no. 5s, 1075 pp. 1–27, 2021. 1076
- [24] H. Dai, M. Neufeld, and R. Han, "ELF: An efficient log-structured flash 1077 file system for micro sensor nodes," in *Proc. 2nd Int. Conf. Embedd.* 1078 *Netw. Sens. Syst.*, 2004, pp. 176–187.
- [25] N. Tsiftes, A. Dunkels, Z. He, and T. Voigt, "Enabling large-scale storage 1080 in sensor networks with the coffee file system," in *Proc. Int. Conf. Inf.* 1081 *Process. Sens. Netw.*, 2009, pp. 349–360.
- [26] A. Fevgas and P. Bozanis, "Grid-file: Towards to a flash efficient multi- 1083 dimensional index," in *Proc. 26th Int. Conf. Data Manage. Cloud, Grid* 1084 *P2P Syst.*, 2015, pp. 285–294.
- [27] L.-P. Chang and C.-H. Hsu, "Soft lists: A native index structure for nor- 1086 flash-based embedded devices," in *Proc. Asia South Pac. Design Autom.* 1087 *Conf.*, 2009, pp. 799–804.
- [28] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1984, pp. 47–57. 1090
- [29] "Colorado agricultural meteorological network (CoAgMET) 1091 Jul. data set." Accessed: 1, 2023. [Online]. Available: 1092 https://coagmet.colostate.edu/ 1093
- [30] J. Condit et al., "Better I/O through byte-addressable, persistent 1094 memory," in Proc. ACM SIGOPS 22nd Symp. Oper. Syst. Princ., 2009, 1095 pp. 133–146. 1096
- [31] S. R. Dulloor et al., "System software for persistent memory," in Proc. 1097 9th Eur. Conf. Comput. Syst., 2014, pp. 1–15. 1098
- [32] J. Van Der Woude and M. Hicks, "Intermittent computation without 1099 hardware support or programmer intervention," in *Proc. 12th USENIX* 1100 *Symp. Oper. Syst. Design Implement. (OSDI)*, 2016, pp. 17–32. 1101
- [33] "MSP430 flash memory characteristics," Application note, Texas 1102 Instrum., Dallas, TX, USA, 2018. [Online]. Available: https://www.ti. 1103 com/lit/an/slaa334b/slaa334b.pdf