NEHASH: high-concurrency extendible hashing for non-volatile memory*


Tao CAI, Pengfei GAO, Dejiao NIU, Yueming MA, Tianle LEI, Jianfei DAI

School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Abstract: Extendible hashing is an effective way to manage increasingly large file system metadata, but it suffers from low concurrency and a lack of optimization for non-volatile memory (NVM). In this paper, a multilevel hash directory based on lazy expansion is designed to improve the concurrency and efficiency of extendible hashing, and a group-based hash bucket management algorithm is presented to improve the efficiency of hash key management by reducing the size of the hash bucket, thereby improving the performance of extendible hashing. Meanwhile, a hierarchical storage strategy of extendible hashing for NVM is given to take advantage of dynamic random access memory (DRAM) and NVM. Furthermore, on the basis of the device driver for Intel Optane DC Persistent Memory, the prototype of high-concurrency extendible hashing named NEHASH is implemented. The Yahoo cloud serving benchmark (YCSB) is used to test NEHASH and compare it with CCEH, level hashing, and cuckoo hashing. The results show that NEHASH can improve read throughput by up to 16.5% and write throughput by up to 19.3%.

Key words: Extendible hashing; Non-volatile memory (NVM); High concurrency

1 Introduction

While the file system stores and manages massive amounts of data, the amount of its own metadata keeps increasing, and the number of applications that must be served simultaneously is also growing. Efficient management and indexing of file metadata has therefore become an important factor affecting file system performance. Extendible hashing is a very effective data indexing method. Compared with B-trees and other tree-based indexing algorithms, it has the advantages of lower time overhead and relative stability. It has been applied in many file systems, such as the Zettabyte file system (ZFS), the general parallel file system (GPFS), the Google file system (GFS), and GFS2, and has become a hot spot of current research.

Current extendible hashing still has many limitations that affect performance. First, when extendible hashing is running and the data in a hash bucket become full, the table must be dynamically expanded. At this time, the entire hash directory is locked, its size is doubled, and the hash bucket also needs to be locked and split. This makes the entire extendible hash table inaccessible to other processes, causing all read and write accesses to be suspended. Second, a single hash bucket is used to store hash keys under each hash directory entry in extendible hashing. When writing a hash key, the corresponding hash bucket needs to be locked, so the hash bucket cannot be accessed by other processes. Third, hash keys are managed in an unordered way in the hash bucket. Although this reduces sorting and other overhead, the search time overhead is large, making the size of the hash bucket an important factor affecting extendible hashing performance. Finally, with the increase in the number of hash keys in extendible hashing, it becomes difficult to store all hash keys in dynamic random access memory (DRAM), which makes the input/output (I/O) performance of external memory devices an important factor affecting extendible hashing performance. Therefore, improving the concurrency and access performance of extendible hashing is an important issue in current extendible hashing research.

Non-volatile memory (NVM), such as phase change memory (PCM) (Li and Lam, 2011), shared transistor technology random access memory (STT-RAM) (Kuan and Adegbija, 2019), the latest Intel 3D XPoint technology (Dadmal et al., 2017), and the commercial Intel Optane DC Persistent Memory (Intel, 2019), offers read and write speeds close to those of DRAM together with non-volatility, which provides good support for improving the access speed of data stored in external memory devices. However, compared with DRAM, NVM still has limitations in terms of read and write speed and concurrency; at the same time, NVM has a read and write speed advantage over solid state drives (SSDs) and hard disk drives (HDDs), making access concurrency an important factor in data access performance. Therefore, how to exploit the characteristics of NVM, improve the concurrency of hash key management and access, and design new extendible hashing has become an urgent problem to be solved.

Therefore, for a hybrid storage architecture composed of DRAM and NVM, a high-concurrency extendible hashing scheme for NVM (NEHASH) is presented. By improving the management strategy for hash directories and hash buckets, the locking granularity in extendible hashing management is reduced to support high-concurrency access to extendible hashing.

The main contributions of our work can be summarized as follows:

1. A hash directory based on lazy expansion is designed to address the problem that all accesses pause while the extendible hashing directory is being expanded. A three-level hash directory is used instead of a single-level one, and the concurrency of extendible hashing is improved by reducing the granularity of locking to the single hash directory entry that causes the expansion. An expansion rate calculation algorithm is designed to dynamically determine the size of the two-time expansion hash directory and to distinguish differences in the number of hash keys and their growth rates across parts of the extendible hashing. In addition, while delaying the expansion of the main hash directory, the space waste of the extendible hashing directory can be reduced, and the access and management efficiency of the extendible hashing directory can be improved.

2. To improve the concurrency of accessing hash keys, the single hash bucket mounted under each existing extendible hashing directory entry is first decomposed into multiple buckets, a bucket directory is then added, and the way hash keys are managed in the hash bucket is changed. A group-based hash bucket management algorithm is designed to reduce the locking granularity when inserting hash keys, the split probability of hash buckets is reduced, and the concurrency of extendible hashing is improved.

3. In view of the frequent accesses to the hash directory and its large number of levels, a hierarchical storage strategy is given, which stores the hash directory in DRAM and the bucket groups in NVM. It can take advantage of the respective strengths of DRAM and NVM to meet the management requirements of massive metadata in the file system. In addition, a hash directory recovery algorithm is designed. After the system restarts, it relies on the hash buckets and other information in NVM to reconstruct the hash directory, ensuring the reliability of the extendible hashing.

4. On the basis of Intel Optane DC Persistent Memory, its device driver is modified to implement the high-concurrency extendible hashing prototype NEHASH embedded in the device driver. Using the Yahoo cloud serving benchmark (YCSB) test tool for testing and analysis, the results show that NEHASH has higher concurrency than existing hashing methods, such as CCEH, level hashing, and cuckoo hashing. In a multithreaded environment, the read throughput can be increased by 16.5%, and the write throughput by 19.3%.

2 Related works

At present, much research has been carried out on extendible hashing, NVM storage devices, and hashing for NVM. The main related work is as follows.

2.1 Extendible hashing

A hash algorithm is a strategy that trades storage space for time and enables fast storage and lookup. Static hashing is based on an array, and its bucket capacity, which is the size of the array, is fixed when the table is defined. However, for most databases, it is difficult to predict the number of data buckets required at the beginning, and the database generally grows dynamically. In this case, the static hashing index structure will frequently be expanded and rehashed as data increase, which seriously affects the performance of the database. Dynamic hashing is an upgrade of static hashing. It can grow automatically, allowing the number of buckets to change dynamically to accommodate the growth and shrinkage of the database. Extendible hashing (Fagin et al., 1979) is a kind of dynamic hashing, which classifies keys by the high (or low) bits produced by the hash function, similar to bucket sort in data structures. Extendible hashing consists of multiple buckets, each storing a fixed amount of data, and a resizable array (called a directory), in which each hash directory entry stores a pointer to a bucket. Wang and Wang (2010) proposed a new extendible hashing index for flash-based database management systems (DBMSs) and added a split-or-merge (SM) factor to make it adaptive; analytical and experimental results show that their design minimizes the cost of index maintenance and that the SM factor makes it work efficiently at different write/update ratios. Hu et al. (2021) presented a lock-free parallel multipartition extendible hashing (PMEH) scheme, which eliminates lock contention overhead, reduces data movement during rehashing, and guarantees data consistency. RACE (Zuo et al., 2022) is a one-sided remote direct memory access (RDMA) conscious extendible hashing index with lock-free remote concurrency control and efficient remote resizing; RACE hashing enables all indexing operations to be performed efficiently by using only one-sided RDMA verbs without involving any computational resource in the memory pool. Dash (Lu et al., 2020) is a scalable extendible hashing scheme for persistent memory (PM) whose bucket size is 256 bytes (the size of a data center persistent memory module (DCPMM) cache line). It maintains a 1-byte fingerprint for each slot to reduce unnecessary bucket scans and accelerate searches. For concurrency control, Dash employs optimistic bucket-level locking, implemented with compare-and-swap instructions and a version number.

The above research on extendible hashing cannot adapt well to NVM storage devices and cannot support high system concurrency under multithreading.

2.2 NVM storage devices

NVM storage devices can form hybrid memory with DRAM or hybrid external storage with HDDs and SSDs, thereby effectively improving the performance of data storage and access. Hibachi is a hybrid NVM and DRAM buffering strategy for storage arrays (Fan et al., 2017). Hibachi handles read cache and write cache hits separately, improves the cache hit rate, and dynamically adjusts the sizes of the clean and dirty caches to adapt to changes in workload; random writes are converted to sequential writes to improve write throughput, and read performance is optimized based on workload characteristics. FlatStore is a new key-value (KV) storage engine designed for NVM and DRAM hybrid memory (Chen YM et al., 2020). It uses NVM to achieve reliable storage of logs, uses a per-core OpLog to improve throughput, and reduces access latency using a pipelined batching mechanism. HasFS is a file system for DRAM, NVM, and disk hybrid storage systems (Liu et al., 2020). It uses NVM storage devices to expand the main memory capacity, builds a persistent page cache to eliminate the overhead of regularly writing dirty data back to disk, and designs a hybrid memory consistency mechanism that reduces the time and space overhead of protecting metadata and data. Ziggurat is a layered file system for NVM and disk (Zheng, 2019) that stores written data in NVM, DRAM, or disk depending on the application's access patterns, the write size, and the possibility of pausing until the write operation completes, combining the performance of NVM with the capacity advantage of disk; at the same time, by estimating the heat of file data, cold data are migrated from NVM to disk, and the write bandwidth of the disk is used effectively by merging data blocks. Strata is an NVM hybrid storage system (Kwon et al., 2017) that uses the byte-addressing feature of NVM to merge logs and reduce write amplification when writing to SSD and disk, and uses NVM for file allocation and unidirectional migration of data to SSD and disk. vNVML is a user-mode function library (Chou et al., 2020) that can combine NVM with SSD and disk to build a large-capacity and high-speed storage system; it effectively guarantees the write order and reliability of data and uses DRAM to build a read cache that reduces the number of writes to NVM. TridentFS (Huang and Chang, 2016) stores hot data in PM to reduce slow flash memory I/O accesses, while NVMRA (Chen RH et al., 2016) uses PM to improve the random write performance of flash memory.

None of the above studies on NVM storage devices are optimized in combination with extendible hashing.

2.3 Hashing for NVM

In recent years, a variety of NVM-based hash index structures have been proposed. PFHT (Debnath et al., 2015) is a PCM-friendly variant of cuckoo hashing that allows only one cuckoo displacement to avoid cascading writes, thus reducing write accesses to PCM. To improve the insert performance of cuckoo hashing, PFHT uses a hidden block to defer full-table rehashing and increase the load factor. Path hashing (Zuo and Hua, 2017) is a write-friendly hashing scheme that logically organizes storage units in the form of an inverted binary tree and uses position sharing to resolve hash collisions. Level hashing (Zuo et al., 2018) is a write-optimized, high-performance hash index scheme that proposes a shared two-level hash table, where the top level is addressable and the bottom level is used to handle hash collisions. It achieves constant-scale search, insert, delete, and update time complexity in the worst case and generates few extra NVM writes. To ensure low-overhead consistency, level hashing uses a log-free consistency scheme for insert, delete, and resize operations and an opportunistic log-free scheme for update operations. To resize the hash table cost-effectively, level hashing uses an in-place resizing scheme that needs to rehash only 1/3 of the buckets instead of the entire table, thus greatly reducing the number of rehashed buckets and improving resize performance. Evaluation results show that level hashing is significantly better than PFHT and path hashing.

The above three hash index structures are all based on static hashing. Their common problem is that rehashing the entire hash table requires considerable computing resources and storage overhead, which causes large delays and is relatively expensive.

CCEH (Nam et al., 2019) is a variant of extendible hashing that dynamically splits and merges hash buckets as needed to overcome the disadvantage of full-table rehashing. By introducing a three-level structure with an intermediate segment layer, the sizes of the directory and the bucket can be reduced to achieve constant-level search, that is, only two cache line accesses. Atomic write optimization delays delete operations to reduce I/O overhead. HMEH (Zou et al., 2020) is an optimization based on CCEH, which proposes a hybrid RAM-NVM write-optimal and high-performance extendible hashing scheme, where KV items persist in NVM while directories are placed in DRAM for faster access. To rebuild the directory, HMEH also maintains a radix-tree-structured catalog in NVM with negligible overhead. Furthermore, HMEH proposes a cross-KV strategy that naturally evicts write-back items, which ensures data consistency without the performance degradation caused by persistence barriers.

Both CCEH and HMEH are implemented based on extendible hashing, which solves the problem of full-table rehashing well. However, in the multithreaded case, the frequency of hash directory expansion increases, and while the hash directory is being expanded, all other threads cannot access the table, resulting in reduced concurrency.

3 Lazy expansion strategy for hash directory

In existing extendible hashing, when the hash directory is expanded, the entire hash table needs to be locked, which seriously affects the concurrency of extendible hashing. We modify the structure of the hash directory in extendible hashing and design a lazy expansion algorithm for the hash directory to improve the concurrency of extendible hashing.

The original single-level hash directory in extendible hashing is extended to build a three-level extendible hashing directory, i.e., the main hash directory, the one-time expansion hash directory, and the two-time expansion hash directory. As shown in Fig. 1, the 0th, 2nd, and 6th entries in the main hash directory have one-time expansion hash directories, and the 0th and 12th entries in the one-time expansion hash directories have two-time expansion hash directories. The structure of the two-time expansion hash directory is the same as the existing extendible hashing directory structure, and each hash directory entry is represented by HL(HL_value, point), where HL_value is the hash directory value and point is a pointer to a hash bucket. The structures of the main hash directory and the one-time expansion hash directory are modified: each hash directory entry is represented by HL_E(HL_value, time_value, point), where time_value is a time-related value and point is a pointer to the next-level expansion hash directory or a hash bucket. On extendible hashing initialization, only the main hash directory is created and initialized.

Fig. 1 Structure of the extendible hashing directory for high-concurrency extendible hashing
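As a minimal C sketch of the two directory-entry forms just described; the field types and names are illustrative assumptions, not the paper's actual definitions:

```c
#include <stdint.h>

/* Two-time expansion directory entry: the classic extendible hashing
 * form HL(HL_value, point). */
struct hl_entry {
    uint64_t hl_value;   /* hash directory value (prefix bits of the hash key) */
    void    *point;      /* pointer to a hash bucket / bucket group */
};

/* Main and one-time expansion directory entry HL_E(HL_value, time_value, point).
 * time_value holds the timestamps and differences used by the expansion rate
 * calculation algorithm described in this section. */
struct hl_e_entry {
    uint64_t hl_value;   /* hash directory value */
    uint64_t time_value; /* t_{0-i} at first insert, later overwritten by Δt_i */
    void    *point;      /* next-level expansion directory or hash bucket */
};
```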

On this basis, a lazy expansion algorithm for hash directories is given. First, when the main hash directory needs to be expanded, it is not expanded immediately; instead, a one-time expansion hash directory is added under the hash directory entry that needs to be extended, and each one-time expansion hash directory contains two hash directory entries in the HL_E form. The other parts of the main hash directory are unchanged. Second, when the one-time expansion hash directory is full and needs to be expanded, the expansion of the main hash directory is delayed again through the two-time expansion hash directory. Spread_rate is defined to represent the expansion rate of the two-time expansion hash directory, which corresponds to the one-time expansion hash directory entry that needs to be extended. On this basis, the rules for constructing the two-time expansion hash directory are as follows:

Rule 1: If the value of Spread_rate is 1, the extended two-time expansion hash directory contains four hash directory entries in the HL form.

Rule 2: If the value of Spread_rate is 0, the extended two-time expansion hash directory contains two hash directory entries in the HL form.

When building a two-time expansion hash directory, Spread_rate is used to dynamically adjust the size of the generated two-time expansion hash directory, adapting to the uneven distribution and growth rate of hash values under each hash directory entry in the extendible hashing. The expansion rate calculation algorithm is used to obtain the corresponding Spread_rate value when constructing each two-time expansion hash directory. The main steps of the expansion rate calculation algorithm are as follows:

Step 1: When a hash value is first stored under the i-th hash directory entry in the main hash directory, save the system time to the time_value field of the main hash directory entry HL_E, recorded as t_{0-i};

Step 2: When the i-th hash directory entry in the main hash directory needs to be extended to build a one-time expansion hash directory, compare the system time with t_{0-i}, save the difference between the two times into the time_value field of the main hash directory entry HL_E, overwriting the previous t_{0-i}; this difference is recorded as Δt_i;

Step 3: In the one-time expansion hash directory corresponding to the i-th hash directory entry in the main hash directory, when a hash value is first stored under the j-th hash directory entry, save the system time into the time_value field of the one-time expansion hash directory entry HL_E, recorded as t_{1-i-j};

Step 4: When the j-th hash directory entry in the one-time expansion hash directory needs to be expanded to build the two-time expansion hash directory, compare the system time with t_{1-i-j} and record the difference between the two times as Δt_{i-j};

Step 5: Calculate the value of Spread_rate as follows:

When Δt_i ≥ Δt_{i-j}, that is, when Spread_rate = 0, the time from when the main hash directory entry starts storing hash values until it is full and a one-time expansion hash directory is created is greater than or equal to the time from when the one-time expansion hash directory starts storing hash values until it is full and a two-time expansion hash directory is created. At this point, the number of hash keys belonging to this hash directory entry is considered to be decreasing; to reduce the space waste of the hash directory, the size of the two-time expansion hash directory is set to 2. Similarly, when Δt_i < Δt_{i-j}, that is, when Spread_rate = 1, the time from when the main hash directory entry starts storing hash values until it is full and a one-time expansion hash directory is created is less than the time from when the one-time expansion hash directory starts storing hash values until it is full and a two-time expansion hash directory is created. At this point, the number of hash keys belonging to this hash directory entry is considered to be increasing; to accommodate this growth while limiting further expansions, the size of the two-time expansion hash directory is set to 4.
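A minimal C sketch of this decision, assuming the time_value field holds nanosecond timestamps (the time source and units are not specified in the paper):

```c
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds (illustrative time source). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* delta_i : Δt_i stored in the main directory entry (step 2).
 * t1_ij   : t_{1-i-j} stored when the one-time entry first received a key (step 3).
 * Returns Spread_rate (step 5): 0 if Δt_i >= Δt_{i-j}, 1 otherwise. */
static int spread_rate(uint64_t delta_i, uint64_t t1_ij)
{
    uint64_t delta_ij = now_ns() - t1_ij;   /* Δt_{i-j}, step 4 */
    return (delta_i >= delta_ij) ? 0 : 1;
}

/* Rules 1 and 2: 4 entries when Spread_rate is 1, 2 entries when it is 0. */
static unsigned two_time_dir_entries(int rate)
{
    return rate ? 4u : 2u;
}
```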

When the two-time expansion hash directory is full, the entire hash directory needs to be expanded. At this point, only the main hash directory needs to be expanded, and the number of bits of HL_value in the main hash directory entries becomes the maximum number of bits among all HL_values in the original two-time expansion hash directories. In addition, the bucket groups originally pointed to by the one-time and two-time expansion hash directories are pointed to by the corresponding hash directory entries in the new main hash directory.

By constructing a three-level hash directory, the global expansion of the hash directory can be delayed. When building the one-time and two-time expansion hash directories, only the entries that cause expansion in the main hash directory or the one-time expansion hash directory need to be locked, which avoids locking the entire hash directory; access to other parts of the extendible hashing is therefore not affected, and the concurrency of the extendible hashing is improved. At the same time, the expansion rate calculation algorithm can perceive the differences in the number of hash values and their growth rates under different directory entries and adaptively adjust the size of the two-time expansion hash directory, which reduces the space waste of the extendible hashing directory and improves the access and management efficiency of the extendible hashing directory.

4 Hash bucket management based on group

First, the approach of using only a single hash bucket to store the hash keys under each hash directory entry is changed, and a bucket group is built to store hash keys. As shown in Fig. 2, each bucket group contains multiple hash buckets, a conflicting hash bucket, and a bucket directory. Each hash bucket is used to store hash values and has a size of 256 B, which matches the read and write characteristics of NVM.

Fig. 2 Hash value organization based on groups
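As a rough C sketch of the bucket-group layout in Fig. 2; the slot layout (16 slots of an 8 B key and an 8 B value per 256 B bucket), the empty-key sentinel, and the lock placement are assumptions chosen only for illustration:

```c
#include <stdint.h>
#include <stdatomic.h>

#define BUCKET_SIZE   256u                 /* matches the NVM access granularity */
#define SLOTS_PER_BKT 16u                  /* assumed: 16 slots of 8 B key + 8 B value */
#define EMPTY_KEY     0ULL                 /* assumed sentinel for an empty slot */

struct slot {
    uint64_t key;
    uint64_t value;
};

struct bucket {                            /* one 256 B hash bucket, stored in NVM */
    struct slot slots[SLOTS_PER_BKT];
};

struct bucket_group {
    uint32_t       dir_bits;               /* number of bucket-directory bits */
    struct bucket *buckets;                /* 2^dir_bits ordinary hash buckets */
    struct bucket  overflow;               /* conflicting hash bucket (unsorted) */
    atomic_uint    overflow_used;          /* occupied slots in the conflicting bucket */
    atomic_flag   *slot_locks;             /* one lock per slot; kept in DRAM */
};
```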

At the same time, the way hash keys are stored out of order in the hash bucket in existing extendible hashing is changed: in the hash bucket, the storage location is determined by the last bits of the hash key. To improve the concurrency of accesses within a bucket, we use slot-level locks that lock only the individual slots in the bucket.

When a hash conflict occurs, the conflicting record is inserted directly into the conflicting hash bucket of the bucket group. Because the slots in each bucket are numbered, there is no need for linear probing, which avoids the overhead of additional accesses to NVM. The size of the conflicting hash bucket is also 256 B, the NVM access granularity; it is used to store conflicting hash keys and to reduce the read and write amplification problem. The records in the conflicting hash bucket are not sorted and are queried by traversal.

The role of each bit of the hash key in extendible hashing management is shown in Fig. 3. The highest (red) bits correspond to the extendible hashing directory, the lowest (purple) bits are the index bits within the hash bucket, and the (green) bits correspond to the bucket directory. The extendible hashing directory bits and the bucket directory bits can be dynamically expanded inwards.

Fig. 3 The role of bits in the hash key
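The bit roles in Fig. 3 can be made concrete with a few shift-and-mask helpers; the widths and the exact position of the bucket-directory bits below are assumptions for illustration (Section 6 mentions the last 4 bits as the in-bucket index):

```c
#include <stdint.h>

#define SLOT_BITS 4u    /* lowest bits: slot index inside a hash bucket (assumed 4) */

/* Highest bits: extendible hashing directory index (global_depth > 0 assumed). */
static inline uint64_t dir_index(uint64_t key, unsigned global_depth)
{
    return key >> (64u - global_depth);
}

/* Assumed position: bucket-directory bits taken just above the slot bits. */
static inline uint64_t bucket_dir_index(uint64_t key, unsigned bd_bits)
{
    return (key >> SLOT_BITS) & ((1ULL << bd_bits) - 1u);
}

/* Lowest bits: index of the slot inside the bucket. */
static inline uint64_t slot_index(uint64_t key)
{
    return key & ((1ULL << SLOT_BITS) - 1u);
}
```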

The pseudocode for inserting records in NEHASH is given in Algorithm 1.
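Algorithm 1 itself is not reproduced in this text. The following is only a hypothetical sketch of the insert path within one bucket group, reusing the structures and bit helpers sketched above; the directory walk, bucket group splitting, and NVM persistence are omitted:

```c
/* Hypothetical insert into a bucket group; not the authors' Algorithm 1. */
static int group_insert(struct bucket_group *g, uint64_t key, uint64_t value)
{
    uint64_t b_idx = bucket_dir_index(key, g->dir_bits);
    struct bucket *b = &g->buckets[b_idx];
    unsigned s = (unsigned)slot_index(key);
    atomic_flag *lk = &g->slot_locks[b_idx * SLOTS_PER_BKT + s];

    while (atomic_flag_test_and_set(lk))            /* slot-level lock only */
        ;
    if (b->slots[s].key == EMPTY_KEY || b->slots[s].key == key) {
        b->slots[s].key = key;                      /* slot fixed by the key's low bits */
        b->slots[s].value = value;
        atomic_flag_clear(lk);
        return 0;
    }
    atomic_flag_clear(lk);

    /* Hash conflict: append to the unsorted conflicting hash bucket. */
    unsigned idx = atomic_fetch_add(&g->overflow_used, 1u);
    if (idx < SLOTS_PER_BKT) {
        g->overflow.slots[idx].key = key;
        g->overflow.slots[idx].value = value;
        return 0;
    }
    return -1;                                      /* group full: split the bucket group */
}
```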

The pseudocode for searching records in NEHASH is given in Algorithm 2.
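Correspondingly, a hypothetical sketch of the search path (not the authors' Algorithm 2): the slot index locates the record directly, and the conflicting hash bucket is traversed only on a miss.

```c
/* Hypothetical lookup in a bucket group; not the authors' Algorithm 2. */
static int group_search(struct bucket_group *g, uint64_t key, uint64_t *value)
{
    const struct bucket *b = &g->buckets[bucket_dir_index(key, g->dir_bits)];
    unsigned s = (unsigned)slot_index(key);

    if (b->slots[s].key == key) {                   /* direct hit via the slot index */
        *value = b->slots[s].value;
        return 0;
    }

    unsigned used = atomic_load(&g->overflow_used); /* unsorted conflicting bucket */
    if (used > SLOTS_PER_BKT)
        used = SLOTS_PER_BKT;
    for (unsigned i = 0; i < used; i++) {           /* traverse only on a miss */
        if (g->overflow.slots[i].key == key) {
            *value = g->overflow.slots[i].value;
            return 0;
        }
    }
    return -1;                                      /* not found */
}
```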

In the group-based hash bucket management algorithm, multiple 256 B hash buckets that conform to the NVM read-write characteristics are used instead of a single hash bucket, and a bucket directory is added to build a bucket group on top of the hash buckets. The concurrency of extendible hashing is improved by reducing the scope of locking when writing hash keys. Using the last bits of the hash key to determine the storage location in the hash bucket can effectively improve the efficiency of search and management, and a lock on a single hash record can replace the lock on the whole hash bucket, which improves the concurrency of extendible hashing.

5 Hierarchical storage strategy

In Section 3, a hash directory based on lazy expansion is used, and a three-level hash directory replaces the original single-level hash directory, which improves the concurrency of the hash directory but also increases the time overhead of accessing it. Although NVM storage devices have high read and write speeds, there is still a gap compared with DRAM. Simply using NVM storage devices to store all data in NEHASH cannot guarantee high access performance.

A hierarchical storage strategy is therefore given, and NEHASH's hash directory and bucket groups are distributed across DRAM and NVM storage devices. As shown in Fig. 4, NEHASH's main hash directory, one-time expansion hash directory, and two-time expansion hash directory are stored in DRAM, while all bucket groups are stored in NVM storage devices.

Fig. 4 Hierarchical storage strategy of NEHASH
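A minimal sketch of this placement split, assuming a user-space mapping through the PMDK libpmem API and a DAX file path; the actual prototype in this paper is embedded in the PMEM device driver rather than implemented in user space, so the names and path below are illustrative only:

```c
#include <stdlib.h>
#include <libpmem.h>

/* DRAM side: the three-level hash directory (hl_e_entry as sketched earlier). */
static struct hl_e_entry *main_dir;
/* NVM side: a persistent-memory region that backs all bucket groups. */
static void  *nvm_base;
static size_t nvm_len;

int nehash_open(size_t main_dir_entries, const char *pmem_path, size_t pool_size)
{
    int is_pmem;

    main_dir = calloc(main_dir_entries, sizeof(*main_dir));         /* DRAM */
    nvm_base = pmem_map_file(pmem_path, pool_size, PMEM_FILE_CREATE,
                             0600, &nvm_len, &is_pmem);              /* NVM (DAX) */
    return (main_dir != NULL && nvm_base != NULL) ? 0 : -1;
}
```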

The hash directory stored in DRAM will be lost after the system restarts, so some information about the hash directory must be saved in the NVM storage device so that the hash directory can be rebuilt after a restart. Recovery_BG(BG_Id, D_Length, BG_Address) is used to represent the information required for each bucket group during recovery: BG_Id is the bucket group identifier, D_Length is the length of the hash directory corresponding to the bucket group, and BG_Address is the address of the bucket group in NVM. The Recovery_BG records corresponding to all bucket groups are organized into Rec_BG_TABLE in the form of a linear table and stored in the NVM storage device.
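In C, the Recovery_BG record and the Rec_BG_TABLE could be declared roughly as follows; the field widths are assumptions:

```c
#include <stdint.h>

/* Recovery_BG(BG_Id, D_Length, BG_Address): per-bucket-group recovery
 * information persisted in NVM. */
struct recovery_bg {
    uint64_t bg_id;       /* bucket group identifier */
    uint32_t d_length;    /* length (in bits) of the directory entry for this group */
    uint64_t bg_address;  /* address/offset of the bucket group in NVM */
};

/* Rec_BG_TABLE: all Recovery_BG records organized as a linear table in NVM. */
struct rec_bg_table {
    uint64_t           count;
    struct recovery_bg entries[];   /* flexible array member */
};
```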

When the system restarts, the rebuilding process of the NEHASH hash directory is as follows:

Step 1: Access the Rec_BG_TABLE stored in the NVM storage device and read all Recovery_BG.

Step 2: Compare the size of D_Length in all Recovery_BG, find the maximum value, and record it as D_L_Max.

Step 3: Build the main hash directory with 2^D_L_Max as its length.

Step 4: Compare the D_Length value of each Recovery_BG in Rec_BG_TABLE in turn; if it is equal to D_L_Max, go to step 5; otherwise, go to step 6.

Step 5: Access the corresponding bucket group according to the BG_Address of Recovery_BG, obtain the value of the leading D_Length bits of the hash values in the bucket group, and connect it with the corresponding directory entry in the main hash directory.

Step 6: Split the bucket group pointed to by BG_Address, construct the corresponding bucket groups such that the length of the corresponding directory entry is D_L_Max bits, and connect the resulting bucket groups with the corresponding directory entries in the main hash directory.
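A compressed C sketch of steps 1-6, building on the hl_e_entry and rec_bg_table structures sketched earlier; group_prefix() and split_and_attach() are hypothetical helpers standing in for step 5's prefix extraction and step 6's bucket group split:

```c
#include <stdint.h>
#include <stdlib.h>

uint64_t group_prefix(uint64_t bg_address, uint32_t d_l_max);   /* step 5 helper (hypothetical) */
void split_and_attach(struct hl_e_entry *dir, const struct recovery_bg *r,
                      uint32_t d_l_max);                         /* step 6 helper (hypothetical) */

/* Hypothetical rebuild of the main hash directory after a restart. */
struct hl_e_entry *rebuild_main_dir(const struct rec_bg_table *tab)
{
    uint32_t d_l_max = 0;
    for (uint64_t i = 0; i < tab->count; i++)        /* steps 1-2: find max D_Length */
        if (tab->entries[i].d_length > d_l_max)
            d_l_max = tab->entries[i].d_length;

    size_t dir_len = (size_t)1 << d_l_max;           /* step 3: 2^D_L_Max entries */
    struct hl_e_entry *dir = calloc(dir_len, sizeof(*dir));

    for (uint64_t i = 0; i < tab->count; i++) {      /* step 4: compare D_Length */
        const struct recovery_bg *r = &tab->entries[i];
        if (r->d_length == d_l_max)                  /* step 5: reattach directly */
            dir[group_prefix(r->bg_address, d_l_max)].point =
                (void *)(uintptr_t)r->bg_address;
        else                                         /* step 6: split, then attach */
            split_and_attach(dir, r, d_l_max);
    }
    return dir;
}
```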

For the situation in Fig. 4, the hash directory recovered after the system restarts is shown in Fig. 5.

Fig. 5 Recovered hash directory

Using the hierarchical storage strategy, the respective advantages of DRAM and NVM storage devices can be exploited, and the access and management efficiency of extendible hashing can be improved by distributing the hash directories and bucket groups between them. At the same time, after the hash directory is lost, it can be restored from the relevant information stored in the NVM storage device, thereby effectively ensuring the reliability of NEHASH.

6 Prototype and analysis

We modify the hash directory and hash bucket on top of basic extendible hashing. To reduce the extra time overhead of switching between kernel mode and user mode when accessing data, the prototype system of NEHASH is implemented in the Intel Optane DC Persistent Memory driver PMEM, and new system calls are added so that user-mode programs can access it.

We use YCSB, which is commonly used in the database field, as the test tool, with 10 million random integers as the workload. Since the failure-atomic unit of NVM is 8 bytes (Oukid et al., 2016), both the key size and the value size are set to 8 bytes to ensure the atomicity of the data. Workload a is a read-write balanced type, with 50% reads and 50% writes; workload b is read-mostly, with 95% reads and 5% writes; workload g is write-mostly, with 5% reads and 95% writes. Meanwhile, we compare NEHASH with existing hashing schemes, i.e., CCEH, level hashing, and cuckoo hashing (Pagh and Rodler, 2004) (denoted CCEH, LEVL, and Cuckoo). The test environment is built with one server, and the configuration of the server is shown in Table 1.

Table 1 Configuration of the test environment

6.1 Concurrent I/O performance

To test the concurrent access performance of NEHASH, we use 10 million random integers as the workload and use the YCSB tool to test the average throughput of NEHASH with 1, 2, 4, 8, and 16 threads.

Fig. 6 presents the average throughput under workload a with a mix of reads and writes. When the number of threads is relatively low, CCEH shows relatively good average throughput, because CCEH's three-level structure with cacheline-sized buckets reduces cacheline accesses. However, when the number of threads is relatively high, especially at 16 threads, the average throughput of NEHASH is 13.8% higher than that of CCEH. This is because NEHASH uses a hash directory based on lazy expansion, which delays the global expansion of the hash directory and avoids frequently locking the entire hash directory, so it can effectively improve concurrency in a multithreaded environment. LEVL and Cuckoo show relatively poor average throughput, which does not increase with the number of threads. This is because LEVL and Cuckoo are based on static hashing, and rehashing of the static hash table leads to high latency, blocking all insert and search operations in other threads.

Fig. 6 Average throughput under workload a

Fig. 7 shows the average throughput under the read-mostly workload b. When the number of threads is less than 2, the average throughput of NEHASH is slightly lower than that of CCEH, because CCEH has cacheline-sized buckets that need only two cacheline accesses to read the data inside a bucket, whereas NEHASH needs to access the multilevel hash directory and the bucket directory when accessing data in a bucket group, which leads to relatively high access latency. When the number of threads is greater than 4, the average throughput of NEHASH begins to increase significantly, especially from 8 to 16 threads, where it shows a clear improvement and is 2.71 times that of LEVL and 3.27 times that of Cuckoo. When the number of threads is 16, the average throughput of NEHASH is 16.5% higher than that of CCEH. This is due to the group-based hash bucket management algorithm in NEHASH: through the bucket directory index and the in-bucket index formed by the last 4 bits of the hash key, the data in a bucket can be located quickly, improving the average read throughput in a high-concurrency environment.

Fig. 7 Average throughput under workload b

Fig. 8 shows the average throughput under the write-mostly workload g. As seen from Fig. 8, the average throughputs of NEHASH and CCEH are significantly higher than those of LEVL and Cuckoo. This is because NEHASH and CCEH are based on extendible hashing, so the hash table can be dynamically expanded as the data grow. When the number of threads is less than 2, the average throughput of NEHASH is lower than that of CCEH. This is because the concurrency is not high with few threads, and when NEHASH inserts data it must first build structures such as bucket groups and bucket directories, which lowers the average throughput; CCEH, in contrast, avoids expensive copy-on-write operations and reduces split overhead through dynamic segment splits and lazy deletion. When the number of threads reaches 4, the average throughput of NEHASH begins to exceed that of CCEH. When the number of threads reaches 16, the average throughput of NEHASH still maintains an upward trend, while that of CCEH shows no obvious upward trend and begins to level off, and the average throughput of NEHASH is 19.3% higher than that of CCEH. This is because, when a large amount of data is written concurrently by 16 threads, NEHASH reduces the range of locks when writing hash values, which improves the performance of extendible hashing and yields relatively high throughput in a 16-thread environment.

Fig. 8 Average throughput under workload g

6.2 Changing the size of the bucket directory

To determine the proper number of bits in the bucket directory in NEHASH, we perform the following tests. First, we change the number of bits of the bucket directory in NEHASH to 1, 2, 4, 6, 8, and 10; that is, the corresponding bucket group sizes are 512 B, 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB. For comparison, we also set the segment size of CCEH to 512 B, 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB. Using 10 million random integers as the workload, the read and write performance of NEHASH and CCEH is tested with the YCSB tools.

Fig. 9 shows the result of changing the bucket directory bits of NEHASH and the segment size of CCEH to test the search throughput. When the number of bits in the bucket directory is relatively small, NEHASH shows relatively low search throughput. This is because when the number of bits in the bucket directory is small, the number of buckets in the bucket group is also very small, which means there are more hash collisions. After a hash collision occurs, NEHASH needs to read the data in the conflicting hash bucket, and since those data are out of order, NEHASH has to traverse them during the search, resulting in lower search throughput. When the number of bucket directory bits reaches 8, the search throughput is 1.53 times that with 1 bucket directory bit. This is because as the number of bucket directory bits increases, the number of buckets in the bucket group also increases, so the probability of conflicts is greatly reduced; there is then no need to search for data in the conflicting hash bucket, and the data can be obtained directly through the index number in the bucket, which greatly improves the search throughput. Compared with CCEH, when the number of bucket directory bits is less than or equal to 8, the search throughput of NEHASH is higher than that of CCEH. This is because NEHASH stores the hash directory, which lies on the critical path, in DRAM, which can effectively reduce search latency. However, when the number of bucket directory bits continues to increase to 10, the search throughput decreases slightly. This is because as the number of bucket directory bits keeps increasing, the number of buckets in the bucket group doubles each time, which causes considerable management overhead.

Fig. 9 Search throughput

Fig. 10 shows the result of changing the bucket directory bits of NEHASH and the segment size of CCEH to test the insertion throughput. When the number of bucket directory bits is only 1, the write throughput of NEHASH is very poor and is lower than that of CCEH. This is because when the number of bucket directory bits is 1, there are only two 256-byte buckets in the bucket group, and the capacity of the bucket group is too small; bucket group splitting and hash directory expansion then occur frequently when writing data, which greatly reduces the insertion throughput of NEHASH. As the number of bucket directory bits increases, the number of buckets in the bucket group increases exponentially, and the frequency of bucket splitting and hash directory expansion during writes decreases, so the insertion throughput of NEHASH also increases accordingly. When the number of bucket directory bits is 6, the write throughput of NEHASH reaches its maximum, which is 1.93 times that with 1 bucket directory bit. When the segment size and bucket directory bits continue to increase, the insertion throughputs of both CCEH and NEHASH show a downward trend, and the downward trend of NEHASH is more obvious than that of CCEH: when the CCEH segment is larger than 16 KB, segment splitting results in a large number of NVM writes, which reduces the CCEH insertion throughput; when the number of bucket directory bits in NEHASH is greater than 6, that is, when the bucket group capacity is greater than 16 KB, multiple hash buckets and additional bucket directories need to be created to split the bucket group, which generates more NVM write operations than CCEH, so the insertion throughput shows a clear downward trend.

Fig. 10 Insertion throughput

6.3 Tail latency

To evaluate the overhead of each hashing scheme, we use 16 threads to test the latency distribution of each scheme under write-intensive and read-intensive workloads; their cumulative distribution functions (CDFs) are shown in Figs. 11 and 12.

Fig. 11 shows the distribution of tail latency for a write-intensive workload. LEVL and Cuckoo show relatively poor tail latency because they are based on static hashing schemes: static hashing requires locking the entire hash table during rehashing, which blocks a substantial number of concurrent queries and incurs dramatic tail latency. NEHASH and CCEH are based on dynamic hashing schemes and therefore exhibit good tail latency. CCEH restricts its 99th percentile latency to 10.2 μs, but NEHASH shows the lowest tail latency and restricts its 99th percentile latency to 6.5 μs. This is because CCEH performs linear probing when there is a conflict in a hash bucket, which results in additional accesses to NVM and incurs overhead, while NEHASH avoids the overhead of linear probing through the index in the hash bucket.

Fig. 11 Latency distribution of insertion

Fig. 12 shows the distribution of tail latency for a read-intensive workload. All four hashing schemes show similar tail latencies. Cuckoo shows the worst tail latency because it handles hash collisions through two hash functions and accesses hash buckets multiple times, which causes serious search overhead. NEHASH still shows good tail latency, restricting its 99th percentile latency to within 1.2 μs, thanks to the bucket directory index in the bucket group.

Fig. 12 Latency distribution of search

7 Conclusions

This paper designs a hash directory based on lazy expansion, which uses a three-level hash directory instead of a single-level one, perceives and adapts to the growth rate of hash keys under each hash directory entry, delays the expansion of the hash directory, and dynamically adjusts its expansion rate. A group-based hash bucket management algorithm is designed, in which bucket groups replace single hash buckets, reducing the locking granularity when managing hash keys and improving the concurrency of hash key management. A hierarchical storage strategy and a hash directory recovery algorithm are designed, which exploit the respective advantages of DRAM and NVM so that extendible hashing can effectively manage the massive metadata in a file system. Using Intel Optane DC Persistent Memory and its driver, the prototype of high-concurrency extendible hashing for NVM (NEHASH) is realized, and YCSB is used to compare it with CCEH, level hashing, and cuckoo hashing. The results verify that NEHASH can improve read throughput by up to 16.5% and write throughput by up to 19.3%.

In future work, we plan to closely integrate NEHASH with the file system to effectively improve the performance of the NVM file system.

Contributors

Tao CAI designed the research. Tao CAI and Pengfei GAO processed the data. Tao CAI drafted the paper. Dejiao NIU, Yueming MA, Tianle LEI, and Jianfei DAI helped organize the paper. Tao CAI and Pengfei GAO revised and finalized the paper.

Compliance with ethics guidelines

Tao CAI, Pengfei GAO, Dejiao NIU, Yueming MA, Tianle LEI, and Jianfei DAI declare that they have no conflict of interest.

Data availability

The data that support the findings of this study are available from the corresponding authors upon reasonable request.