Shengm ei Luo,D i Zhao,W ei Ge,Rong Gu,Chunfeng Yuan,and Yihua Huang(.ZTE Corporation,Nanjing 00,China;.Nanjing University,Nanjing 0046,China)
HMIBase:An Hierarchical Indexing System for em for Storing and Querying Big Data g Data
Shengm ei Luo1,D i Zhao2,W ei Ge2,Rong Gu2,Chunfeng Yuan2,and Yihua Huang2
(1.ZTE Corporation,Nanjing 210012,China;
2.Nanjing University,Nanjing 210046,China)
Relational databasemanagement systems are usually deployed on single⁃nodemachines and have strict limitations in terms of da⁃ta structure.Thismeans they do notwork well with big data,and NoSQL has been proposed as a solution.Tomake data querying more efficient,indexes and memory cache techniques are used in NoSQL databases.In this paper,we propose a hierarchical in⁃dexingmechanism and a prototype distributed data⁃storage system,called HMIBase,which has hierarchical indexes for non⁃prima⁃ry keys in tables and makes data querying more efficient.HMIBase uses HBase as the lower data storage and creates a memory cache formore efficient data transmission.HMIBase supports coprocessor⁃to⁃process update requests.It also provides a clientwith query and update APIs and a server to support RPCs from the client and finish jobs.To improve the cache hit ratio,we propose a memory cache replacement strategy,called Hot Score algorithm,in HMIBase.The experimental results show that Hot Score algo⁃rithm is better than other cache⁃replacement strategies.
NoSQL;In⁃Memory Index;HMIBase;Hot Score
R elational databases have been around since the 1970s and have been an integral partof data stor⁃age and query applicationsofnumerous companies and organizations.In recent years,the speed at which data is created has rapidly increased.A reportof IDC in 2011 shows that the amountofglobal information and will sur⁃pass 1.8 ZB,or 1.8×109 GB,which growing by a factor of 9 in just five years[1].Ithasbecome difficult(arguably impossible) to store all existing and newly created data in RDBMSs be⁃cause an instance of a common RDBMS can only be installed on a single server.Another reason that RDBMSs do notwork wellwith big data is thatmost data is not structured in a way thatmeets RDBMS requirements[1].Not Only SQL(NoSQL) has therefore been proposed for big data storage and querying [2].NoSQL has no limitations in terms of the type of data that can be inputor queried,and it is easy to deploy in a distribut⁃ed system.This makes it particularly suitable for big data. Compared with RDBMS,NoSQL isalso costefficient.Someap⁃plications have storage requirements that cannotbemetby re⁃ lationaldatabases[3].
In thispaper,we propose:
•ahierarchical indexingmechanism
•a prototype distributed data⁃storage system,called HMI⁃Base,which has hierarchical indexes for non⁃primary keys in tablesand enablesefficientdata querying
•amemory cache replacementalgorithm,called Hot Score,in HMIBase.
To implement out system,we use open⁃source Apache HBase[4],which is designed for big data storage and query⁃ing.We use HBase because the data schema in HBase is not as strict as that in RDBMS,and users do not need to tell HBase the length of every property when creating a table.Sec⁃ond,the structure of a table can bemodified even if the table is not empty.Third,HBase is based on Hadoop Distributed File System(HDFS),whichmeans it is easy to scale an HBase cluster from a single server to hundreds of similar nodes.This isdifficult to dowithmostRDBMSs.
To reduce query time,a cache is also needed.In fact,data setqueriesoften conform to the 80⁃20Rule[5];that is,they fol⁃low a Zip⁃F Distribution[6]with some specific arguments. Therefore,storing frequently accessed data in a cache can im⁃prove performance significantly.Because most SQL and NoSQL systems do not support caching,a cache needs to bebuilt on top of a database system.We use Redis[7]for the memory cache because it is based on key⁃value pairs and per⁃forms verywellwith single⁃key queries.Also,mostdata in Re⁃dis is stored in thememory,and the cost in termsofquery time issignificantly less than thatofHBase.
This paper proceeds as follows.In section 2,we introduce hindex,an index system in HBase.Our proposed system has some similarities to hindex.In section 3,we describe HMI⁃Base,itsmodulesand inner implements.In section 4,we intro⁃duce the procedure of HMIBase.In section 5,we describe the architecture of HMIBase.In section 6,provide the results of tests done on the Hot Score algorithm.Section 7 concludes the paper.
2.1 Hindex
J.Mandava et al.introduced hindex to improve the perfor⁃mance of HBase[8].Hindex is fully implemented on the server side implementation along with the HBase coprocessor,which preserves index data in a separate table.Indexing is region⁃wise,and a custom load balancer co⁃locates the index table re⁃gions with actual table regions.HBase with hindex supports multiple indexes on a table and multi⁃column indexes.An in⁃dexmay be based on partof a column value.And then,equal and range condition scans using index if index has been built. Also,indexing can be donewith a bulk load[8].
Hindex has a put operation and scan operation.When a row is put into the HBase’s user table,coprocessors generate the index information and then put it into the corresponding index table.The index table’s rowkey can bespliced as follows[8]:
When scanning on a user⁃table,a scannerwillbe created by coprocessoron the index table.Then coprocessoruse this scan⁃ner to scan the index data,and seeks to extract rows in the us⁃er table.These seeks on HFiles are based on the rowkey ob⁃tained from index data.By doing this,blockswhere data is not present can be skipped[8].However,hindex locates its index on each node ofa cluster,so data querieshave to be sent to ev⁃ery node even if the resultonly exists in one node.Itmay cost toomuch time toexecute thequery fordata.
2.2 Mem ory Cache Strategies
An appropriatememory cache strategy is necessary to dis⁃card some records,improve the hit ratio,and reduce query time.Common cache strategies include first⁃in first⁃out(FIFO) [9],which involves selecting the first record for data replace⁃ment;random,which involves random ly selecting a record to replace;and the Least Recently Used(LRU)[10],which in⁃ volves selecting the least⁃accessed record in a recent period and replacing it.Other strategies,such as TBF,have also been proposed[11].We propose a new strategy that involves the use of Hot Score algorithm.Hot Score uses a property,called“hot score,”to decidewhich record to replace.
3.1 HBase and ItsCoprocessor
HBase is a distributed,fault⁃tolerant,highly scalable,col⁃umn⁃oriented,NoSQL database.HBase is used for real⁃time read/write random⁃access to very large databases[5].It lever⁃ages the distributed data storage of HDFS,a distributed file system runningon thousandsofcomputers.
As with HDFS,the two main parts of HBase are HMaster and HRegionServers,both ofwhich aremanaged by Zookeeper [6].In particular,Zookeeper canmanage configuration informa⁃tion,which is often difficult tomanage in a distributed system. HMaster[1]is the master server of HBase and monitors all HRegionServer[1]instances in the cluster.HMasteralsomoni⁃tors the interface ofallmetadatamodifications.Another impor⁃tantpartof HBase is HRegionServer,which serves andmanag⁃es regions.In HBase,region is an important concept that de⁃scribes the basic element of availability and distribution for HBase tables.A HRegionServer often runs on an HDFS DataNode.
HBase is a NoSQL database.However,it is a quasi⁃relation⁃al database because it lacks some of the important features ofa relational database,e.g.,typed columns and triggers.Also,un⁃like SQL in a relational database,there is no advanced query language based on HBase.An SQL systemmaymaintain an in⁃dex for a primary key in a table to improve querying.However,data in HBase is stored in HDFSandmay be located in differ⁃entmachines.Therefore,it is difficult for HBase to maintain an index similar to thatofan SQL system.
The HBase coprocessor[5]is based on the BigTable copro⁃cessor of Jeff Dean[6].With a BigTable coprocessor,each tab⁃let in the table server can run arbitrary code and supporthigh⁃level call interface for clients.A BigTable coprocessor is a very flexible model for building distributed services and en⁃ables automatic scaling,load balancing,and request routing. Themainmodules in a coprocessor are Observer and Endpoint [5].
An instanceofObserver contains three treeobservers:
1)RegionObserver,which provides hooks for data manipula⁃tion events such asGet,Put,Delete,and Scan.
2)WALObserver,which provides hooks for operations related towrite⁃ahead log(WAL).
3)MasterObserver,which provides hooks for DDL⁃type opera⁃tionssuch asCreate,Delete,and Modify Table.
Endpoint is an interface for dynamic remote procedure call (RPC)extension.Endpoint is installed on the server side andcan be invoked with HBase RPC.The client library provides convenientmethods for invoking dynamic interfaces.
The coprocessor procedure is as follows.The coprocessor initiates RPC invocationsof the registered dynamic protocolon every target table region.The results of those invocations are returned as they become available.The client librarymanages this parallel communication on behalf of the application.The client library managesmessy details,such as retries and er⁃rors,until all results are returned or there is an unrecoverable error.Then the client library rolls up the responses into a Map and hands it over to the application.If an unrecoverable error occurs,an exception will be thrown to the application code so that it can take action.Fig.1 shows a typical process of a co⁃processor.
3.2 Redis
Besides HBase,another importantNoSQL database is Redis [7].Redis is a key⁃value store but is often referred to as a data structure server because keys can contain strings,hashes,lists,setsand sorted sets.By supporting different typesofdata struc⁃tures shown above,Redis can process a wide variety of prob⁃lems that can be naturallymapped intowhat Redis offers.Re⁃dis allows its users to solve their problems without having to perform the conceptual gymnastics required by other databases [13].
In our system,Redis can be used as amemory cache to re⁃duce the timeofqueriesand improve system performance.
We propose a new index mechanism for a system called HMIBase,which is based on HBase with coprocessors.HMI⁃Base has two parts:client and server.The server contains a memory cache,and the client provides query and update APIs similar to HBase.
4.1 HMIBase Index Mechanism
Assumewe query a specific table to obtain a record with its value in a specific column.We create an index table that con⁃ tains only a rowkey but no other columns.The rowkey of in⁃dextable contains the table’s column information,the value in column and rowkey oforigin table.For example,a table called telephoneBook has three columns:id,which contains people’s ids;name,which contains names;and telephone,which con⁃tains telephone numbers.To obtain the name of a person with a specific telephone number,we build the index when input⁃ting the data into the table.The index is constructed as follows: t,number,id.Here,t describes to the column information and can be replaced with a column name,abbreviation,or any oth⁃er content.The number and id is the value in the column for telephone numberand rowkey in telehphoneBook,respectively.
In order to query all values in a specific range,we also pro⁃pose a value table for each column to be queried.Theremay bemore than two value tables for a single table.Referring to the example previouslymentioned in subsection 4.1,the prefix digitsofa telephonenumber indicate the zoneof thenumber;e. g.,a telephone number starting with 025 indicates it comes from Nanjing.We can therefore send a range query starting with 025⁃00000000 and endingwith 025⁃99999999 to search for peoplewho live in Nanjing.The telephone numbers in cor⁃responding value table are sorted in lexicographic order as a rowkey.
▲Figure1.Coprocessor procedure[12].
With this indexmechanism,non⁃primary keys can be que⁃ried by querying two rowkeys.First,the value table is queried; values in a specific range are obtained;and values in the index table are queried in order to obtain the rowkeys.Finally,an or⁃dinary query of the rowkeys will provide the answers.This mechanism is especially suitable to range queries because HBase has to scan all tables to obtain records ofa non⁃primary key.Querying rowkeys results in much better performance much than queryingnon⁃primary keys.
4.2 HMIBase Framework
There are two types of data in HMIBase:hot and cold.The former is data stored in both the memory cache and HBase,and the latter is data stored only in HBase.
Fig.2 shows the HMIBase framework.On the client side,themainmodules in HMIBase are Query Interface,Query Re⁃quest Process and Update Interface.On the server side,the mainmodules in HMIBase are Index⁃Cache in Memory,Data⁃Cache in Memory,and HBasewith coprocessor.
The Query Interfacemodule can receive queries from the us⁃er and return results to the user.In HMIBase,there are two types of query request:pointand range.A point query request can be used to query information abouta specific key whereas a range query can be used to query all information between the lowerand upperboundsof the user’s input.
The procedure of HMIBase is as follows.APIs are used to first send a query or update request.After receiving the re⁃quest from a user,the Query Interfacemodule passes it to the Query Request Processmodule.A point query request is sent directly to the Index⁃Cache in Memory module.However,arange query request is sent to HBase to find the valueswithin the range specified by bounds.Then,the range query request is transformed into a number of point query request and pro⁃cessed one by one.Finally,the results of all point query re⁃quests aremerged and returned to the Query Interfacemodule and then theuser.
▲Figure2.HMIBasearchitecture.
Index⁃Cache in Memory is a cache of the hot data index,and Data⁃Cache in Memory stores the datawhose index are In⁃dex⁃Cached in Memory.Like in other cache systems,the selec⁃tion ofdata from HBase dependson the cache strategy.We pro⁃pose an algorithm to improve system performance.Memory ac⁃cess ismuch faster than HBase access(HBase data is stored in HDFS and HDFS files are stored on the local file system). Therefore,the time spenton a query is reduced.We use Redis for Data⁃Cache in Memory.
If not in memory,a query request is sent to HBase.An HBase module executes a native query and sends the result back.Whether the data of the queried key is stored inmemory dependson the strategy.
The Update Interfacemodule hasupdate APIs,including in⁃sert,delete and update.Themodule sends update requests to HBase.An update request is a trigger to HBase coprocessors. If the data corresponding to an update request is duplicated in the memory cache,the memory cache is temporarily locked and deleted or replacedwith new data.
4.3 HMIBase Imp lementation
In the client,HMIBase implements a query client object for each session.A query client tries to establish a connection with thememory cache serverwhen initiating and disconnects exiting.Then,a user can call query APIs supported by Query ClientObject in order to finish a job.
The query client contains amemory client object that pro⁃cesses the connection with the cache server.Thememory cli⁃ent sends a remote process call to the server and obtains a val⁃ue in return.Both point queries and range queries are pro⁃cessed by same the RPC;the only difference is the argument passed to thequery clientby theuser.
The query clientdoes notsupport Update APIs because up⁃date operations trigger the HBase coprocessor.Current HBase APIs work efficiently in HMIBase,and it is not necessary to implementnew APIs.
HMIBase implements a memory cache server at the server side.The reason thememory cache is called a“server”is that it can receive remote process call requests from the clientand return results,just like a server.When initiated,thememory cache server uses Apache ZooKeeper tomove data to different machine nodes.If a table is created by a user,a global index table corresponding to the table is created simultaneously.Da⁃ta transfer ismuch slower than memory caching,and a global index table can reduce the time spentaccessing each node be⁃cause itmaps the query key to its location.In HMIBase,Zoo⁃Keeper uses a consistenthashing algorithm tomap data to dif⁃ferentnodes[14](Fig.3).
▲Figure3.Consistenthashing.
ZooKeeper arranges nodes in a logical ring,and the loca⁃tions of nodes within this ring are uniform ly distributed.After receiving a remote process call from the memory client or HBase coprocessor,the server tries to hash the query request (including table name,column information,and value)to a val⁃ue of the logical ring.Themetadata of this value is stored in its nextnode,as indicated by the arrows in Fig.3.Ifa nodewithin the cluster is disconnected,the value is mapped to the next node in the logical ring.
HMIBase also focuses on robustness.Maintaining data is usually difficult in a distributed database because different nodes in clustermay getdifferentsystem statuswhen there are multiple operations on the same record.HMIBase adds a syn⁃chronized lock on the server’s update interface.One requestof update can access the data record only after itgets the server’s update lock.An update session can only be initiated because an update lock has been obtained from the server.After a ses⁃sion ends,it unlocks itself and sends an event to ZooKeeper,which then updates the system status.Another session in the waiting queue obtains the lock in order to do its job.
To improve thehit ratioand reducequery time,HMIBaseus⁃es Redis as amemory cache.However,we adopt a new strate⁃gy—the use of Hot Score algorithm in the cache.Hot Score as⁃signs a hot score to each record in HBase.This score is given by:
where countPeriod is the cycle of the Hot Score algorithm,i.e.,the number of query requests between current execution and the next,and visitCount is the number of times a record has been queried during the last countPeriod Query Requests.The parameterαis theweightof the current statistical resultwhen calculating a new hot score.The lastHotScore is the original valueof thehotscoreand hasaweightof1⁃α.
The Hot Score algorithm is part of a memory cache server query procedure,so we describe this algorithm in the function query.The function query with Hot Score algorithm is de⁃scribed in pseudo code in Algorithm 1.
Algorithm 1:Querywith HotScore Algorithm Input:table t,column col,Value value Output:query result res res.key=generate_key(t,col,value); if Cache.contains(res)=true then get res.data from redis; end else get res.data from HBase; end if res.data=null then return res; end Cur_Access_Set.add(res); res.increase_visit_count(); increase_query_counter(); if get_query_counter()=count_period then reset_query_counter(); foreach record in Cur_Access_Set do record.hot_score=record.hot_score*α+ record.get_visit_count()/count_period*(1-α); sort Cur_Access_Set by hot_score order; foreach i in[0,MAX_CACHE_SIZE-1]do
record=Cur_Access_Set.get(i); if Cache.contains(record)=false then Cache.add(res); get record.data from HBase; put record.data into Redis; end end foreach i in[MAX_CACHE_SIZE,Cur_Access_Set.size()⁃1]do record=Cur_Access_Set.get(i); if Cache.contains(record)=true then Cache.remove(res); delete record.data from Redis; end end Cur_Access_Set.clean(); end return res;
In Algorithm 1,Cache is the currentmemory cache,which only needs to be updated when Hot Score algorithm is execut⁃ed.Cur_Access_Set isa setofmetadata ofall the queriesovera period and needs to be cleaned after the Hot Score algorithm has ended.For each query,HMIBase generates a key using the table name,columns,and values input by the user.HMIBase then tries to access the data from Cache using the generated key.If the data cannotbe accessed,it isobtained from HBase. There is no data exchange between the cache memory and HBase unless the Hot Score algorithm is executed.When the number ofqueries reaches countPeriod,Hot Score Algorithm is executed.First,hot scores are calculated for all records ac⁃cessed in Cur_Access_Set,regardless of whether they are in memory or not.This calculation ismade using(1).Then,re⁃cordsare sorted according to theirhotscoresand the Top⁃K re⁃cords are chosen.The chosen records are loaded intomemory and others,even if they remained inmemory during the lastpe⁃riod,are discarded.After Hot Score is executed,Cur_Ac⁃cess_Set is cleaned in preparation for thenextperiod.
We conducted experiments(Table 1)to determine how our methods affect system performance.Brown University’s Ma⁃pReduce database benchmark was used[15]because it sup⁃ ports a Python script.This enables data records to be generat⁃ed asmuch onewants.Tenmillion recordswere generated and stored in HBase(datawas put into a test table).After this,the query time for native HBase,Hindex and our HMIBase was compared.We compared the query time of different cache strategies to see the improvement brought about by Hot Score algorithm.
Wealsogenerated somequery requests,including pointque⁃ry and range query request on this benchmark.The values of point and range query are generated in ZipF distribution[11] and the ranges here are set to be constant values(here,10 and 100).
We ran some random queries,including point queries and range queries(range=10)on HBasewithoutanymodification (native HBase),hindex,and HMIBase.Fig.4 shows results of these querieson the differentsystems.
In Fig.4,the time cost for pointand range queries on HMI⁃Base ismuch shorter than that for point and range queries on HBase and hindex.In HMIBase,time cost for a range query is nearly 10 times that for a point query.HBase uses a filter to query and scansall the records in the table;therefore,the time cost for pointand range queries are similar.Hindex creates an index on each region of HBase;therefore,the time spent on each node becomes shorter,but there is no obvious improve⁃ment in accessing nodes from themaster.With HMIBase,the speed of access to nodes from themaster has an improvement with a factor of five to eight.HMIBase locates the index direct⁃ly and stores hot data in thememory,so it performsmuch bet⁃ter than native HBase and hindex.Because HMIBase executes a range query by executing multiple point queries,the time costofa rangequery dependson thesizeof resultset.
In another experiment,we applied different cache strategies to HMIBase and add strategy asa property in the configuration so that itwas easy to switch between strategies.We executed some point queries and range queries using different strate⁃gies.Fig.5 shows the time costof these queriesusing different cache strategies.
Hot Score algorithm results in faster point and range query⁃ing than other caching strategies.As the size of data increases,the difference in time costbetween Hot Score and other strate⁃gies increases.Themain reason for this difference is that Hot Score algorithm hasa higher hit ratio than other caching strate⁃gies.The hit ratios for each query executed in the experiment are shown in Figs 5(b),(d)and(f).We use a coefficient of 0.2 to allow the size of the cache to increase in linewith the size of queried data.In this way,there is no visible difference in hit ratiowhen thesizeofqueried data increases.
▼Table1.Experimentenvironment
▲Figure4.Experiment results.
◀Figure5. Time costofpointand rangequeriesusing differentcachestrategies.
The hit ratio for LRU ishigher than that for random data se⁃lection because LRU can calculate when the data was last used and store hot data for faster retrieval.However,LRUmay discard some hot data because it is only concerned with the last use of the data.Hot Score,on the other hand,calculates the useof data overawhole period and thushasahigherhit ra⁃tio than LRU.
Another reason that Hot Score algorithm results in fastque⁃rying is that it reduces data transmission.In Redis,read and write operations need to be synchronized in order tomaintain data consistency,and this increases time cost.With random da⁃ta selection or LRU,if not inmemory,thememory cache needs to decidewhich key to exchange and do a read/write operation between Redis and HBase files.Thismeans the synchroniza⁃tion operationsmay increase the time of the query.Hot Score algorithm only exchanges data when the number of queries is large enough,e.g.,10,000.Even though theremay be several updates between HBase and cache,Redis can process these updates in a pipeline because there are no read operations be⁃tween the updates,and only one synchronization operation is required in the end.Therefore,Hot Score also reduces trans⁃mission time.
NoSQL has becomemore and more popular for data storage in recent years.Tomake data queryingmore efficient,indexes andmemory cache techniques have been used in NoSQL data⁃bases.In this paper,we have described HMIBase,a NoSQL system built on Hbase.HMIBase has indexing and has trans⁃parent features and modules.HMIBase uses HBase for data storage and creates a memory cache for more efficient data transmission.HMIBase offers a client query and update APIs aswell as a server to support RPCs from the client and finish jobs.
HMIBase supports a variety of cache strategies for data ex⁃change.These strategies include random,LRU,and Hot Score,a new algorithm introduced in this paper.After the introducing Hot Score,we described how HMIBase is based on an HBase cluster.We then performed experiments on HMIBased using the data cache strategies justmentioned.These experiments re⁃vealed that Hot Score algorithm is a better cache strategy than LRU or random.
Future researchmay be done on NoSQL systems(other than HBase and Redis)thathave not been well researched and the addition ofan exception process procedure for HMIBase in or⁃der to improve the robustness of HMIBase.This will make HMIBasemore suitable for differentscenarios.
[1]L.George,HBase:The Definitive Guide,1st ed.Sebastopol:O’Reilly Media,2011,pp.319-320.
[2]K.Grolinger,W.A.Higashino,A.Tiwari,and M.AMCapretz,“Datamanage⁃ment in cloud environments:NoSQL and NewSQL data stores,”Journal of Cloud Computing,vol.2,no.22,2013.doi:10.1186/2192⁃113X⁃2⁃22.
[3]R.Hecht and S.Jablonski,“NoSQL evaluation:a use case oriented survey,”in 2011 Int.Conf.CSC,Hong Kong,China,pp.336-341.doi:10.1109/ CSC.2011.6138544.
[4]Apache.(2014).ApacheHbase[Online].Available:http://hbase.apache.org
[5]W.A.Britten,“A use statistic for collectionmanagement:the 80/20 rule revisit⁃ed,”Library Acquisitions:Practice&Theory,vol.14,no.2,pp.183-189,1990. doi:10.1016/0364⁃6408(90)90061⁃X.
[6]B.C.Brookes,“The derivation and application of the Bradford⁃Zipf distribu⁃tion,”Journal of Documentation,vol.24,no.4,pp.247-265,1968.doi: 10.1108/eb026457.
[7]S.Sanfilippoand P.Noordhuis.(2010).Redis[Online].Available:http://redis.io/
[8]J.Mandava,R.Chintaguntla and P.Rastogi.(2013).Hindex[Online].Available: https://github.com
[9]R.L.Mattson,J.Gecsei,D.R.Slutz,et al.,“Evaluation techniques for storage hierarchies,”IBMSystemsjournal,vol.9,no.2,pp.78-117,1970.doi:10.1147/ sj.92.0078.
[10]E.G.Coffman and L.Varian,“Further experimental data on the behavior of programs in a paging environment,”Communications of the ACM,vol.11,no. 7,pp.471-474,1968.doi:10.1145/363397.363398.
[11]C.Ungureanu,B.Debnath,S.Rago,et al.,“TBF:A memory⁃efficient replace⁃ment policy for flash⁃based caches,”in 2013 IEEE 29th Int.Conf.Data Engi⁃neering,Brisbane,Australia,pp.1117-1128.doi:10.1109/ ICDE.2013.6544902.
[12]M.Lai,E.Koontz,and A.Purtell.(2012).Coprocessor Introduction[Online]. Available:https://blogs.apache.org/hbase/entry/coprocessor_introduction
[13]J.L.Carlson,Redisin Action.Shelter Island:Manning PublicationsCo.,2013.
[14]D.Karger,E.Lehman,T.Leighton,et al.,“Consistent hashing and random trees:Distributed caching protocols for relieving hot spots on theWorld Wide Web,”in Proc.29th Annual ACMSymp.Theory ofComputing,1997,pp.654-663.doi:10.1145/258533.258660.
[15]A.Pavlo,E.Paulson,A.Rasin,et al.,“A comparison of approaches to large⁃scale data analysis,”in Proc.2009 ACMSIGMOD Int.Conf.Management of Data,Providence,USA,pp.165-178.doi:10.1145/1559845.1559865.
Manuscript received:2014⁃05⁃23
Biographies phies
Shengmei Luo(luo.shengmei@zte.com.cn)received hisMSdegree in telecommuni⁃cation and electronics from Harbin Institute of Technology in 1996.He is a chiefar⁃chitect at ZTE Corporation.His research interests include cloud computing,cloud storage,and bigdata.
Di Zhao(zd08135@126.com)received his BSdegree in computer science and tech⁃nology from Nanjing University in 2012.He is currently amaster’s degree candi⁃date at the Department of Computer Science and Technology,Nanjing University. His research interests include parallel computing and analyzing and processing of bigdata.
W eiGe(gloria.w.ge@qq.com)received her MSdegree from Northeastern University in 2003.She is currently a PhD candidate in computer science atNanjing Universi⁃ty.Her research interests include data management,database query optimization,and big data query optimization.She has published 10 papers in journals and con⁃ference proceedings,including in Science in China(series F),APWeb 2003,and JournalofElectronics.
Rong Gu(gurongwalker@gmail.com)received his BS degree in computer science from Nanjing University of Aeronautics and Astronautics in 2011.He is currently a PhD candidate in computer science atNanjing University.His research interests in⁃clude parallel and distributed computing,cloud computing,and big⁃data parallel processing.
Chunfeng Yuan(cfyuan@nju.edu.cn)is a professor at the Department of Computer Science,Nanjing University.She received her BSand MSdegrees in computer sci⁃ence from Nanjing University.Hermain research interests include compute system architecture,big data parallelprocessing,and Web informationmining.
Yihua Huang(yhuang@nju.edu.cn)is a professor at the Department of Computer Science,Nanjing University.He received his BS,MS,and PhD degrees in computer science from Nanjing University.His research interests include paralleland distrib⁃uted computing,big⁃data parallelprocessing,and Web informationmining.
ange query is
,HMIBase turns to HBase to search for values within the range.HMIBase uses a first⁃key filter[1]to search the index table and obtain suitable values within the range in HBase.Then,HMIBase executesa number ofvaluequeries.
10.3969/j.issn.1673-5188.2014.04.002
http://www.cnki.net/kcm s/detail/34.1294.TN.20141127.1518.002.htm l,pub lished online November 27,2014
This wo rk is suppo rted by China National Science Foundation(Grant 61223003)and ZTE Industry-Academ ia-Research Cooperation Funds.