Data Cloud Computing based on LINQ
Junwen Lu¹, Yongsheng Hao², Lubin Zheng¹, Guanfeng Liu³
(1. Xiamen University of Technology, Xiamen 361024, Fujian, China; 2. Nanjing University of Information Science & Technology, Nanjing 210044, Jiangsu, China; 3. Soochow University, Suzhou 215006, Jiangsu, China)
Abstract: Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply, given the right programming structure. In the work to date, however, the many available choices bring difficulty because it is hard to make the best selection. The LINQ (Language Integrated Query) programming model can be extended to massively-parallel, data-driven computations. It not only provides a seamless transition path from computing on top of traditional stores like relational databases or XML to computing on the Cloud, but also offers an object-oriented, compositional model. In this paper, we introduce LINQ into the Cloud, argue that LINQ is a good selection for a Data Cloud, and then describe the details of file system management based on LINQ.
Keywords: LINQ; Datacenter; LINQ provider
1 Introduction
Cloud Computing refers to a recent trend in Information Technology (IT) that moves computing and data away from desktop and portable PCs into large data centers. The key driving forces behind the emergence of Cloud Computing include the overcapacity of today's large corporate data centers, the ubiquity of broadband and wireless networking, the falling cost of storage, and progressive improvements in Internet computing software.
Examples include Google's Google File System (GFS) [1], BigTable, and MapReduce infrastructure [2]; Amazon's S3 storage cloud, SimpleDB data cloud, and EC2 compute cloud [3]; and the open source Hadoop system [4], consisting of the Hadoop Distributed File System (HDFS), Hadoop's implementation of MapReduce, and HBase, an implementation of BigTable.
Meijer points out that the MapReduce programming model enables massively parallel processing through elementary techniques from functional programming. The brilliance behind MapReduce is that many useful data-mining queries can be expressed as the composition of a preprocessing step that parses and filters raw data to extract relevant information ("map"), followed by an aggregation phase that groups and combines the data from the first phase into the final result ("reduce") [5]. MapReduce alone, however, is too low-level to be productive for non-specialists. Consequently, domain-specific languages such as Yahoo!'s Pig Latin, Google's Sawzall, or Microsoft's SCOPE provide higher-level programming models on top of MapReduce. The common trait across these languages is that they represent a radical departure from the current mainstream programming languages, forcing developers to learn something new. LINQ's query operators, by contrast, are integrated within popular languages like C# or Visual Basic. This gives a new direction for the Cloud: developers need not learn anything new and can program on the Cloud directly.
The rest of this paper is structured as follows: Section 2 introduces LINQ into the Cloud architecture and discusses the benefits of applying LINQ on the Cloud. Section 3 introduces the framework of the Data Cloud. Section 4 gives the details of the file management system on the Cloud, whose mechanisms are implemented with LINQ. Section 5 concludes this paper.
2 LINQ on the Cloud
The many choices bring adoption problems for developers because of the difficulty of making the best choice. The SQL-like query model has difficulty in adoption because it does not appeal to developers who have embraced object-oriented languages like C# or Java. It is also suboptimal for MapReduce computations because it is not fully compositional. This conservative approach is puzzling because recent language and tool innovations such as Language Integrated Query (LINQ) address precisely the problem of compositional programming with data in modern object-oriented languages. Meijer proposes extending the LINQ programming model to massively-parallel, data-driven computations [5]. LINQ provides a seamless transition path from computing on top of traditional stores like relational databases or XML to computing on the Cloud, and it offers an object-oriented, compositional model. Just as the community has already built custom LINQ providers for sources such as Amazon, Flickr, or SharePoint, this model will trigger a similar convergence in the space of Cloud-based storage, computation substrates, and so on. In fact, we need not even pay attention to whether the message or data transfer protocol is UDP or TCP, because data management software (such as SQL Server, MySQL, and so on) handles this for us, and we only need to pay attention to the data-processing application, just as when we program a web application. As more and more LINQ providers become available [6], LINQ will have a deep influence on Cloud computing.
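To make the compositionality concrete, the following minimal sketch (with hypothetical, in-memory data) shows the map/reduce shape expressed with LINQ's query operators in C#; a LINQ provider could retarget the same query from objects to a database or a Cloud store without changing its text:

using System.Linq;  // LINQ standard query operators

var logs = new[] {
    new { Node = "slave1", Load = 0.72 },
    new { Node = "slave2", Load = 0.31 },
    new { Node = "slave1", Load = 0.58 }
};
var busyNodes = from entry in logs
                where entry.Load > 0.5              // "map"-style filter over the raw data
                group entry by entry.Node into g    // "reduce"-style grouping
                select new { Node = g.Key, Avg = g.Average(e => e.Load) };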
3 The Framework of the Data Cloud
Figure 1 shows the framework of the Data Cloud based on LINQ. Every user takes an Email address as a user name, and users can submit their data and jobs from a terminal unit. The Data Center is in charge of data management and job scheduling. It is composed of data management software, application programs, a job scheduling center, and so on. Every slave node provides a special service (such as Word or Excel) for all users.
Please note that the data management software can be SQL Server, ORACLE, and so on; we can also use MySQL, which is free for all of us, as the data management software. We know that SQL Server, ORACLE, and MySQL all have LINQ providers. In fact, we can set up several kinds of data management software in one system to satisfy the users' requirements.
The Datacenter is in charge of data management and scheduling. Slave nodes execute the tasks assigned by the Data Center. Under special conditions, the backup node will take the place of the Datacenter and take charge of its duties (Fig. 1).
4 The File Management System on the Cloud
In this section, we will discuss the file management system of the Data Cloud based on LINQ. As an example, we will take Microsoft SQL Server 2005 as the database management software; in fact, we could select other database management software, such as MySQL. All of the following code uses an object of DataCloudDataContext in C# 3.0, defined as:
DataCloudDataContext datacloud = new DataCloudDataContext();
Moreover, we will need some tables to describe the Data Cloud; their relations can be seen in Figure 2.
The meaning of every data field will be given in the following sections.
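Since Figure 2 itself is not reproduced here, the following sketch shows the LINQ to SQL entity classes that the figure and the code in the rest of this section imply; the exact columns, and the User table in particular, are our assumptions rather than the authoritative schema:

using System.Data.Linq.Mapping;  // LINQ to SQL mapping attributes

[Table(Name = "User")]
public class User
{
    [Column(IsPrimaryKey = true)] public long UserID;
    [Column] public string Email;      // the Email address doubles as the user name (Section 4.1)
    [Column] public string Password;
}

[Table(Name = "FileList")]
public class FileList
{
    [Column(IsPrimaryKey = true)] public long FileID;
    [Column] public string FileName;   // a long file name such as C:\Program Files\BaiDu\hys.txt (Section 4.3)
    [Column] public string DataSet;    // the data held by this record (one data field's worth)
    [Column] public long Next;         // FileID of the next piece of a big file, or -1 for "NULL" (Section 4.4)
    public User_File User_File;        // association property generated by the LINQ to SQL designer
}

[Table(Name = "User_File")]
public class User_File                 // links a user to a file with an access attribute
{
    [Column] public long UserID;
    [Column] public long FileID;
    [Column] public int FileAttribute; // see the enumeration in Section 4.2
}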
4.1 The Definition of User
Just like Google's Google File System, when someone registers on the Data Cloud and wants to be a user of the system, we ask the user to input his Email address and a password. An Email address is a very good selection for the user name. The reason is simple: it is unique, and a user name can only be possessed by one particular user. What is more, it can help us solve problems such as security, so we can pay more attention to solving other problems. In fact, many systems take the Email address as the user name, and it is convenient for the user too. Scerri et al. [7] introduce a formal email workflow model based on traditional email, which enables the user to define and execute ad-hoc workflows in an intuitive way. It paves the way for semantic annotation of implicit, well-defined workflows, thus making them explicit and exposing the missing information in a machine-processable way. This means the Email address will become even more useful as more attention is given to it: it is not only an agent for Email but also an identity card for many systems. A minimal registration sketch is given below.
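The sketch assumes the User table from Section 4 and a Users property on the generated DataContext; password hashing and Email verification are omitted:

void RegisterUser(string email, string password)
{
    DataCloudDataContext datacloud = new DataCloudDataContext();
    // The Email address is the user name, so it must be unique.
    if (datacloud.Users.Any(u => u.Email == email))
        throw new InvalidOperationException("This Email address is already registered.");
    datacloud.Users.InsertOnSubmit(new User { Email = email, Password = password });
    datacloud.SubmitChanges();
}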
4.2 The Attributes of Files
In the Data Cloud, a file has the following kinds of attributes, which we can enumerate as:
enum FileAttribute { OnlyRead = 0, Hide = 1, FullControl = 2, Edit = 3, ReadRun = 4, Read = 5, Others = 6 };
A user may then decide to change a file's attribute from an OldAttribute to a NewAttribute. The algorithm can be described as:
void ChangeAttribute(long UserID, long FileID, int OldAttribute, int NewAttribute)
{
    DataCloudDataContext datacloud = new DataCloudDataContext();
    // Find the file whose authorization record links it to this user.
    var Myfile = datacloud.FileLists.Single(c => c.User_File.UserID == UserID
        && c.FileID == FileID);
    Myfile.User_File.FileAttribute = NewAttribute;  // OldAttribute could be validated here first
    datacloud.SubmitChanges();
}
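For example, the owner might downgrade another user's access (the IDs here are only illustrative):

ChangeAttribute(42, 1001, (int)FileAttribute.FullControl, (int)FileAttribute.Read);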
Fig. 1 The framework of the Data Cloud based on LINQ
Fig. 2 The tables describing the Data Cloud and their relations
Fig. 3 The structure of a big file stored as chained records (RRD1, ..., RRD64)
4.3 Operations on Files
Just as in the Google File System, the owner of a file has rights such as deleting and changing it.
Deleting: Because the file is saved in the database as a record, we only need to delete the record from the database; a sketch follows.
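The sketch below handles a single-record file under the schema assumed in Section 4; big files need the recursive DeleteFile of Section 4.4:

void DeleteSmallFile(long FileID)
{
    DataCloudDataContext datacloud = new DataCloudDataContext();
    var Myfile = datacloud.FileLists.Single(c => c.FileID == FileID);
    datacloud.FileLists.DeleteOnSubmit(Myfile);  // mark the record for deletion
    datacloud.SubmitChanges();                   // commit: the file is gone
}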
Changing: We can change the attributes of a file; an example was given in Section 4.2. We can also change the data of the file:
void ChangeData(long UserID, long FileID, string newdata)
{
    DataCloudDataContext datacloud = new DataCloudDataContext();
    // Find the file whose authorization record links it to this user.
    var Myfile = datacloud.FileLists.Single(c => c.User_File.UserID == UserID
        && c.FileID == FileID);
    Myfile.DataSet = newdata;   // overwrite the stored data
    datacloud.SubmitChanges();
}
In particular, we have only discussed a small file that has a single record in this section. In Section 4.4, we will discuss big files, which may have more than one record in the database.
Others: People may want to move their data from one place to another, but in our system all data is saved as a record (or records) in the database, and the database management software is in charge of the operation. In fact, we can give the user long file names (such as C:\Program Files\BaiDu\hys.txt) to help the user manage the file system: "C:\Program Files\BaiDu" is no longer a directory but part of the file name. A sketch of moving a file under this scheme follows.
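Since no directory tree exists, moving (or renaming) a file reduces to an update of the FileName field; the sketch assumes the FileName column from Section 4:

void MoveFile(long FileID, string newLongName)
{
    DataCloudDataContext datacloud = new DataCloudDataContext();
    var Myfile = datacloud.FileLists.Single(c => c.FileID == FileID);
    // Rewriting the long name "moves" the file,
    // e.g. from C:\Program Files\BaiDu\hys.txt to D:\Backup\hys.txt.
    Myfile.FileName = newLongName;
    datacloud.SubmitChanges();
}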
4.4 Big Files for the Data Cloud
We know that the space of a data field is limited. How, then, do we deal with a file whose data size is larger than the space of one data field? Our policy is simple: we separate the file into many pieces, and each piece is saved as a record of the database. Assume we have a 64 MB image and the size of a data field in the database is 1 MB. The data (64 MB) is stored in 64 records, named RRD1, ..., RRD64. The structure can be seen in Figure 3. Please note that for RRD64 the value of the Next data field is "NULL", which is defined as -1 in the database (Fig. 3). A sketch of this chaining follows.
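The sketch below stores a big file as a chain of records through the Next field; the 1 MB piece size matches the example above, and the NewFileID helper is hypothetical:

const int PieceSize = 1024 * 1024;  // capacity of one data field (1 MB)

long StoreBigFile(string data)
{
    DataCloudDataContext datacloud = new DataCloudDataContext();
    long firstId = -1;
    FileList previous = null;
    for (int offset = 0; offset < data.Length; offset += PieceSize)
    {
        int len = Math.Min(PieceSize, data.Length - offset);
        FileList piece = new FileList
        {
            FileID = NewFileID(),                  // hypothetical unique-ID generator
            DataSet = data.Substring(offset, len), // one piece of the file
            Next = -1                              // -1 plays the role of "NULL"
        };
        datacloud.FileLists.InsertOnSubmit(piece);
        if (previous == null) firstId = piece.FileID;  // remember the head, RRD1
        else previous.Next = piece.FileID;             // chain the previous piece to this one
        previous = piece;
    }
    datacloud.SubmitChanges();
    return firstId;  // the FileID of the head of the chain (RRD1)
}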
For the user, most operations on such a file are the same as for data that has only one record in the database system. The other operations resemble a depth-first traversal, and we give an example for deleting a file. The algorithm can be described as:
void DeleteFile(long FileID)   // FileID of the first piece of the file
{
    long nextfileid;
    DataCloudDataContext datacloud = new DataCloudDataContext();
    var Myfile = datacloud.FileLists.Single(c => c.FileID == FileID);
    nextfileid = Myfile.Next;              // remember the next piece before deleting
    datacloud.FileLists.DeleteOnSubmit(Myfile);
    datacloud.SubmitChanges();
    if (nextfileid != -1)                  // -1 is the "NULL" sentinel of the chain
        DeleteFile(nextfileid);            // recurse on the rest of the chain
}
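The recursion simply walks the Next chain: each call deletes one piece and then recurses on the saved nextfileid until it meets the -1 sentinel, so the 64 MB file of Figure 3 is removed in 64 steps.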
4.5 Message and Data Transfer
We need not pay attention to whether the protocol for message or data transfer is UDP or TCP, because the database management software does much of this for us, which is enough. Other services, such as a message telling somebody that a new file has been authorized to him, can be provided by the Mail system, which was built many years ago; a sketch is given below.
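One possible sketch of such a notification through the standard .NET mail API; the sender address and SMTP host below are placeholders, not part of the actual system:

using System.Net.Mail;  // the standard .NET mail API

void NotifyNewFile(string userEmail, string fileName)
{
    MailMessage message = new MailMessage("datacloud@example.com", userEmail)
    {
        Subject = "A new file was authorized to you",
        Body = "You now have access to the file: " + fileName
    };
    new SmtpClient("smtp.example.com").Send(message);  // placeholder SMTP host
}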
Here is an example of returning a big file that has several records in the database:
string GetBigFile(long FileID, string str)   // FileID of the first piece; str accumulates the data
{
    long nextfileid;
    DataCloudDataContext datacloud = new DataCloudDataContext();
    var Myfile = datacloud.FileLists.Single(c => c.FileID == FileID);
    nextfileid = Myfile.Next;
    str = str + Myfile.DataSet;            // append this piece to the result
    if (nextfileid != -1)                  // -1 is the "NULL" sentinel of the chain
        return GetBigFile(nextfileid, str);
    else
        return str;
}
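The whole file is then read with GetBigFile(FileID, ""), where FileID identifies the first piece (RRD1). Note that appending to str copies the accumulated string once per piece; for long chains a System.Text.StringBuilder would avoid this quadratic cost, a design choice we leave open here.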
5 Conclusion
This paper introduces LINQ into the Data Cloud, and some details are given. The system uses the Email address as the identity of the user; we then discuss the implementation of file attributes, file operations, and big files; finally, message and data transfer are also described. All of them are based on LINQ. We find that LINQ provides a good way to build a Data Cloud. In fact, we have only given some details, and more needs to be researched in the future. Future work includes:
1. The details of the implementation method.
2. Security is a problem for us to research. Scerri et al. [7] introduce a formal email workflow model based on traditional email, which enables the user to define and execute ad-hoc workflows in an intuitive way. We plan to introduce semantic email into the security control of the Cloud.
This work is supported by JB12189.
References:
[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, SOSP'03, Bolton Landing, NY, October, 2003.
[2]Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI'06, Seattle, WA, November, 2006.
[3]Amazon Web Services, http://www.amazon.com/aws.
[4]Hadoop, http://hadoop.apache.org/core.
[5] Erik Meijer, http://www.cca08.org/papers/Paper30-Erik-Meijer.pdf.
[6] http://hi.baidu.com/haoyongsheng/blog/item/8005b7fa4016b7d9b48f3106.html
[7] Simon Scerri, Siegfried Handschuh, and Stefan Decker, Semantic Email as a Communication Medium for the Social Semantic Desktop, ESWC 2008, LNCS 5021, pp. 124-138, 2008.