Mohammed BinJubier,Mohd Arfian Ismail,Abdulghani Ali Ahmedand Ali Safaa Sadiq
1Faculty of Computing,Universiti Malaysia Pahang,Kuantan,Pahang,Malaysia
2School of Computer Science and Informatics,De Montfort University,Leicester,LE1 9BH,United Kingdom
3School of Engineering,Computing and Mathematical Sciences,University of Wolverhampton,Wulfruna Street Wolverhampton,WV1 1LY,United Kingdom
Abstract: Publishing big data and making it accessible to researchers is important for knowledge building as it helps in applying highly efficient methods to plan,conduct,and assess scientific research.However,publishing and processing big data poses a privacy concern related to protecting individuals’sensitive information while maintaining the usability of the published data.Several anonymization methods,such as slicing and merging,have been designed as solutions to the privacy concerns for publishing big data.However,the major drawback of merging and slicing is the random permutation procedure,which does not always guarantee complete protection against attribute or membership disclosure.Moreover,merging procedures may generate many fake tuples,leading to a loss of data utility and subsequent erroneous knowledge extraction.This study therefore proposes a slicingbased enhanced method for privacy-preserving big data publishing while maintaining the data utility.In particular,the proposed method distributes the data into horizontal and vertical partitions.The lower and upper protection levels are then used to identify the unique and identical attributes’values.The unique and identical attributes are swapped to ensure the published big data is protected from disclosure risks.The outcome of the experiments demonstrates that the proposed method could maintain data utility and provide stronger privacy preservation.
Keywords: Big data;big data privacy preservation;anonymization;data publishing
The vast influence of emerging computing techniques has encouraged the generation of large data volumes in the past few years,leading to the trending concept known as“big data”[1,2].Data publishing assists many research institutions in running big data analytic operations to reveal the information embedded and provide several opportunities with great unprecedented benefits in many fields[3].This process helps organizations improve their efficiency and future plans[1,4-6].Analyzing big data and extracting new knowledge while protecting sensitive information is now considered as one of the imperative needs[7].Moreover,much attention has been paid to potential data privacy violations and data misuse;hence,the proper protection of released data must be ensured because failure may lead to harmful situations impact to individuals and organizations[4].Many establishments such as educational institutes and healthcare centers need to publish data in different formats to extract new knowledge[7].
Data publication is the easiest method for data sharing that helps research entities run data mining operations on published databases to extract knowledge from the published data.Such knowledge can represent,interpret,or discover interesting patterns[7,8].However,the potentials of published partial data derived from big or a series of datasets are yet to be realized.Scholars face several problems during knowledge extraction process from the published data.One of such challenges is the issue related to data privacy that leads to the disclosure of individuals’identities.This issue is threatening the secure propagation of private data over the web.It has been the reason to limit the availability of large datasets to researchers[9].One of the common practices and most widely used for providing privacy for individuals is the anonymization approach of data prior to its publication.Data anonymization aims to reduce the associated risk of disclosing information of individuals and preserves the possible utilization of published data[10].Though,this approach remains holds two main open questions:1)Can anonymized data be effectively used for data mining operations?2)What protection is needed to prevent private information disclosure while preserving data utility?[11].
There are two popular models have been proposed for data publication [12]:(1) Multiple publication models from the same data publisher.Multiple data publications refer to a series of datasets in distinct timestamps that are all extensions in certain aspects(e.g.,quarterly released data)[8,13].When the datasets come from the same publisher,this implies that the publisher knows all the original data.(2) Single publication model from several data publishers.Several privacy approaches exist[14]for preserving data privacy.However,majority of these approaches focus mainly on a single publication[12,15,16],where the publisher anonymizes the dataset without considering other datasets that have been published.
In both models,there are two fundamental methods for releasing the published data.The first method is an interactive setting in which the data collector computes some function on the big data to answer the queries posed by the data analyzer.The second method is the non-interactive setting in which the big data is sanitized and then published [17].It is worth noting that in our study,we consider the scenario of a single publication model in the non-interactive setting where the big data are sanitized and independently published by many organizations(data collectors)that share several common individual records.The issue with this assumption is that in several cases,the information of an individual may be published by more than one organization[18],and an attacker may launch a composition attack[12,19]on the published data to alter their privacy.
The attributes that cover more than one organization may publish to create links,such as sex,age,and zip code,are called quasi-identifiers(QIs).A composition attack is a situation where an intruder tries to identify an individual by linking several available attributes(QIs)in the published data to an external database to exploit sensitive information [12,20-22].Therefore,anonymization can only be achieved by altering these attributes to conceal the linkage between the individual and specific values to avoid such attacks and preserve the possible utilization of the published data [12].The common method to sanitize the database while maintaining data utility is data anonymization,which is defined in[11]as a set of methods to reduce the risk of disclosing information on individuals,businesses,or other organizations.Most of the existing anonymization-based methods work by setting protection methods,such as perturbing [22,23],suppressing or generalizing variable values [13],or preserving privacy based on measures of correlation[12,24].The main aim of these methods is to create some sort of uncertainty in assessing identity inference or sensitive value [11].Besides,this protection method aims to weaken the linkage between the QI values and sensitive attribute (SA) values such that an individual cannot be identified with his/her sensitive values.
The single publication model has several correlated attributes rather than a single column distribution to achieve exceptional new knowledge results [3].Suppressing or generalizing methods rearrange the data distributions to execute mining for privacy preservation,which involves analyzing each dimension separately,overlooking the correlations among various attributes (dimensions)[25].Preserving privacy based on the perturbation method alters the original values of a dataset to its anonymized version,which leads to data utility problems depending on the amount and type of noise or the specific properties of data that are not preserved[7].
The clever approach to resolving these problems is to measure correlation to improve the protection and enrich data utility.The association is measured by a correlation coefficient,denoted byr,which plays a major role in data science techniques that measure the strength of association between variables;hence,the choice of a particular similarity measure can be a major cause of the success or failure in some of classification and clustering algorithms[26].
The Pearson Correlation Coefficient(PCC)and Mean Square Contingency Coefficient(MSCC)are the two commonly used measures in identifying association[24,27,28].PCC is used to determine the strength of a linear relationship between two continuous variables.The value of the coefficientrranges from[-1,+1][27].When the value ofris-1 or+1,a perfect linear relationship exists between the considered variables.However,if the value is 0,this infers no linear relationship exists between the pairs of variables.An MSCC is a chi-square measure of the correlation between two categorical attributes.Unlike PCC,chi-square measures the extent of the significance of the relationship instead of measuring the strength of the relationship.
The idea behind the measure of correlation is to keep data utility via grouping highly correlated attributes together in columns and preserving the correlations between these attributes.The correlation measure protects privacy as it breaks the associations between uncorrelated attributes in other columns via anonymization approaches such as randomly permutated and generalization[12,24].
In this study,ideas are pooled from [12,24]to propose an effective method of determining the level of data protection needed and knowing the optimal manner to achieve this protection level whilst preserving data utility.Both are achieved by using slicing in the anonymization approach for data publishing using vertical partitioning (attribute grouping) and horizontal partitioning (tuple partition).The lower protection level(LPL)and upper protection level(UPL)are used to overcome the unique attributes and presence of identical data for data privacy protection whilst preserving data utility.LPLovercomes the unique attribute values,whereasUPLovercomes the high identical attribute values.LPLand UPL define the level of protection around the attribute values and ensure that an attacker cannot obtain the sensitive information needed to identify the record owner within such interval.This work also relies on value swapping to ensure a lower risk of attribute disclosure and l-diverse slicing.The proposed approach ensures that the published big data is protected from disclosure risks.The outcome of the experiments show that the UL method could keep more data utility and provide a stronger privacy preservation.
This paper’s major contribution is the proposed upper and lower-level-based protection method(UL method) for data anonymization.The UL method better balances privacy,information loss,and utility.That is why the level of protection required and the optimal manner of achieving it are determined while preserving data utility using the lower and upper protection levels.This work also relies on rank swapping to guarantee a lower risk of attribute disclosure,achieve aggregate query and l-diverse inside the table and solve the problem of creating invalid tuples.
The rest of this paper is organized as follows:Section 2 reviews the related work.Section 3 presents in detail the UL method.Section 4 discusses the experimental analysis.Finally,Section 5 concludes the paper and highlights the key findings.
The most favourable approaches for preserving privacy based on the suppressing or generalizing method and anonymization of the data include the k-anonymity approach[29],l-diversity approach[30],and the T-closeness approach [15].These approaches were proposed for privacy preservation in one-time data publishing.These methods take personal data and anonymise it and make it unattributable to any specific source or person by breaking the relations amongst the attribute values.High dimensionality renders these approaches ineffective because the identities of the primary record holders can be unmasked by merging the data with either public(composition attack)or background information[12,31].Readers can refer to[5,7,32-34]for more comprehensive understanding of these approaches.
In the last decade,the probabilistic approach [35],e-differential privacy approach (e-DP) [36],hybrid approach [31],and composition [37]-preserving privacy based on the perturbation method were proposed for multiple independent data publishing.Composition is the first privacy model to prevent composition attacks in multiple independent data publishing [12].The proposed approach in [37]has integrated two novel concepts:(ρ,α)-anonymization by sampling and compositionbased generalization for independent datasets to protect against composition attacks.The proposed approach in [31]combined sampling,generalisation and perturbation by adding Laplacian noise to the count of every sensitive value in each equivalence class.The probabilistic approach suggests a new method called(d,α)-linkable.It tries to limit the likelihood of an adversary completing a composition attack by ensuring that thedpersonal values are associated with a quasi-identifying group with a probability ofαby exploring the correlation between the QI attributes and SAs.
Mohammed[36]proposed the first noninteractive-based approach called e-DP based on the generalization method.The proposed solution produces a generalized contingency table and adds noise to the counts.The e-DP provides a strong privacy guarantee for statistical query answering and protection against composition attacks by differential privacy-based data anonymization[12,19,31,38,39]showed that using e-DP to protect against composition attacks generates substantial data utility losses during anonymization.
The most recent measure correlation-based methods are slicing [24]and merging [12].Slicing has received substantial attention for privacy-preserving data publishing,which is considered a novel data anonymization approach.The authors presented a risk disclosure prevention concept that is devoid of generalization.Random slicing permutates the values of attributes in the bucket to annul the column-wise relationships.This method protects the privacy of the published records from attribute and membership disclosure risks.In addition,slicing is recommended for high-dimensional data anonymization because it keeps more data utility than the generalization of attribute values.Therefore,slicing ensures data privacy and preserves data utilities because the attribute values are not generalized.It uses vertical partitioning(attribute grouping)and horizontal partitioning(tuple partition),and its sliced table should be randomly permutated[24](see Tab.1).
However,slicing can cause data utility and privacy-preserving problems,as slicing randomly permutates attribute values in each bucket,creating invalid tuples that negatively affect the utility of the published microdata.The invalid tuples may easily result in several errors and incorrect results in process challenges.An attacker can rely on the analysis of the fake tuples in the published table to capture the concept of the deployed anonymization mechanism,having the chance to violate the privacy of published data[5,7,40].
For instance,in Tab.1,tuplet1has just one matching equivalence class that is linked with two sensitive values for zip code 130350.Here,any person may be linked with sensitive values with a probability of not more than 1/l via l-diverse slicing because slicing has been shown to satisfy l-diverse slicing by being linked with the sensitive values by 1/2.If the QI attribute,namely,the zip code is revealed because it has high identical attribute values (sufficient variety) and an adversary relies on background knowledge and has a knowledge of(23,M),then the adversary can determine the SA for the individual.Moreover,if the slicing algorithm switches the sensitive value (randomly) betweent1andt2,then incompatibility is created between the SA and QI attribute values,as mentioned in[40].
Hasan et al.[12]designed the merging approach to protecting personal identity from disclosure.It is considered an extension of slicing approach.The primary aim of the merging approach is privacy preservation in multiple independent data publications via cell generalization and random attribute value permutation to break the linkage between different columns.To compute data utility and privacy risks,the merging approach that preserves data utility has minor risks because it increases the false matches in the published datasets.However,the major drawback of merging is the random permutation procedure for attribute values to break the association between columns.Besides increasing the false matches for unique attributes in the published datasets,these procedures may generate a small fraction of fake tuples but result in many matching buckets(more than the original tuples).This will eventually lead to loss data utility and can produce erroneous or infeasible extraction of knowledge through data mining operations[41,42].Therefore,the primary reason for revealing people’s identity is the existence of unique attributes in the table or allowing several attributes in the row to match the attributes in other rows,leading to the possibility of accurately extracting the attributes of a person[7,12,24].
Other studies[8,24]proved the importance of allowing a tuple to match multiple buckets to ensure protection against attribute and membership disclosure.This finding implies that mapping the records of an individual to over one equivalence class results in the formation of a super equivalence class from the set of equivalence classes.
In this study,the proposed UL method preserves the privacy of the published data while maintaining its utility.The UL method uses the upper level to overcome the identical high values in every equivalence class.However,it uses the lower level to overcome the unique attributes found in every equivalence class.It also uses swapping to break the linkage between the unique attributes and the attributes with identical high values to improve the diversity in our work and increase personal privacy.Worth mentioning that the attributes have been generalised to the switching ability of them and the associated issues.The primary goal of swapping or generalizing the attribute values is to get anonymized data.
This section presents the UL method to be used in the enhanced protection method of data publishing while maintaining data utility.The proposed method reduces the risk of a composition attack when multiple organizations independently release anonymized data.The primary goal of this work is to get a specified level of privacy with minimum information loss for the intended data mining operations.The UL method proposed comprises four main stages,as illustrated in Fig.1.The following four subsections describe these four stages.
Figure 1:General block diagram of the UL method
A standard machine learning dataset known as the“Adult”dataset was used for the experiments.This dataset was assembled by Ronny Kohavi and Barry Becker and drawn from the 1994 United States Census Bureau data[43].The dataset comprised 48,842 tuples with fifteen QI attribute values.
The utilized tableThasaiattributes,wherei=1,2,...n.The highly correlated attributes are clustered into columns and uncorrelated attributes are in the other columns,such that each attributeaibelongs to one subset.Colicolumns{col1,col2,...coln,}contain all the attributesai.The grouping of the related attributes is based on the inter attribute relationship measurement,which is ideal for privacy and utility.Regarding data utility,the grouping of highly correlated attributes ensures the preservation of their interattribute relationships.However,in terms of privacy,the identification risk is relatively higher due to the association of uncorrelated attributes compared with the association of highly correlated attributes because of the less frequent association of uncorrelated attribute values;hence,they are more identifiable.For privacy protection,breaking the linkages between the uncorrelated attributes is better [24].The appropriate measure of association for this situation is MSCC because most of the attributes are categorical.Assume attributea1with value domain {v11,v12,...v1d1,},attributea2with value domain {v21,v22,...v2d2,},and their domain sizes ared1andd2,respectively.The MSCC betweena1anda2is defined as follows:
wherer2(a1,a2)is the MSCC betweena1anda2attributes;fi.andf.jare the fractions of occurrence ofv1iandv2jin the data,respectively;andfijis the fraction of cooccurrence ofv1iandv2jin the data.Therefore,fi.andf.jare the marginal totals offij:
Table 1:Published data by slicing
Table 2:Example of partitions in table T
Table 3:Five changes of swap rates for LPL to calculate the number of cells and tuples in each change
Given the computation of correlation (r) for each pair of the attributes,the dataset is vertically and horizontally partitioned in the table.In vertical partitions,the k-medoid clustering algorithm,also known as the partitioning around medoids algorithm [44],is used to arrange the related attributes into columns such that each attribute belongs to one column.This algorithm ensures each attribute is determined just as a point in the cluster space,and the inter-attribute distance in the cluster space is given as,which ranges from 0 to 1.When the two attributes are strongly correlated,the distance between the related data points will be smaller in the clustering space.
Tab.2 shows three partitions for the columns:(1)T*contains all columns with highly correlated attributescol*,wherecol*={col*1,,col2*,...coli*},andcol*∈T*.(2)T**contains all columns with unc orrelated attributescol**,wherecol**=contains columns with SAcolcwhen a single SA exists,and its SA is placed in the last position for easy representation,wherecolc∈Tcand(T*∩T**)∩Tc=T,(i=1,2...n).If the data contain multiple SAs,one can either consider them separately or consider their joint distribution[45].
In the horizontal partition,all tuples that contain identical values are grouped into buckets or equivalence classes.Each individual is linked with one distinct sensitive value,such that an attacker could not have access to the person’s sensitive values with a probability of over 1/l.The Mondrian[46]algorithm is used to group the tuples.
This stage explains the data protection method proposed in this study using the UL method.It is an opportunity to improve the protection level and resolve issues on privacy with the preservation of data utility via two steps:
3.4.1 Creation of Protection Levels
The key parameters used to improve the protection level in slicing includeLPLandUPL.This study usedLPLandUPLto overcome the unique attributes,and the attributes have identical high values in every equivalence class.Both protection levels define the protection interval around the unique attribute values and identical high values,that fall within this period inT**,such that the attacker would find deducing sensitive information difficult and identifying the record owner within such an interval impossible.The lower levels overcome the unique attribute values,whereas the upper levels overcome the high values identical to the individual’s privacy protection.Suppose the cells have high values of the correlation coefficient(r).In that case,the probability of cells is in the same equivalence class,and by linking these cells with other cells inT*,the adversary has high confidence around the SA,leading to a privacy breach.The rest of the cells are protected from attribute disclosure and membership disclosure because of their presence in over one equivalence class.The proposed privacy goal further requires the range of the rest of the cell groups to be larger than a certain threshold(containing diversity that is at least ≥2 in each equivalence class(see algorithm 1).The upper and lower protection levels(UPLandLPL)aim to find a set of unique cell values and high identical values for cells fromT**,which are presumed known to any attacker:
The attributes that fall within this period,which will be swapped,are called the swapping attributes,andandare the numbers of cells that fall within this period.Values that have been initially marked to be swapped are called swap rate,denoted byΦ.Typically,Φis of the order of 1%-10%;thus,the fraction of attributes swapped will be less than one.
Definition 1 (Cell):A cell is a pair of attributes,such as (age,gender),where any cellCcol,Eis identified by the number of columnsColiand the number of an equivalence classEj.For example,in Tab.1,any cell in column {(Age,Gender)} is identified byColiandEj,where 1 ≤i≤coland 1 ≤j≤Eand the first equivalence class consists of tuples t={t1,t2,t3,t4}.
Definition 2(Matching Buckets):Letcol**be the columns,wherecol**={col*1*,,col2**,...coln**},andcol**∈T**.Lett**be a tuple,andt**|coli**|be thecoli**value oft**.LetE*be an equivalence class in the tableT**,andE**|coli**|be the multiset ofcoli**values in the equivalence classE**.E**is a matching bucket oft**iff for all 1 ≤i≤col**,t**|coli**|∈E**|coli**|.
Definition 3 (Lower and Upper Protection Level):LPLandUPLare correlation coefficient (r)values for each cellinLPL and UPL∈r.
Algorithm 1:Creation of Protection Levels attributes Input:Table T Output:Defining a set of attributes a**i that contain value of r that fall in Ccol,E and Ccol,E 1.attribute grouping-stage 2 2.table partition-stage 3 3. for each equivalence class E in T**do 4.Ccol,E:correlation coefficient(r)for attributes in Φ ≤a**i<1.0 5.Ccol,E:correlation coefficient(r)for attributes is in 0.0<a**i ≤Φ 6.Ccol,E:correlation coefficient(r)for attributes in Ccol,E <a**i<Ccol,E 7.Swapping or generalisation of attributes a**Ccol,E (algorihm 2)8.Swapping or generalisation of attributes a**i in(algorithm 2)9.Ensure the l-diversity of all equivalence classes to satisfy privacy requirement as in[24].i in Ccol,E
Given the computation of correlation (r) for each pair of the attributes,attributea*i*values are grouped into three groups:1)contains all attribute values,that have a correlation coefficient(r)falling within this periodΦ≤a*i*<1.0(seeLine 4).2)contains all unique attribute values,that have a correlation coefficient(r)falling within this period 0.0<a*i*≤Φ(seeline 5).3)Ccol,Econtains the rest of the cells that contain a distant association value fromandand fall within this period(seeline 6).Ccol,Eare characterized by the presence probability it’s in multiple equivalence classes,which leads to the prevention of attribute disclosure.Line 9is a check for the ldiversity privacy requirement as in slicing[24].Moreover,these cells must contain diversity that is at least greater than or equal to two(diversity ≥2)and distributed in each equivalence class.
3.4.2 Swapping or Generalisation of Attributes
Swapping or generalization of attributes is the anonymization stage,where randomly permutated values in an equivalence class may not be protected from attribute or membership disclosure because the permutation of these values increases the risk of attribute disclosure,rather than ensuring privacy[40].Therefore,the proposed algorithm in this study ensures the privacy requirement in each equivalence class.Rank swapping is used to break the linkage between the unique attributes and the cells with identical high values to improve the diversity in slicing and increase personal privacy.Attribute swapping alters the tuple data with unique attribute values or identical high values by switching the values of attributes across pairs of records in a fraction of the original data.With not being able to swap attributes,the attributes have to be generalized.The primary goal of switching or generalizing the attribute values is to get the anonymized tableT,which would not have any nonsensical combinations in the record (invalid tuples) and would satisfy the l-diverse slicing (see Algorithm 2).
Algorithm 2:Swapping or Generalisation of Attributes Input:Table T Output:Obtain the Anonymized Table T*1.Check if swapped attributes are in the same rank group.2.Check if the tuple does not have any nonsensical combination.3.Swap the attributes values to satisfy k-anonymity.4. else 5.Generalize the attributes value to satisfy k-anonymity.
To ensure the integrity of attribute swapping,the values of an attributea*i*are ranked in groups,for example,Level0in Fig.2 has two groups:{Federal-gov,Local-gov,State-gov}and{Self-emp-inc,Self-emp-not-inc}.In line 3,the value is swapped between two attributes if the two attributes are in the same rank group and have no nonsensical combinations.If the two attributes are in different groups or if the records have any nonsensical combination,the attribute values are generalized to satisfy k-anonymity (seeline 5).Hence,the whole equivalence class is not generalized during attribute generalization;hence,it provides an opportunity to improve data utility compared with full table or column generalization.It also improves the utility of the published dataset.In addition,attribute swapping or generalization provides greater information veracity when deciding.Veracity is the reliability of data and represents the meaningfulness of depending on such data for data mining operations[12,40].
Figure 2:Example of domain (left) and value (right) generalization hierarchies for the work-class attributes
Definition 4(Attribute Generalization):LetT**be part of tableT,anda*i*be a QI attribute set inT**.Generalisation replaces the QI attribute values with their generalised version.Letdi**anddj**be two domains with dimensional regionsandrespectively,whereIf the values ofdj**are the generalisation of the values in domaindi**,denotedi**<dj**(a many-to-one value generalisation approach).Generalisation is based on a domain generalisation hierarchy that is defined as a set of domains whose ordering is totally based on the relationshipdi**<dj**(see Fig.2).
Fig.2 (right) shows a domain generalization hierarchy for the work-class (WC) attributes.No generalization is applied at the bottom of the domain generalization hierarchy for the WC attributes.However,the WC is increasingly more general in the higher hierarchy levels.The maximal domain level element is a singleton,which signifies the possibility of generalizing the values in each domain to a single value.
The Adult dataset,which included a real dataset,was used[43].This experiment was implemented using the Python language.To perform the experiments,independent datasets were needed to simulate the actual independent data publishing scenario.Five disjoint datasets of different sizes were pooled from the Adult dataset and extracted into two independent datasets called the Education and Occupation dataset with eight QI attribute values:age (continuous,74),marital status (categorical,7),sex(categorical,2),work class(categorical,8),salary(categorical,2),relationship(categorical,6),education (categorical,16) and occupation (categorical,14).The values in the parenthesis show the type of attribute and the number of classifiers for each attribute.
Each dataset has 4 K tuples that were randomly selected.The remaining 8 K tuples were used to generate the overlapping tuple pool and check for composition attacks.Five copies were made for each group of the remaining tuple pool by inserting 100,200,300,400,and 500 tuples into the Education and Occupation datasets to generate datasets with sizes of 4.1,4.2K,4.3K,4.4K,and 4.5(where K=1000)for the Education and Occupation dataset,respectively.
The experiments on real datasets were presented in two parts.The first part has measured the desired level of protection.In the second part,the proposed method was tested for effectiveness against composition attacks,and the effectiveness of the proposed method in data utility preservation and privacy were evaluated compared with other existing works.The experimental results showed that the proposed method provided privacy protections against the considered attacks by maintaining good level of data utility.
The desired level of protection was determined by determining the unique attributes and grouping the identical data(matching of attributes)into tables.As mentioned earlier,the correlation coefficient(r)plays a significant role in determining the strength of the relationship between attributes.TheLPLdetermines all cells with unique attribute values,and the values of the attributes fall between the range of 0.0<LPL≤Φ.The value of r for unique attributes is always close to 0 but does not equal 0.UPLdetermines all cells with many matching attributes of which their values fall withinΦ≤UPL<1.0.The value of r for these matching attributes is always close to 1 but not equal to 1.The rest of the cells containing a distant association value from 1 and 0 are characterized by the probability of multiple equivalence classes,leading to the prevention of attribute disclosure.Moreover,these cells must contain at least greater than or equal two(diversity ≥2)and distribute in each equivalence class.
The attributes that fall within this period(LPLandUPL)that will be swapped are called swapping attributes,whereas the values that are marked for swapping and considered a measure of privacy are called the swap rate and denoted byΦ.The decision-maker must specify this based on the disclosure risk and data utility by looking at the measures of the strength of the relationship between attributes.
Using the experiment datasets partitioned according to Tab.2 and based on the Education dataset,five swap rates were performed on partitionsT**to find the number of cells and tuples in eachLPLandUPL.Tab.3 tabulates the number of cells that fall in the tuples that contain the swapping attributes.Cells with unique attributes or near-unique attributes are potentially riskier than other elements.Tab.4 lists the number of cells that fall in the tuples with the matching attributes,not variety.Cells with matching attributes or near matching attributes are riskier than other elements because almost all tuples are in the same equivalence class.The adversary has more confidence around the SAs by linking these attributes with the highly correlated attributes or other datasets.
2 302 0 34 0 36 0.85#of tuples for each cell 1 1626 1674 1412 1787 1780 0.90Φ=#of cell 2 30303 1 22222#of tuples for each cell 2 264 0000 1 1412 1787 1626 1674 1780 10000#of cells 22222 2 1.0 1#of tuples for 0.95Φ=00000 each cell 2 UPL<1 1626 1674 1412 1787 1780 Data setΦ ≤0.98Φ=#of cells 2 00000 1 22222#of tuples for each cell 2 00000 0.99Φ=1 1626 1674 1412 1787 1780#of cells 2 00000 1 22222#of tuples for each cell 2 00000 Φ=1 1362 1412 1412 1501 1555#of cells 2 00000 1 11211 4.1K 4.2K 4.3K 4.4K 4.5K
The strength of the association between attributes was used because the strength and variety of data was known.Then,LPLandUPLwere used to find the specific attributes to swap between them instead of a random approach to breaking the correlations between the attribute values.This method provided more variety of data in the equivalence class.A higher swap rate(Φ)in Tab.3 or a lower swap rate in Tab.4 means higher privacy but decreased data utility.
From many data publishers,the Single publication model is considered a non-interactive data publishing used in experimental analysis.The experiment was carried out in non-interactive privacy settings.However,most of the work in differential privacy[47]is in line with the interactive settings;a user can gain access to the data set using a numerical query as the anonymization technique will add noise to query answers.The environment may not always favor this phenomenon because datasets are usually known to be published in public.As a result,the non-interactive setting was chosen for the experiment on differential privacy,which is highlighted in[36].
This section contains the assessment of the proposed work,which is achieved through the measurement of its efficiency using the hybrid [31],merging [12],e-DP [36],probabilistic [35],Mondrian[46],and composition[37]approaches,in the non-interactive privacy settings.The quasiidentifier equivalence class was given as k-anonymity[16]by the merging,probabilistic,e-DP,hybrid,composition,and Mondrian approaches.To create an equivalence class,k=6 was chosen,where Ldiversity is also given as 6.The main purpose of L-diversity is the preservation of privacy by expanding sensitive values’diversity.The Laplacian noise in an equivalence class for differential privacy is appended to the count of sensitive values[35]with e=0.3 for the e-differential privacy budget.There are basically two factors upon which comparison can be made.One is the data utility,while the other is the risk disclosure.These factors are discussed in the subsections below.
4.2.1 Data Utility Comparison
Privacy preservation is an essential issue in tableTpublication;hence,data utility must also be considered.Data quality is measured based on distortion ratio (DR).TheDRin published data can be measured using several methods [13]to quantify the effect of anonymization on the overall data distortion for data mining.The generalized distortion ratioGDRis one appropriate measure for calculating the[42].The swap and generalize method are used to break the association of the attributes because most of the attributes are categorical.For any two categorical attributes(a*1*,a*2*∈T**),wheretis its taxonomy tree and a nodepintis used to swap or generalise the attributes,theDRwithpis defined as follows:
where|N|denotes the set of all the leaf nodes int,and|Common(a*1*,a*2*)|is the set of leaf nodes in the lowest common tree ofa*1*anda*2*int.
Fig.2 denotes the taxonomy of the WC attributes.If the values ofa*i*anda*j*are in the same rank group and have no nonsensical combinations,then their swap values are equal,and theDRi s 0.Moreover,if the values ofa*i*andare not in the same rank group or have any nonsensical combinations,then,their generalised values are equal toand theDRis equal towheredj,kis the distortion of the attributeof tupletk.
DRis proportional to the distortion of the generalised dataset over the distortion of the fully generalized dataset.Data utility can be estimated by subtractingDRin Eq.(3)[13]as follows:
Figs.3 and 4 show the results of the experiments on data utility,that is made based on data loss on the Education dataset.The proposed work in Fig.3 had a swap rate (Φ) of 2% usingLPLand 98% usingUPL.The proposed work had a swap rate (Φ) of 5% usingLPLand 95% usingUPLin Fig.4.Decision-makers must select the swap rate to determine the protection level required by looking into the changes in swap rates,which helps know the number of cells in each swap rate (Φ)(see Tabs.3 and 4).An increase in swap rate (Φ) inLPLor decrease in the swap rate (Φ) inUPLenhances the privacy while the data utility becomes lower.The assessment of the proposed work,done through its comparison with hybrid[31],merging[12],e-DP[36],probabilistic[35],Mondrian[46],and composition[37]approaches revealed that the data utility obtained by the UL is higher than that of all the known works.Whereas,merging approach had N fake tuples with the same QI values as in the original table,and the sensitive values were assigned to them based on the sensitive value distribution in the initial dataset.Therefore,the proposed approach resulted in lesser data loss than the merging method.The UL method employs selective generalization within the cell when satisfying the privacy requirements is essential;hence,more data utility is preserved.
Figure 3:Data utility on the Education dataset(swap rate(Φ)of 2%using LPL and 98%using UPL)
4.2.2 Measuring Risks
A composition attack is a situation where an intruder tries to identify an individual in the tableTby linking several available records in the microdata to an external database to exploit sensitive information,especially when the intruder has much background information about the relationship between the QI and SAs[48].Therefore,measuring disclosure risk is essentially measuring the rareness of a cell in data publishing.The methods employed for assessing risk disclosure in tableTduring a composition attack are discussed in this section.
Figure 4:Data utility on the Education dataset(swap rate(Φ)of 5%using LPL and 95%using UPL)
Data publishers should strive to measure the risk disclosure of anonymization approach outputs to ensure privacy preservation.This step is key in defining the level of protection needed.Therefore,differentiating the risk disclosure measures is important because the quantity must not depend on how the data representation method is selected.Risk disclosure can be measured by determining the proportion of the genuine matches to the total matches,as expressed in Eq.(4).
The experimental results for the Education datasets are shown in Fig.5,while that of Occupation datasets are shown in Fig.6.The experimental results represented are for disclosure risk ratio(DRR),which is known to define the confidence level of an adversary and can be used to understand the sensitive values on the Education and Occupation dataset.Amongst the approaches,the e-DP approach [36]provided the lowest privacy risks for composition.The proposed solution in [36],probabilistically generated a generalized contingency table and then added noise to the counts.However,it reduced data utility,as discussed in Section A (Data Utility Comparison) and Figs.3 and 4.
Figure 5:Privacy risk for Education dataset(k=6,l=6)
Figure 6:Privacy risk for Occupation dataset(k=6,l=6)
In addition,the hybrid [31]approach yielded a lower probability of inferring the user’s private information than the probabilistic[35],composition[37],Mondrian[46],and merging[12]approaches.The merging approach reduced the probability of composition attack on the published datasets compared with the probabilistic[35],composition[37],and Mondrian[46]approaches.The proposed work could successfully reduce the probability of composition attack on the published datasets by overcoming the unique attributes and high identical attribute values usingUPLandLPL,and providing multiple matching cells in each equivalence class,which led the protection against identity disclosure.
In Fig.7,the experimental results are summarized for disclosure risk ratio(DRR)forLPLandUPLwhenΦ={(1%,99%),(2%,98%),(5%,95%),(10%,90%),(15%,85%)}.As Fig.7 and Tabs.3 and 4 illustrate,when increasing the swap rate(Φ)inLPLor decreasing the swap rate inUPLmeans a higher the privacy but decreased data utility.In this study,the special risk ratio for the composition attack was decreased by overcoming the unique attributes and high identical attribute values by usingUPLandLPL,and providing the multiple matching cells,which confer protection from identity disclosure.Intuitively,a cell is at risk for disclosure if it can be singled out from the rest[49].
4.2.3 Aggregate Query Error
An aggregate query is a mathematical computation that involves a set of values and results in a single value expressing the significance of the data.An aggregate query aims to estimate data utility in the published datasets.Aggregate query operators are often used as‘COUNT’,‘MAX’,and‘AVERAGE’to provide key numbers representing the estimated data utility to verify the effectiveness of the proposed work [50].In the experiment,only the ‘COUNT’operator was tested in this experiment,and the query was considered in the following form:
SELECT COUNT(*)
FROM Unknown-TableT
Where vi1∈Vi1and....vidim∈Vidim and s∈Vs
wherevij (1 ≤j≤d)is the QI value for attributeaij,vij⊆dijanddijis the domain for attributeaij,sis the SA value,s⊆dsanddsis the domain for the SA.Predicate dimensiondand query selectivityselare two characteristics of a query predicate;dindicates the amount of QI in the predicate,andselindicates the number of values in eachvij,1 ≤j≤d.The size ofvij,1 ≤j≤dwas chosen at random from 0,1,...set*|dij|.Each query was run on the original table as well as those generated by the proposed work and other existing works.The original and anonymized table each hada count,with the original count denoted byorgcountand the anonymised count denoted byanzcount,whereanzcountdenotes the proposed work and other existing works,respectively.All queries were computed using Equation[50]to determine the average relative error in the anonymized dataset:
Figure 7:Experimental result for DRR for LPL and UPL when Φ={(1%,99%),(2%,98%),(5%,95%),(10%,90%),(15%,85%)}
Based on the QI selection,the relative query error was plotted on the y-axis in Fig.8.For the Mondrian,hybrid,e-DP,probabilistic,and composition approaches,the value of k was set to 6,and I-diversity was set to 6 for merging and the proposed work,with the value of(LPL=5%andUPL=95%).The relative query error was calculated on anonymized tables created by the proposed work and other existing works,and one,two,three,four,or five attributes were chosen as QI.Furthermore,for the 4.5 K Occupation dataset,all possible variations of the query were created and executed across the anonymization tables.Fig.8 depicts the relative query error,with the value on the y-axis denoting the relative percentage error and the values on the x-axis denoting different QI choices.The experimental results show that the swapping approach (proposed work) consistently outperforms generalization in answering aggregate queries.For anonymized datasets,the competing approaches show a higher relative query error.Furthermore,the experimental results show that the proposed work has a slight relative error as compared to all other approaches.Because in the case of not being able to switch attributes,then they must be generalized.
Figure 8:Aggregate query error
This study started with investigating problems associated with slicing and merging approaches that are related to the random permutation of attribute values,which is used as a way to break the association between different columns in the table.Therefore,the UL method,which confers protection by finding the unique attribute values and high identical attribute values and swapping them to decrease the attribute disclosure risk and ensure attainment of l-diverse in the published table,is proposed against composition attacks.The keyword behind that is selecting group of cells to enhance published data privacy and maintain good data utility.The results of the experiments show that the UL method could improve data utility and provide a stronger privacy preservation.In terms of data utility,the UL method achieves approximately 92.47%data utility higher than works when the percentage of swap rate is 2%usingLPLand 98%usingUPLwith Education dataset size of 4.5 K.It achieves(92.19%)when the percentage of swap rate is 5%usingLPLand 95%usingUPLwith Education dataset size of 4.5 K.Moreover,the UL method potentially reduces risk disclosure compared with other existing works.The achieved performance using our proposed method helps researchers,decision-makers,and technological experts to benefit from the published big data for extracting knowledge in many fields,such as education,healthcare.In future,the proposed work could be extended to several promising directions that may focus on speeding up the performance of UL method using parallel techniques.Moreover,the effectiveness of UL method has been tested against composition attacks,and by using Adult dataset,thus,it is important to test its performance against different attacks and by using different type of datasets.
Funding Statement:This work was supported by Postgraduate Research Grants Scheme(PGRS)with Grant No.PGRS190360.
Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
Computers Materials&Continua2022年8期