Experimental Study on Privacy Protection for Gene Association Based on the Empirical Wavelet Transform

2020-04-17 08:54  CHEN Hongsong, MENG Caixia, LIU Shuyu
Journal of Hunan University (Natural Sciences), 2020, Issue 2
CHEN Hongsong, MENG Caixia, LIU Shuyu

Abstract: Genome-wide association studies of rheumatoid arthritis raise privacy concerns, and differential privacy has been applied to keep phenotype information (disease status) from leaking while still returning the single-nucleotide polymorphisms (SNPs) most highly associated with the disease. The difficulty is the trade-off between the strength of privacy protection and data availability. To address it, a novel privacy protection method based on the empirical wavelet transform (EWT) is proposed, which balances the two by processing the noise that differential privacy introduces. First, the data produced by the differential privacy noise mechanism are decomposed with the EWT. Second, the kurtosis of each EWT component is calculated and the likely noise components are identified. Finally, a certain number of these noise components are removed and the signal is reconstructed. The resulting data are then used to rank the pathogenic genes by association degree. Experimental results show that the method improves data availability while maintaining the strength of differential privacy protection, achieving a reasonable trade-off between the two.

Key words: privacy protection; empirical wavelet transform (EWT); differential privacy; association degree; data availability

CLC number: TP309                         Document code: A

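Pathogenic gene association analysis is a popular method in genome-wide association studies (GWAS) [1] that analyzes sets of DNA sequences to uncover the genetic basis of disease. Such a study examines the association between a disease and thousands of single-nucleotide polymorphism (SNP) sites in the genes of a specific patient population, assigns each SNP a score, and ranks the most highly associated SNPs by that score. However, even when a GWAS releases only statistics, a patient's disease status can be inferred from the statistical tests of association between each SNP and the disease, which puts patient privacy at risk of disclosure.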
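The scoring-and-ranking step described above can be illustrated with a minimal sketch. It assumes the score is a Pearson chi-square statistic on a 2x2 table of allele counts (cases vs. controls), which is a common choice in GWAS but is not stated in this text; the function names and the SNP identifiers used with them are hypothetical.

```python
from typing import Dict, List, Tuple


def allelic_chi_square(case_minor: int, case_major: int,
                       control_minor: int, control_major: int) -> float:
    """Pearson chi-square statistic for a 2x2 allele-count table.

    Rows are cases/controls, columns are minor/major allele counts.
    (Assumed scoring rule; the source does not name the statistic.)
    """
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if total == 0:
        return 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty margins to avoid division by zero
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def rank_snps(counts: Dict[str, Tuple[int, int, int, int]],
              top_k: int) -> List[str]:
    """Return the top_k SNP identifiers, highest score first."""
    scores = {snp: allelic_chi_square(*c) for snp, c in counts.items()}
    return sorted(scores, key=lambda snp: scores[snp], reverse=True)[:top_k]
```

Calling `rank_snps` on a dictionary that maps each SNP identifier to its four allele counts returns the identifiers with the largest statistics first. These exact scores are what a privacy mechanism would later perturb before release.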

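Many researchers have studied differential privacy as a solution to this problem. Differential privacy is currently the principal privacy protection method in data publishing: it adds noise to query results to frustrate an attacker's attempts to recover the original data, thereby protecting privacy. Its application has greatly improved the efficiency of data publishing, but meeting the differential privacy requirement can demand excessive noise, which harms the correctness and availability of the data and ultimately reduces its utility. To address this, this paper proposes a differential privacy protection method based on the EWT. Rather than relying on noise injection alone, it filters out part of the noise to reach a reasonable compromise between privacy protection and data availability. Because the method only injects, transforms, and filters noise, it does not restore users' private information. The main goal is to protect patient privacy with differential privacy in studies of pathogenic gene association while reducing the error caused by the added noise.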
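A rough sketch of the two ingredients mentioned here, noise injection and kurtosis-based filtering, is given below. It makes several assumptions that are not in the text: the noise is drawn from a Laplace distribution with scale sensitivity/epsilon (the standard Laplace mechanism), the signal has already been split into additive components by some EWT implementation that is not shown, and the rule that decides which kurtosis values indicate noise is supplied by the caller. All function names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_laplace_noise(scores: Sequence[float], sensitivity: float,
                      epsilon: float, seed: Optional[int] = None) -> List[float]:
    """Perturb each score with Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [s + laplace_noise(scale, rng) for s in scores]


def kurtosis(xs: Sequence[float]) -> float:
    """Population kurtosis m4 / m2**2 (3 for a Gaussian, 6 for a Laplace)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) if m2 > 0 else 0.0


def reconstruct(components: Sequence[Sequence[float]],
                is_noise: Callable[[float], bool]) -> List[float]:
    """Sum the components whose kurtosis is NOT flagged as noise.

    `components` stands in for the output of an EWT decomposition, and
    `is_noise` is a hypothetical screening rule; the source does not give
    a threshold.
    """
    kept = [c for c in components if not is_noise(kurtosis(c))]
    length = len(components[0])
    return [sum(c[i] for c in kept) for i in range(length)]
```

Because the filtering operates only on the already-perturbed output and never touches the raw records, it acts as post-processing of the noisy release. This is the property the paragraph above relies on when it states that private information is not restored.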

1   Related Techniques

1.1   Differential Privacy

1.1.1   Definition

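The main idea of differential privacy is to add noise to every record in a dataset so that a given statistic computed on one dataset is similar to the same statistic computed on any neighboring dataset, which bounds the probability of data leakage and thereby protects privacy. Two datasets are neighboring if they differ in at most one record; that is, if one database is a proper subset of the other, the larger database contains exactly one more row than the smaller.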
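The neighboring-dataset condition above is usually stated formally as follows. The notation is assumed here rather than taken from the text: \mathcal{M} is the randomized mechanism, D and D' are neighboring datasets, S is any set of possible outputs, and \varepsilon is the privacy budget.

```latex
\[
  \Pr\bigl[\mathcal{M}(D) \in S\bigr]
  \;\le\; e^{\varepsilon}\,\Pr\bigl[\mathcal{M}(D') \in S\bigr]
  \quad \text{for all neighboring } D, D'
  \text{ and all } S \subseteq \mathrm{Range}(\mathcal{M}).
\]
```

A smaller \varepsilon forces the two output distributions closer together, which means stronger privacy and, in general, more noise.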

4   Conclusion

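To address the low data availability caused by adding differential privacy noise in experiments that rank pathogenic genes by association degree, this paper proposed a differential privacy protection method based on the EWT, designed its implementation steps, and verified its feasibility and correctness through experiments. The results show that, while maintaining the strength of differential privacy protection, the method considerably improves the availability and accuracy of the ranking data, achieving an effective trade-off between privacy protection strength and data availability. Future work will study how to reduce the time complexity of the privacy protection algorithm without sacrificing its accuracy.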

References

[1]    JOHNSON A, SHMATIKOV V. Privacy-preserving data exploration in genome-wide association studies[C]// Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Chicago: ACM, 2013: 1079-1087.

[2]    HE X M, WANG X Y, CHEN H H, et al. Study on choosing the parameter ε in differential privacy[J]. Journal on Communications, 2015, 36(12): 124-130. (In Chinese)

[3]    HAN Z, LIU H, WU Z. A differential privacy preserving framework with Nash equilibrium in genome-wide association studies[C]// 2018 International Conference on Networking and Network Applications (NaNA). Xi'an: IEEE, 2018: 91-96.

[4]    XIONG P, ZHU T Q, WANG X F. A survey on differential privacy and applications[J]. Chinese Journal of Computers, 2014, 37(1): 101-122. (In Chinese)

[5]    ZHANG X J, MENG X F. Differential privacy in data publication and analysis[J]. Chinese Journal of Computers, 2014, 37(4): 927-949. (In Chinese)

[6]    GILLES J. Empirical wavelet transform[J]. IEEE Transactions on Signal Processing, 2013, 61(16): 3999-4010.

[7]    SIMMONS S, SAHINALP C, BERGER B. Enabling privacy-preserving GWASs in heterogeneous human populations[J]. Cell Systems, 2016, 3(1): 54-61.

[8]    DISHABI M R E, AZGOMI M A. Differential privacy preserving clustering using Daubechies-2 wavelet transform[J]. International Journal of Wavelets, Multiresolution and Information Processing, 2015, 13(4): 1550028.

[9]    LIU C, XIE H, XIAO Y L, et al. Research on empirical wavelet transform algorithm in ECG signal filtering[J]. Journal of Electronic Measurement and Instrument, 2017, 31(11): 1835-1842. (In Chinese)

[10]   SADAT M N, AZIZ A, MOMIN M, et al. SAFETY: Secure GWAS in federated environment through a hybrid solution[J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019, 16(1): 93-102.

[11]   WAN Z, VOROBEYCHIK Y, XIA W, et al. Expanding access to large-scale genomic data while promoting privacy: A game theoretic approach[J]. The American Journal of Human Genetics, 2017, 100(2): 316-322.

[12]   LI Z J, CHEN M, ZENG L J. A genome-wide epistasis analysis method based on multiple criteria fusion[J]. Journal of Hunan University (Natural Sciences), 2016, 43(10): 160-165. (In Chinese)

[13]   ZHANG P, SUN X, WANG T, et al. An accelerated fully homomorphic encryption scheme over the integers[C]// 2016 4th International Conference on Cloud Computing and Intelligence Systems (CCIS). Beijing: IEEE, 2016: 419-423.
