数据提取须有德

2022-03-07 13:39 朱利叶斯·切尔尼奥斯卡斯 云天
英语世界 2022年10期
关键词:爬虫;网页;代理

文/朱利叶斯·切尔尼奥斯卡斯 译/云天

近来,互联网正经历着与18 世纪早期“采金热”类似的现象,特别是在数据提取方面。数据因其巨大的价值而被某些分析师称为“新石油”。数据领域仍然对大大小小的参与者开放,但这也导致了若干不专业的行为,甚至有人设法获取有密码保护的数据。

尽管许多网站确实包含IP禁令等防御措施,但由于竞争加剧和各种经济因素,网络爬虫和服务器之间的无形冲突仍在持续,并愈演愈烈。尽管大多数人很乐意利用亿客行、谷歌购物、PriceGrabber 和天巡网等聚合网站的低价优势,但人们并没有意识到上述冲突正发生在不同的电商平台之间。

符合道德的网页数据抓取:目的的重要性

使用工具的目的有好有坏,网页数据抓取也不例外。一种相当常见的情况是以营销为目的抓取个人数据。数亿用户通过电商平台上的服务协议条款同意公开他们的数据,无论他们是否意识到了这一操作。然而,数据遭泄露的问题在于,这些数据由社交媒体机构提取,却为僵尸网站所用。这类网站在未经用户许可的情况下创建个人资料,并罗列出个人的详细信息。

结果,网页数据抓取的负面新闻越来越多,这使得公众对自身数据价值和隐私的认识有所提高。网页数据抓取本身并没有什么不道德的,因为它不过是把人们通常需要手动操作的活动自动化了。主要的区别在于,网页数据抓取使用机器人程序,在极短时间内爬取大量网站、提取海量信息,从而实现更大规模的信息搜集。

提取公开的数据需要代理。简单来说,代理是网络爬虫和服务器之间的中介。使用代理可以将数据请求均匀地分配到服务器,这样能确保以合理的速率请求数据,也可保证请求方匿名。

不道德抓取的后果

不道德抓取所采用的数据提取方式可能损害个人隐私,导致服务器过载。

尽管很多网站试图通过IP禁令来防止不道德抓取,但这渐渐变得徒劳,因为使用了代理,而且这些代理能够模拟人类行为来规避服务器问题。这最终可能导致服务器过载(使在线企业耗费资金)、互联网透明度降低、公众在隐私问题上的不信任加重。

网页数据抓取道德规范是必要的

网页数据抓取大有裨益,但这有赖于有自由且透明的互联网可用。我确信,如果我们能遵循一些准则,使局面对每个人都公平,那么网页数据抓取将有益于整个科技领域:

1. 只抓取公开的网页

2. 研究目标网站的法律文件以确定你依照法律是否接受其服务条款。如果接受,确定自己是否不会违背

3. 合理请求数据以保证服务器功能不受损害(DDoS 攻击)

4. 尊重源网站对所获得的任何数据的隐私保护

5. 使用以合乎道德的手段获取的代理

并非所有代理都是平等的

众所周知,当今正在运行的某些代理,其获取方式并不道德。许多代理通常是人们从下载到个人设备里的应用程序中获取的。很难确定这些用户是否意识到了他们的设备正在被使用。但可以肯定的是,如果用户同意了具有误导性或是容易混淆的服务条款,从而不情愿地将个人设备变成住宅代理网络中的参与者,那么将这类程序用作代理一定是不道德的。

合乎道德的做法能提升公平性与责任心

现代网页数据抓取的某些方面缺乏明确性,需要道德规范来为行业带来秩序。如果业内人士能够就专业的网页数据抓取方法达成共识,这将有助于维护一个公平、开放、自由的网络环境,使企业与消费者双赢。关于数据抓取在各行各业所能发挥的最大潜能,我们对此的了解仍处在早期阶段,所以让我们抓住这个大好时机,以最合乎道德的方式来推动创新、促进发展。

The internet is currently undergoing a phenomenon similar to the gold rushes of the early eighteenth century, specifically when it comes to data extraction. With data now dubbed the “new oil” by some analysts because of its value, the field is still open to small and large players alike, which has led to some unprofessional activities that extend all the way to the acquisition of password-protected data.

While many websites do contain defensive measures such as IP bans, the invisible conflicts between scrapers [1] and servers are ongoing and gaining in intensity, due to increased competition and economic factors. Most people don’t realise these are taking place between e-commerce stores, although they are happily taking advantage of the low prices found on aggregator websites [2] like Expedia, Google Shopping, PriceGrabber and Skyscanner.

[1] scraper: a web crawler, that is, a program or script that automatically collects information from the World Wide Web according to set rules. “Scraping” and “crawling” later in the text both refer to collecting data from the web.

[2] aggregator website: a website that uses technical means to collect popular content from other websites and groups the related links by category to form its own content.

Ethical web scraping: the importance of intention

Tools can be used for positive and negative purposes, and web scraping is no exception. A fairly common scenario is the scraping of personal data for marketing purposes. Hundreds of millions of users agree to release their data through terms of service agreements on e-commerce sites, whether they realise it or not. The issue with the exposed data, however, is that it has been extracted by social media agencies and used by now-defunct websites that create profiles and list personal details without user permission.

As a result, web scraping is increasingly subject to negative press, which has made the public more aware of the value and privacy of their data. There is nothing inherently unethical about web scraping, as it automates activities that people often do manually. The main difference is that web scraping does it on a much bigger scale by using bots to crawl numerous websites and extract huge amounts of information in seconds.

Extracting publicly available data requires proxies [3]. In short, proxies act as intermediaries between the web scraper and web server. Employing proxies allows data requests to be distributed evenly to the web server, ensuring that the data is requested at a fair rate, as well as providing anonymity for the requesting party.

[3] proxy: a network service that lets a client connect to a server through it.
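The article does not describe how requests are spread across proxies, so the following is only a minimal sketch of one common approach, round-robin rotation. The proxy addresses (taken from a documentation-only IP range), the `example.com` URLs and the helper name `assign_proxies` are all hypothetical; the code only plans which proxy would carry each request and sends nothing over the network.

```python
from collections import Counter
from itertools import cycle

# Hypothetical proxy endpoints (documentation-only addresses, not real services).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy in round-robin order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


# Hypothetical target pages on a placeholder domain.
urls = [f"https://example.com/page/{i}" for i in range(6)]
plan = assign_proxies(urls, PROXIES)

# Count how many requests each proxy would carry.
load = Counter(proxy for _, proxy in plan)
```

With six URLs and three proxies, each proxy is assigned exactly two requests, which is the even distribution the paragraph refers to. A real scraper would combine this with a delay between requests to keep the request rate fair.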

The consequences of unethical scraping

Unethical scraping uses data extraction in a way that may compromise [4] privacy and result in server overload.

[4] compromise: to endanger or damage.

While many websites try to prevent it through IP bans, this is becoming futile [5] due to the use of proxies and their function in circumventing [6] server issues by simulating human behaviour. The end results can be server overloads that cost online businesses money, reduced internet transparency and more distrust from the public with respect to privacy issues.

[5] futile: pointless; having no useful effect.

[6] circumvent: to evade (a rule or restriction).

A web scraping code of ethics is necessary

Web scraping has many benefits that depend upon the availability of a free and transparent internet. I believe it would benefit the entire tech space if we adopted a few guidelines in order to make the landscape fair for everyone:

1. Scrape publicly available web pages only

2. Study the target website’s legal documents to determine whether you can legally accept its terms of service and, if you do accept them, whether you can comply without breaching them

3. Make reasonable requests for data in order to ensure that server function is not compromised, as it would be in a DDoS attack [7]

[7] DDoS attack: distributed denial-of-service attack, a type of network attack that aims to exhaust the target’s network and system resources so that it can no longer serve legitimate requests.

4. Respect privacy concerns of source websites with regard to any data obtained

5. Make use of proxies procured in an ethical manner
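Guideline 3 can be illustrated with a small throttle that enforces a minimum delay between consecutive requests. This is a sketch under assumptions, not a method the article prescribes: the class name `Throttle` and the `min_interval` parameter are hypothetical, and what counts as a reasonable interval depends on the target site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time at which the last request was released

    def wait(self):
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds actually waited.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A scraper would call `wait()` immediately before each request, for example with `Throttle(min_interval=2.0)`, so that bursts of requests are spaced out instead of hitting the server all at once.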

Not all proxies are equal

It is commonly known that some proxies operating today are not ethically sourced, with many often obtained through applications downloaded by people on their devices. Whether these individuals are aware that their device is being used is difficult to ascertain. What’s certain is that it’s definitely not ethical to use their devices as proxies in cases where they consented to misleading or confusing terms of service that unwittingly turn their device into a participant on a residential proxy network.

Ethical practices lead to increased fairness and accountability

There are some aspects of modern web scraping activity that are missing clarity, and a code of ethics is needed to bring order to the industry. If those in the industry can come together in agreement over a professional approach to web scraping, it will help to maintain a fair, open and free internet that will benefit both businesses and consumers. We are still in the early stages of discovering the full potential of data scraping in different industries, so let’s take advantage of this golden opportunity to drive innovation and create growth in the most ethical way possible.
