劳拉·哈努等 郭晓阳
Machine-learning systems could help flag hateful, threatening or offensive language. 机器学习系统可帮助标记仇恨性、威胁性或攻击性言论。
Social platforms large and small are struggling to keep their communities safe from hate speech, extremist content, harassment and misinformation. One solution might be AI: developing algorithms to detect and alert us to toxic and inflammatory comments and flag them for removal. But such systems face big challenges.
The prevalence of hateful or offensive language online has been growing rapidly in recent years. Social media platforms, relying on thousands of human reviewers, are struggling to moderate the ever-increasing volume of harmful content. In 2019, it was reported that Facebook moderators were at risk of suffering from PTSD as a result of repeated exposure to such distressing content. Outsourcing this work to machine learning can help manage the rising volumes of harmful content. Indeed, many tech giants have been incorporating algorithms into their content moderation1 for years.
One such example is Google's Jigsaw2, a company focusing on making the internet safer. In 2017, it helped create Conversation AI, a collaborative research project aiming to detect toxic comments online. However, a tool produced by that project, called Perspective, faced substantial criticism. One common complaint was that it created a general “toxicity score” that wasn't flexible enough to serve the varying needs of different platforms. Some websites, for instance, might require detection of threats but not profanity, while others might have the opposite requirements.
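To make the flexibility point concrete: a platform that cares about threats but not profanity would want per-attribute scores rather than one blanket number. The sketch below is a rough illustration of such a request against Perspective's public Comment Analyzer endpoint; the API key is a placeholder, and the attribute names and response shape are assumptions based on the publicly documented API, so verify them against the current documentation before relying on this.

```python
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder; a real key is required
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

# Request only the attributes this hypothetical platform cares about:
# threats, but not profanity.
payload = {
    "comment": {"text": "Say that again and you will regret it."},
    "languages": ["en"],
    "requestedAttributes": {"THREAT": {}, "TOXICITY": {}},
}

response = requests.post(URL, json=payload, timeout=10)
response.raise_for_status()

for attribute, result in response.json()["attributeScores"].items():
    # summaryScore.value is a probability-like score between 0 and 1
    print(attribute, round(result["summaryScore"]["value"], 3))
```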
Another issue was that the algorithm learned to conflate toxic comments with nontoxic comments that contained words related to gender, sexual orientation, religion or disability. For example, one user reported that simple neutral sentences such as “I am a gay black woman” or “I am a woman who is deaf” resulted in high toxicity scores, while “I am a man” resulted in a low score.
Following these concerns, the Conversation AI team invited developers to train their own toxicity-detection algorithms and enter them into three competitions (one per year) hosted on Kaggle, a Google subsidiary known for its community of machine-learning practitioners, public data sets and challenges. To help train the AI models, Conversation AI released two public data sets containing over one million toxic and non-toxic comments from Wikipedia and a service called Civil Comments. Some comments were seen by far more than 10 annotators (up to thousands), owing to sampling and the strategies used to enforce rater accuracy.
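To make the shape of that training data concrete, the short sketch below loads the first challenge's training file with pandas; the file path is a placeholder and the column names are assumptions based on the public Kaggle release of the Wikipedia comments.

```python
import pandas as pd

# Placeholder path to a local copy of the first challenge's training data;
# the label columns below are assumed from the public Kaggle release.
train = pd.read_csv("jigsaw_train.csv")
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

print(train[["comment_text"] + labels].head())
# Per-label prevalence: most comments carry no toxic label at all.
print(train[labels].mean().round(4))
```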
The goal of the first Jigsaw challenge was to build a multilabel toxic-comment classification model with labels such as “toxic”, “severe toxic”, “threat”, “insult”, “obscene”, and “identity hate”. The second and third challenges focused on more specific limitations of the Perspective API: minimizing unintended bias towards predefined identity groups and training multilingual models on English-only data.
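In a multilabel setup, each comment can carry any combination of those labels at once. As a minimal baseline sketch of the task (not the winning Kaggle approaches, which used ensembles of multiple trained models), the code below fits one logistic-regression classifier per label on top of TF-IDF features, reusing the column names assumed above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("jigsaw_train.csv")  # placeholder path, columns as assumed above
X_tr, X_va, y_tr, y_va = train_test_split(
    train["comment_text"], train[LABELS], test_size=0.1, random_state=0
)

# Word-level TF-IDF features plus one independent binary classifier per label:
# a simple but reasonable baseline for multilabel toxicity classification.
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
features_tr = vectorizer.fit_transform(X_tr)
features_va = vectorizer.transform(X_va)

classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
classifier.fit(features_tr, y_tr)

probabilities = classifier.predict_proba(features_va)
print(f"macro ROC AUC: {roc_auc_score(y_va, probabilities, average='macro'):.4f}")
```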
Our team at Unitary, a content-moderation AI company, took inspiration from the best Kaggle solutions and released three different models corresponding to each of the three Jigsaw challenges. While the top Kaggle solutions for each challenge use model ensembles, which average the scores of multiple trained models, we obtained similar performance with only one model per challenge.
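These released models are, to the best of my knowledge, distributed as Unitary's open-source `detoxify` Python package; the package name, the model names below and the shape of the output are assumptions about that packaging rather than something stated in this article. A minimal usage sketch:

```python
# pip install detoxify
from detoxify import Detoxify

# Assumed model names, one per Jigsaw challenge: "original" for the multilabel
# toxicity task, "unbiased" for the unintended-bias challenge and
# "multilingual" for the multilingual challenge.
model = Detoxify("original")

# predict() accepts a single string or a list of strings and returns a
# dictionary mapping each label to a probability-like score.
print(model.predict("You are a wonderful person"))
print(Detoxify("unbiased").predict([
    "I am a gay black woman",
    "I am a man",
]))
```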
While these models perform well in many cases, it is important also to note their limitations. First, these models will work well on examples that are similar to the data they have been trained on, but they are likely to fail when faced with unfamiliar examples of toxic language.
Furthermore, we noticed that the inclusion of insults or profanity in a text comment will almost always result in a high toxicity score, regardless of the intent or tone of the author. As an example, the sentence “I am tired of writing this stupid essay” will give a toxicity score of 99.7 percent, while removing the word “stupid” will change the score to 0.05 percent.
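That sensitivity is easy to probe: score the same sentence with and without the offending word and compare. The sketch below does this with the package assumed above; the exact numbers will differ by model and version, and the "toxicity" output key is likewise an assumption.

```python
from detoxify import Detoxify  # assumed packaging of the released models

model = Detoxify("original")

sentences = [
    "I am tired of writing this stupid essay",  # contains a mild insult
    "I am tired of writing this essay",         # same intent, word removed
]

for text in sentences:
    scores = model.predict(text)
    # "toxicity" is the assumed name of the overall score in the output dict.
    print(f"{scores['toxicity']:.4f}  {text}")
```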
Lastly, all three models are still likely to exhibit some bias, which can pose ethical concerns when used off-the-shelf3 to moderate content.
Although there has been considerable progress on automatic detection of toxic speech, we still have a long way to go until models can capture the actual, nuanced, meaning behind our language—beyond the simple memorization of particular words or phrases. Of course, investing in better and more representative datasets would yield incremental improvements, but we must go a step further and begin to interpret data in context, a crucial part of understanding online behavior. A seemingly benign text post on social media accompanied by racist symbolism in an image or video would be easily missed if we only looked at the text. We know that lack of context can often be the cause of our own human misjudgments. If AI is to stand a chance of replacing manual effort on a large scale, it is imperative that we give our models the full picture.
大大小小的社交平台都在竭力保障用户远离仇恨言论、极端内容、网络骚扰及错误信息。人工智能或可成为一种解决方案:开发算法来检测恶意和煽动性言论,打上删除标记,并向我们发出警告。但此类系统面临重大挑战。
近年来,网上的仇恨言论或攻击性语言激增。社交媒体平台依靠数千名人工审核员,难以审核持续增长的有害内容。据报道,2019年,脸书公司的审核员由于反复接触此类令人痛苦的内容,面临罹患创伤后应激障碍的风险。把这项工作交由机器学习完成,有助于解决有害内容数量不断攀升的问题。事实上,近年来,许多大型科技公司已经把算法集成到内容审核中。
谷歌旗下的Jigsaw公司即为一例。Jigsaw是一家专注于提升互联网安全性的公司。2017年,它帮助创建了Conversation AI。这是一个旨在检测网上恶意评论的合作研究项目。然而,这个项目推出的一款名为Perspective的工具却遭到广泛批评。一条常见的投诉意见是,此工具生成的综合“恶意评分”不够灵活,无法满足不同平台的各种需求。例如,有些网站可能需要检测威胁言论,而非不雅语言,而另一些网站的需求可能正好相反。
另一个问题是,算法学习将恶意评论与含有性别、性取向、宗教信仰或残障相关字眼的非恶意评论混为一谈。例如,一位用户报告称,诸如“我是一名同性恋黑人女性”或“我是一名耳聋女性”等中性句会得到高恶意评分,而“我是个男人”的恶意评分却很低。
为回应这些关切,Conversation AI团队邀请开发者训练自己的恶意检测算法,并参加在Kaggle平台举办的三项算法竞赛(每年一项)——Kaggle是谷歌公司的子公司,以旗下的机器学习从业者社区、公共数据集和挑战赛而闻名。为帮助训练人工智能模型,Conversation AI公布了两个公共数据集——包含一百余万条来自维基百科的恶意和非恶意评论,以及一个名为“文明评论”的服务。由于采样和为加强评分者准确率所采用的策略等原因,部分评论由远超十名(最多数千名)的注释者审阅。
Jigsaw公司第一个挑战的目标是创建一个多标签恶意评论分类模型,其标签包含“恶意”“严重恶意”“威胁”“侮辱”“淫秽”“身份仇恨”等。第二及第三个挑战则专注于解决更加具体的API限制:最大限度减少对预定义身份群体的无意识偏见,以及用纯英语数据训练多语言模型。
Unitary公司是一家内容审核人工智能公司,我们在Unitary的团队从最优秀的Kaggle方案中得到启发,公布了三种不同的模型,分别对应Jigsaw公司的三项挑战。每项挑战的顶级Kaggle方案均采用模型集成,对多个训练好的模型分数取平均值,而我们每项挑战只使用一个模型,就取得了相似的表现。
这些模型在许多情况下表现良好,但也要注意到它们的局限性。首先,这些模型在样例与训练数据近似时,会有良好表现。但若处理不熟悉的恶意语言样例,很可能失效。
此外,我们注意到,在文本评论中包含侮辱性或不雅语言,几乎总会得到高恶意评分,不管作者的意图或语气如何。例如,“我不想再写这篇讨厌的文章了”这句话会得到99.7%的恶意评分,而把“讨厌的”这个词删除,评分会变为0.05%。
最后一点,这三个模型仍有很大可能展现出某种偏见,若直接用于内容审查,可能引发伦理问题。
虽然恶意言论自动识别技术已取得长足进步,但要超越只记忆特定单词或短语,发展到模型能捕捉语言背后实际的、微妙的含义,我们还有很长的路要走。当然,投入构建更好、更具代表性的数据集会带来渐进性改善,但我们必须更进一步,着手在语境下解读数据,这是理解网上行为的关键一环。发布在社交媒体上的一段文字表面似乎无害,但附带的图片或视频中含有种族歧视符号,倘若我们只关注文字本身,就很容易将其漏掉。我们知道,缺乏语境经常导致人类产生误解。人工智能若要大量代替人力操作,我们必须给模型提供全景信息。
(译者为“《英语世界》杯”翻译大赛获奖者)
1 content moderation 内容审核,是基于图像、文本、视频的检测技术,可自动检测涉黄、广告、涉政、涉暴、涉及敏感人物等内容,对用户上传的图片、文字、视频进行内容审核,帮助客户降低业务违规风险。
2 由谷歌建立的一家技术孵化公司(其前身为谷歌智库部门Google Ideas),主要负责创建技术工具来减少并遏制线上虚假信息、骚扰以及其他问题。
3 off the shelf(产品)现成的,不需定制的。文中充当副词,用作状语。