Documentation, Information & Knowledge (图书情报知识), 2024, Vol. 41, Issue (2): 28-38, 149. doi: 10.13366/j.dik.2024.02.028

• Academic Focus · Governance of Artificial Intelligence Generated Content (AIGC) •

Identification Methods of Artificial Intelligence Generated Content in Online Communities

DENG Shengli, WANG Fan, WANG Haowei

  1. School of Information Management, Wuhan University, Wuhan, 430072
  • Online: 2024-03-10 Published: 2024-05-14
  • Corresponding author: WANG Fan (ORCID: 0000-0003-0100-0320), doctoral candidate; research interests: generative artificial intelligence, machine learning. E-mail: 1161252028@qq.com
  • About the authors: DENG Shengli (ORCID: 0000-0001-7489-4439), PhD, professor; research interests: user information behavior, humanistic artificial intelligence. E-mail: victorydc@sina.com. WANG Haowei (ORCID: 0000-0002-5085-6574), master's student; research interest: generative artificial intelligence. E-mail: 2018301040137@whu.edu.cn
  • Supported by:
    This paper is an outcome of the National Natural Science Foundation of China project "Research on Evaluation and Prediction of User Contribution Behavior in Online Knowledge Community from the Perspective of Information Ecological Chain" (71974149) and the National Social Science Foundation of China Major Project "Research on Reconstruction and Application of Information Service System Driven by Humanistic Artificial Intelligence" (22&ZD324).

Abstract: [Purpose/Significance] Generative artificial intelligence causes a certain degree of AI information pollution in online communities. Studying multiple methods for identifying AIGC is therefore important for guarding against the negative effects of rapidly evolving generative artificial intelligence. [Design/Methodology] This paper first constructs the HAC data set from multiple online community platforms, chiefly the 54 major topic categories of Sina Weibo; the data set contains 100,873 pieces of information written by humans and by generative artificial intelligence. It then examines whether 6 mainstream deep learning methods and 7 machine learning methods can identify whether information in online communities was written by humans or by generative artificial intelligence. Finally, it proposes a BEM-RCNN method to further improve the precision of AIGC identification. [Findings/Conclusion] The constructed data set shows that generative artificial intelligence has a strong capacity for "human-like expression" and can simulate how humans post and reply on social media platforms. The experimental results show that the proposed method reaches an accuracy of 96.4% and identifies well whether content in online communities was written by humans or by AI. It outperforms 13 other mainstream methods, including BERT, ERNIE, and TextRNN, in precision, recall, F1 score, and accuracy, which verifies its performance advantage. Extensive exploratory experiments also show that current mainstream machine learning methods, although less accurate than the proposed method, can identify part of the AIGC. [Originality/Value] This paper uses multiple methods to identify AIGC on social media, helping to guard against the information pollution that generative artificial intelligence causes on social media platforms.

Key words: Generative artificial intelligence, Artificial Intelligence Generated Content (AIGC), Online communities, Machine learning, AI information pollution
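
The abstract compares methods on precision, recall, F1 score, and accuracy. The block below is a minimal sketch of how these four metrics can be computed for a binary human-versus-AI labelling task. It is not the authors' evaluation code. The names `BinaryMetrics` and `evaluate`, the label convention (1 = AI-generated, 0 = human-written), and the toy labels are all assumptions made for illustration; none of the numbers come from the HAC data set.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    """Hypothetical container for the four metrics named in the abstract."""
    precision: float
    recall: float
    f1: float
    accuracy: float


def evaluate(y_true: Sequence[int], y_pred: Sequence[int],
             positive: int = 1) -> BinaryMetrics:
    """Compute precision, recall, F1, and accuracy for binary labels.

    Assumed convention: `positive` (default 1) marks AI-generated text,
    any other label marks human-written text.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Fall back to 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return BinaryMetrics(precision, recall, f1, accuracy)


if __name__ == "__main__":
    # Toy labels for eight posts (made up for illustration only).
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    preds = [1, 1, 0, 0, 0, 1, 1, 1]
    print(evaluate(truth, preds))
```

Reporting all four metrics rather than accuracy alone matters when the two classes are not equally represented, since a classifier can then score high accuracy while missing most AI-written posts.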