Documentation, Information & Knowledge (图书情报知识) ›› 2023, Vol. 40 ›› Issue (5): 39-49. doi: 10.13366/j.dik.2023.05.039

• Special Topic: Digital Scholarly Infrastructure in the AI Era •

Academic Literature Data Sets Towards the AI Era: Characteristics, Patterns, and Development Directions

ZHANG Tongyang (张彤阳)1, WANG Chuhan (王楚涵)2, YU Chao (俞超)1, XU Jian (徐健)1

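  1. School of Information Management, Sun Yat-sen University, Guangzhou, 510006;
  2. Department of Information Management, Peking University, Beijing, 100091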
  • Publication date: 2023-09-10 Release date: 2023-10-22
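  • Corresponding author: XU Jian (ORCID: 0000-0003-4886-4708), PhD, Professor. Research interests: scientometrics, web data mining, sentiment analysis of web users, interdisciplinary communication analysis. Email: issxj@mail.sysu.edu.cn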
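  • About the authors: ZHANG Tongyang (ORCID: 0000-0003-3538-8343), PhD candidate. Research interests: scientometrics, semantic analysis, interdisciplinary communication analysis. Email: zhangty65@mail2.sysu.edu.cn; WANG Chuhan (ORCID: 0000-0001-9880-5613), PhD candidate. Research interests: scientometrics, knowledge computing, researcher evaluation. Email: wangchuhan@stu.pku.edu.cn; YU Chao (ORCID: 0000-0002-3195-1357), PhD candidate. Research interests: scientometrics, data mining. Email: yuch25@mail3.sysu.edu.cn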
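  • Funding:
    This article is one of the research outcomes of "Guangzhou's Practice and Countermeasures for Talent as the First Resource: Problems and Countermeasures in Matching University Science and Technology Talent with Enterprise Technical Needs" (2023GZYB04), a 2023 General Project under the 14th Five-Year Plan for the Development of Philosophy and Social Sciences in Guangzhou.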

Academic Literature Data Sets Towards the AI Era: Characteristics and Development Direction

ZHANG Tongyang1, WANG Chuhan2, YU Chao1, XU Jian1   

  1. School of Information Management, Sun Yat-sen University, Guangzhou, 510006;
  2. Department of Information Management, Peking University, Beijing, 100091
  • Online: 2023-09-10 Published: 2023-10-22
  • Contact: Correspondence should be addressed to XU Jian, Email: issxj@mail.sysu.edu.cn, ORCID: 0000-0003-4886-4708
  • Supported by:
    This is an outcome of the project "Practice and Countermeasures of Talents as the First Resource in Guangzhou: Research on Matching Enterprise Technical Demands with Talents in Universities" (2023GZYB04), supported by a grant from the 2023 General Program of the Philosophy and Social Sciences Development of Guangzhou during the 14th Five-Year Plan period.

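Abstract: [Purpose/Significance] Successive advances in the application of artificial intelligence technologies are giving data sets an increasingly important role in scientometric research. Data sets have evolved from traditional data collections oriented toward information supply into knowledge resources that support knowledge discovery and the construction of relationship networks. Their functions have developed rapidly, which in turn supports expanding the depth and breadth of scientometric research. [Research Design/Methodology] Using papers published in the journal Scientometrics from 2016 to 2020 as the data source, the study analyzes the overall use of data sets, examines the relationship between how heavily a data set is used and the number of documents it contains, analyzes the characteristics of typical data sets, discusses the impact of artificial intelligence technology on data set work, and looks ahead to future directions for data set construction. [Conclusion/Findings] The frequency with which a data set is used shows a degree of positive correlation with the number of papers it indexes; a single scientometric study tends to use several data sets at once; and scientometric research based on academic literature data sets is becoming increasingly closely tied to artificial intelligence technology. [Originality/Value] By analyzing the characteristics of the data sets used in scientometrics papers, the study aims to summarize how data set construction has developed in recent years and to provide a reference for selecting data sets in scientometric research.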

Keywords: Data sets, Artificial intelligence, Scientometrics, Feature analysis

Abstract: [Purpose/Significance] Driven by the development and application of artificial intelligence technologies, academic literature data sets are playing an increasingly important role in scientometrics. Data sets have developed from traditional data collections oriented toward information supply into knowledge resources that assist knowledge discovery and the construction of relationship networks. Their functions have improved rapidly, which in turn supports expanding the depth and breadth of scientometric research. [Design/Methodology] The article uses papers published in the journal Scientometrics from 2016 to 2020 as the data source. By analyzing how these studies use data sets, it summarizes overall data set usage and explores the relationship between how heavily a data set is used and the number of documents it contains. For typical, frequently used data sets, the article analyzes their characteristics, examines the impact of artificial intelligence technology on data set work, and looks ahead to future directions for data set construction. [Findings/Conclusion] There is a certain positive correlation between how often a data set is used and the number of papers it indexes. A single study tends to use several data sets together. The relationship between scientometric research based on academic literature data sets and artificial intelligence technology is growing increasingly close. [Originality/Value] By analyzing the features of the data sets used in scientometrics papers, this study aims to summarize how data set construction has developed in recent years and to provide a reference for selecting data sets in scientometric research.

Keywords: Data sets, Artificial intelligence, Scientometrics, Feature analysis