Documentation, Information & Knowledge (图书情报知识), 2025, Vol. 42, Issue 6: 6-15, 27. doi: 10.13366/j.dik.2025.06.006

• Invited Article •

哲学社会科学高质量数据集的核心特征、应用需求与建设进路

徐拥军1,2, 张群群1, 傅予1,2, 成徐慧1   

  1. 中国人民大学信息资源管理学院,北京,100872;
    2.中国人民大学书报资料中心,北京,100872
  • 出版日期:2025-11-10 发布日期:2026-01-17
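  • Corresponding author: ZHANG Qunqun (ORCID: 0009-0002-6054-4096), PhD candidate; research interests: fundamental theory of archival science, scholarly communication; Email: zhangqq@ruc.edu.cn.
  • About the authors: XU Yongjun (ORCID: 0000-0002-1180-7358), PhD, professor; research interests: fundamental theory of archival science, academic publishing and academic evaluation; Email: xyj@ruc.edu.cn. FU Yu (ORCID: 0000-0002-9097-7477), PhD, associate professor; research interests: data management, academic evaluation; Email: yu.fu@ruc.edu.cn. CHENG Xuhui (ORCID: 0009-0008-5841-1132), master's student; research interests: fundamental theory of archival science; Email: 2025103686@ruc.edu.cn.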
  • 基金资助:
    本文系国家社会科学基金项目“基于智能计算的红色档案文化传播信息采纳机制研究”(24BTQ015)的研究成果之一。

The Core Characteristics, Application Requirements and Construction Approaches of High-Quality Datasets in Philosophy and Social Sciences

XU Yongjun1,2, ZHANG Qunqun1, FU Yu1,2, CHENG Xuhui1   

  1. School of Information Resource Management, Renmin University of China, Beijing, 100872;
  2. Information Center for Social Science, Renmin University of China, Beijing, 100872
  • Online: 2025-11-10 Published: 2026-01-17
  • Contact: Correspondence should be addressed to ZHANG Qunqun, Email: zhangqq@ruc.edu.cn, ORCID: 0009-0002-6054-4096
  • Supported by:
    This is an outcome of the project "Research on the Adoption Mechanism of Red Archive Culture Dissemination Information Based on Intelligent Computing" (24BTQ015), supported by the National Social Science Foundation of China.

摘要: [目的/意义] 针对当前中文语料供给不足、标注质量不高等问题,聚焦哲学社会科学领域高质量数据集建设,对其核心特征、应用需求与建设进路展开系统研究,旨在提升数据供给能力与场景适配效能。[研究设计/方法] 从“供给侧”视角,构建“数据单元—数据集合”双层多维体系,解析哲学社会科学高质量数据集的核心特征;从“需求侧”视角,提出“基础认知—场景理解—行动规划”三层应用需求框架,明确不同层级模型能力对数据内容与质量的差异化要求。设计覆盖数据需求—规划—采集—预处理—标注—模型验证的全生命周期建设方法论。[结论/发现] 哲学社会科学数据集在规范性、原创性、代表性、可追溯性等维度具有天然优势,但在准确性、多样性、一致性与相关性、可复用性等维度存在痛点难点。立足其核心特征,面向应用需求,按照数据生命周期方法论进行系统构建,有望实现“高质量”建设目标。[创新/价值] 研究结果可为相关高质量数据集建设提供理论依据和实践参考。

关键词: 哲学社会科学, 高质量数据集, 人工智能, 数据供给, 数据要素

Abstract: [Purpose/Significance] To address the insufficient supply of Chinese corpora and the low quality of annotation, this study focuses on the construction of high-quality datasets in Philosophy and Social Sciences and systematically examines their core characteristics, application requirements, and construction approaches, with the aim of improving data supply capacity and scenario adaptation. [Design/Methodology] From the supply-side perspective, the paper constructs a two-layer, multi-dimensional system of "data unit - data collection" and analyzes the core characteristics of high-quality datasets in Philosophy and Social Sciences. From the demand-side perspective, it proposes a three-tier application requirements framework of "basic cognition - scenario understanding - action planning", which clarifies the differentiated requirements that each level of model capability places on data content and quality. The study also designs a full life cycle construction methodology covering data requirements, planning, collection, preprocessing, annotation, and model validation. [Findings/Conclusion] Datasets in Philosophy and Social Sciences have inherent advantages in standardization, originality, representativeness, and traceability, but face pain points and difficulties in accuracy, diversity, consistency and relevance, and reusability. Constructing these datasets systematically, grounded in their core characteristics, oriented toward application requirements, and following the data life cycle methodology, is expected to achieve the goal of "high-quality" construction. [Originality/Value] The results can provide a theoretical basis and practical reference for the construction of related high-quality datasets.

Keywords: Philosophy and Social Sciences, High-quality datasets, Artificial intelligence, Data supply, Data elements