Documentation, Information & Knowledge ›› 2025, Vol. 42 ›› Issue (6): 6-15,27. doi: 10.13366/j.dik.2025.06.006

• Papers on Invitation •

The Core Characteristics, Application Requirements and Construction Approaches of High-Quality Datasets in Philosophy and Social Sciences

XU Yongjun1,2, ZHANG Qunqun1, FU Yu1,2, CHENG Xuhui1   

  1. School of Information Resource Management, Renmin University of China, Beijing, 100872;
  2. Information Center for Social Science, Renmin University of China, Beijing, 100872
  • Online: 2025-11-10 Published: 2026-01-17
  • Contact: Correspondence should be addressed to ZHANG Qunqun, Email: zhangqq@ruc.edu.cn, ORCID: 0009-0002-6054-4096
  • Supported by:
    This work is an outcome of the project "Research on the Adoption Mechanism of Red Archive Culture Dissemination Information Based on Intelligent Computing" (24BTQ015), supported by the National Social Science Foundation of China.

Abstract: [Purpose/Significance] To address the insufficient supply of Chinese corpora and the low quality of annotation, this study focuses on the construction of high-quality datasets in Philosophy and Social Sciences and systematically examines their core characteristics, application requirements and construction approaches, so as to improve data supply capacity and the efficiency of scene adaptation. [Design/Methodology] From the "supply side", this paper constructs a two-layer, multi-dimensional system of "data unit-data collections" and analyzes the core characteristics of high-quality datasets in Philosophy and Social Sciences. From the "demand side", it proposes a three-tier application demand framework of "basic cognition-scene understanding-action planning", which defines the differentiated requirements that different levels of model capability place on data content and quality. The study also designs a full life cycle construction methodology covering data requirements, planning, collection, preprocessing, annotation, and model verification. [Findings/Conclusion] Datasets in Philosophy and Social Sciences have natural advantages in standardization, originality, representativeness and traceability, but face pain points and difficulties in accuracy, diversity, consistency and relevance, and reusability. If these datasets are systematically constructed on the basis of their core characteristics, oriented to application needs, and in line with the data life cycle methodology, the goal of "high-quality" construction is expected to be achievable. [Originality/Value] The research results can provide a theoretical basis and practical reference for the construction of relevant high-quality datasets.

Keywords: Philosophy and Social Sciences, High-quality datasets, Artificial intelligence, Data supply, Data elements