Documentation, Information & Knowledge (图书情报知识) ›› 2016, Vol. 0 ›› Issue (3): 80-88. doi: 10.13366/j.dik.2016.03.080

• Intelligence, Information and Sharing •

A Feature Selection Method Based on Dynamic Co-word Network for Microblog Topic Detection

Shang Xianli (商宪丽), Wang Xuedong (王学东)

  • Online: 2016-05-10 Published: 2016-05-10

Abstract:

Microblog texts are short and dynamic, which calls for feature selection methods suited to the clustering algorithms used to detect topics in them. To address this, this paper proposes a new text feature extraction method that uses term co-occurrence to build a dynamic co-word network over time-ordered microblog texts. In this network, edge weights decay linearly over time, and the weight of each text feature is computed from the degree centrality of the network. Experiments on datasets sampled from Sina Weibo show that the dynamic co-word network method is better suited to extracting features from microblog texts than the conventional document frequency method, and that it yields better microblog topic detection results.

Key words: Microblog, Topic detection, Dynamic co-word network, Feature selection, Text clustering
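
The abstract names three steps (build a co-word network from time-ordered posts, decay edge weights linearly, score terms by degree centrality) but does not give the formulas. The sketch below is one plausible reading and should not be taken as the authors' implementation. It assumes that a co-occurrence observed at time `t` contributes `max(0, 1 - (now - t) / window)` to its edge, and that "degree centrality" means the weighted degree, that is, the sum of a term's incident edge weights. The function names and the `window` and `k` parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_dynamic_coword_network(posts, now, window):
    """Return {(term_a, term_b): weight} for (timestamp, tokens) posts.

    Assumption: each co-occurrence fades linearly from 1 (age 0)
    to 0 (age >= window); this decay form is not given in the abstract.
    """
    edges = defaultdict(float)
    for timestamp, tokens in posts:
        age = now - timestamp
        # Linear decay, clamped to the range [0, 1].
        contribution = max(0.0, min(1.0, 1.0 - age / window))
        if contribution == 0.0:
            continue  # fully decayed posts add nothing to the network
        # Every unordered pair of distinct terms in a post co-occurs once.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += contribution
    return dict(edges)


def weighted_degree_centrality(edges):
    """Sum the weights of the edges incident to each term."""
    degree = defaultdict(float)
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


def top_features(posts, now, window, k):
    """Pick the k terms with the highest weighted degree as features."""
    edges = build_dynamic_coword_network(posts, now, window)
    centrality = weighted_degree_centrality(edges)
    # Sort by descending centrality, breaking ties alphabetically.
    return sorted(centrality, key=lambda t: (-centrality[t], t))[:k]


if __name__ == "__main__":
    # Toy, made-up posts: (timestamp, tokenized text).
    sample_posts = [
        (0, ["earthquake", "rescue", "sichuan"]),
        (2, ["earthquake", "rescue"]),
        (3, ["earthquake", "donation"]),
    ]
    print(top_features(sample_posts, now=4, window=4, k=2))
```

In this toy run the oldest post has fully decayed, so its terms contribute nothing, while recent co-occurrences dominate the ranking. A document frequency baseline would instead count every post equally regardless of age.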