Yang Xian, Tang Chao-lan, Li Hang. Text Extraction Based on Text Block Density with Tag Path and Other Features[J]. Journal of Guangdong University of Technology, 2018, 35(2): 51-56.
1. School of Art and Design, Guangdong University of Technology, Guangzhou 510090, China
2. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
Abstract: Most web pages contain a large amount of noisy information in addition to their main content. To address this problem and improve the accuracy of web page content extraction, an extraction method based on text block density with tag path and other features is proposed. The method combines the advantages of text-block-based extraction and tag-path-based extraction. First, the main text block is located according to the density features of the text blocks; next, the tag path method is used to remove the noisy nodes within that block; finally, the content is extracted from the remaining text nodes in the block. This approach addresses two weaknesses of the individual methods: noisy information inside a text block is difficult to filter out, and the tag path method tends to extract long text from noisy blocks. Experiments show that the proposed method outperforms CETR and CETD in most cases.
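The two-stage idea in the abstract (pick the densest text block, then filter its nodes by tag path) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the density score (text characters per tag), the set of block-level tags, the `min_avg_len` threshold, and the rule of dropping tag paths whose text nodes are short on average are all assumptions chosen for illustration, and names such as `BlockParser` and `extract_main_text` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "article", "section", "td"}   # assumed block-level containers
SKIP_TAGS = {"script", "style"}                    # text inside these is never content
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockParser(HTMLParser):
    """Records each text node with its enclosing block id and its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # open elements as (tag, block_id)
        self.tag_count = defaultdict(int)  # block_id -> number of tags in the block
        self.text_nodes = []               # (block_id, tag_path, text)
        self._next_id = 0

    def _current_block(self):
        return self.stack[-1][1] if self.stack else None

    def handle_starttag(self, tag, attrs):
        block = self._current_block()
        if tag in BLOCK_TAGS:              # a block tag opens a new block
            block = self._next_id
            self._next_id += 1
        if block is not None:
            self.tag_count[block] += 1
        if tag not in VOID_TAGS:           # void tags have no closing tag
            self.stack.append((tag, block))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        block = self._current_block()
        tags = [t for t, _ in self.stack]
        if text and block is not None and not SKIP_TAGS.intersection(tags):
            self.text_nodes.append((block, "/".join(tags), text))


def extract_main_text(html, min_avg_len=20):
    parser = BlockParser()
    parser.feed(html)

    # Stage 1 (assumed density): text characters divided by tags in the block.
    chars = defaultdict(int)
    for block, _, text in parser.text_nodes:
        chars[block] += len(text)
    if not chars:
        return ""
    best = max(chars, key=lambda b: chars[b] / parser.tag_count[b])

    # Stage 2 (assumed filter): inside the chosen block, drop every tag path
    # whose text nodes are short on average (typical of links and buttons).
    by_path = defaultdict(list)
    for block, path, text in parser.text_nodes:
        if block == best:
            by_path[path].append(len(text))
    kept = {p for p, lens in by_path.items() if sum(lens) / len(lens) >= min_avg_len}

    return "\n".join(t for b, p, t in parser.text_nodes if b == best and p in kept)


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
<div id="main">
<p>Most web pages mix the article body with navigation links and advertisements.</p>
<p>A density score favours blocks that carry much text under few tags.</p>
<span><a href="/share">Share</a></span>
</div>
<div id="foot"><a href="/terms">Terms</a></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

On the sample page, the navigation and footer blocks score low because nearly every tag carries only a few characters, and the "Share" link inside the main block is removed because its tag path holds only short text. A real extractor would need a more careful density measure and path statistics than this sketch uses.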
HE K D, ZHU Z T, CHENG Y. A Research on Text Classification Method Based on Improved TF-IDF Algorithm[J]. Journal of Guangdong University of Technology, 2016, 33(05): 49-53.
LIU P F, QIU X P, HUANG X J. Adversarial multi-task learning for text classification[C] //Annual meeting of the association for computational linguistics. Vancouver: Transactions of The ACL Journal, 2017: 1-10.
KIM M, KIM Y, SONG W, et al. Main content extraction from web documents using text block context[C] //Database and expert systems applications. Heidelberg: Springer, 2013: 81-93.
LI P, ZHU J B, ZHOU L X. Shopping information extraction method based on rapid construction of template[J]. Journal of Computer Applications, 2014, 34(3): 733-737.
YANG X, HE H W. Internet Text Mining for User Intent Perception[J]. Journal of Guangdong University of Technology, 2017, 34(03): 54-58.
WENINGER T, HSU W H, HAN J. CETR: content extraction via tag ratios[C] //International conference on world wide web. Raleigh: ACM, 2010: 971-980.
WENINGER T, HSU W H. Text extraction from the web via text-to-tag ratio[C] //International workshop on database and expert systems application. Turin: IEEE, 2008: 23-28.
SUN F, SONG D, LIAO L. DOM based content extraction via text density[C] //International ACM SIGIR conference on research and development in information retrieval. Beijing: ACM, 2011: 245-254.
WU G Q, LI L, LI L. Web News Extraction via Tag Path Feature Fusion Using DS Theory[J]. Journal of Computer Science and Technology, 2016, 31(4): 661-672.