Automated extraction of structured information from a variety of web pages

S.D. Pogorilyy; A.A. Kramov

doi:10.15407/pp2018.02.149

Abstract

The paper substantiates the expediency of using structured data extraction methods on sets of HTML pages for information search on the Internet. It analyzes the main methods for extracting structured data from a set of web pages that are generated by a common script from different data sets. The methods are classified by the degree of automation of the template formation process (the extent of user involvement). The operating principles of the main unsupervised methods (RoadRunner, FiVaTech, Trinity) are described in detail, and their advantages and disadvantages are shown. The expediency of using the Trinity method for data extraction, in comparison with the other methods, is substantiated. The problem of selecting, from a set of HTML pages, the input documents from which the method generates a common template is considered. The Trinity method is verified experimentally on a set of HTML pages that represent articles of Ukrainian scientific journals. To build the test set, the web sites are crawled automatically; the search bot is implemented by processing the object model (DOM) of the HTML documents obtained from the web sites. The templates (regular expressions) formed by the Trinity method are applied to the entire set of input HTML pages. The extraction results (structured data about the articles) are exported to a database for further analysis. The results are compared with article data obtained by manual analysis of the object model of the web pages, and the error of the Trinity method on the experimental set of HTML pages is calculated.
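The last steps of the abstract (applying a regular-expression template to every page and measuring the error against manually obtained data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the template is written by hand rather than generated by Trinity, the HTML snippets and field names are invented, and the error is taken simply as the share of pages whose extracted record differs from the manual one, which may not match the measure used in the paper.

```python
import re

# Hypothetical template. In the paper such expressions are produced by the
# Trinity method; this one is hand-written purely for illustration.
TEMPLATE = re.compile(
    r'<h1 class="title">(?P<title>.*?)</h1>\s*'
    r'<p class="authors">(?P<authors>.*?)</p>',
    re.DOTALL,
)


def extract(html):
    """Return the fields captured by the template, or None if it does not match."""
    match = TEMPLATE.search(html)
    return match.groupdict() if match else None


def error_rate(pages, expected):
    """Share of pages whose extracted record differs from the manual record."""
    if not pages:
        raise ValueError("at least one page is required")
    wrong = sum(1 for html, ref in zip(pages, expected) if extract(html) != ref)
    return wrong / len(pages)


# Invented sample pages; the third one deviates from the common layout.
PAGES = [
    '<h1 class="title">Paper A</h1>\n<p class="authors">I. Ivanov</p>',
    '<h1 class="title">Paper B</h1><p class="authors">P. Petrenko</p>',
    '<div class="title">Paper C</div><p class="authors">S. Sydorenko</p>',
]

# Records as they would be obtained by manual analysis of the pages.
EXPECTED = [
    {"title": "Paper A", "authors": "I. Ivanov"},
    {"title": "Paper B", "authors": "P. Petrenko"},
    {"title": "Paper C", "authors": "S. Sydorenko"},
]

if __name__ == "__main__":
    print(f"error rate: {error_rate(PAGES, EXPECTED):.2f}")
```

The third sample page does not follow the common layout, so the template fails on it and the page counts as an extraction error.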

Problems in programming 2018; 2-3: 149-158

Keywords

data extraction; extraction methods; classification of extraction methods; Trinity method; ternary document tree; prefix tree traversal; DOM; automated collection of web pages; search bot; HTML page template; formation of a regular expression

Full Text:

PDF (Ukrainian)

References

POTEBNIA, A. AND POGORILYY, S. (2015) Innovative GPU accelerated algorithm for fast minimum convex hulls computation. Proceedings of the Federated Conference on Computer Science and Information Systems. 5. p. 555-561.

https://doi.org/10.15439/2015F305

POGORILYY, S. AND SHKULIPA, I. (2009) A Conception for Creating a System of Parametric Design of Parallel Algorithms and their Software Implementations. Cybernetics and Systems Analysis. 45 (6). p. 952-958.

https://doi.org/10.1007/s10559-009-9172-7

WORLD WIDE WEB CONSORTIUM (2018) Semantic Web. [Online] Available from: https://www.w3.org/standards/semanticweb [Accessed: 12 February 2018].

W3TECHS - WEB TECHNOLOGY SURVEYS (2017) Usage of structured data formats for websites. [Online] Available from: https://w3techs.com/technologies/overview/structured_data/all [Accessed: 1 February 2018].

PATEL, D. AND THAKKAR, A. (2015) A Survey of Unsupervised Techniques for Web Data Extraction. International Journal Of Computer Science. 6 (2). p. 1-3.

CRESCENZI, V., MECCA, G. AND MERIALDO, P. (2001) RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proceedings of the 27th International Conference on Very Large Data Bases. Rome, Italy, 11-14 September 2001. San Francisco, CA: Morgan Kaufmann Publishers Inc.

KAYED, M. AND CHANG, C.-H. (2010) FiVaTech: Page-level web data extraction from template pages. IEEE Transactions on Knowledge and Data Engineering. 22 (2). p. 249-263.

https://doi.org/10.1109/TKDE.2009.82

SLEIMAN, H.A. AND CORCHUELO, R. (2014) Trinity: On Using Trinary Trees for Unsupervised Web Data Extraction. IEEE Transactions on Knowledge and Data Engineering. 26 (6). p. 1544-1556.

https://doi.org/10.1109/TKDE.2013.161

INSTITUTE FOR INFORMATION RECORDING (2017) Data Rec., Storage & Processing. [Online] Available from: http://www.ipri.kiev.ua/index.php?id=52 [Accessed: 3 January 2018].

SYSTEM RESEARCH AND INFORMATION TECHNOLOGIES (2017) Archives. [Online] Available from: http://journal.iasa.kpi.ua [Accessed: 10 January 2018].

JSOUP: JAVA HTML PARSER (2017) jsoup Java HTML Parser 1.11.2 API. [Online] Available from: https://jsoup.org/apidocs/overview-summary.html [Accessed: 11 January 2018].
