A method for extracting data from semi-structured documents

K.A. Kudim, G.Yu. Proskudina

Abstract


A linguistic method for extracting data from weakly structured documents is developed, tested, and described in detail. Sample data were taken from the thesis catalogue of the Vernadsky National Library of Ukraine. All stages are described in sequence: choosing the document collection; preparing the documents; writing grammar rules for extracting data from text; writing rules for morphology verification; creating interpretation rules that bind the parsed results to data fields; and analyzing the parsing results. The linguistic method showed a number of advantages over the regular-expression method of data extraction described earlier.
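The stages listed in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample record, the `AUTHOR_RULE` pattern, the `Author` fields, and the predicate functions are all hypothetical, the morphology check is a toy stand-in for a real morphological analyzer, and the matcher handles only a flat token pattern rather than a full context-free grammar.

```python
import re
from dataclasses import dataclass

# Hypothetical catalogue line, invented for illustration only.
RECORD = "Ivanenko P. I. Methods of text analysis : thesis ... Kyiv, 2010"


def tokenize(text):
    """Document preparation: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


# Toy stand-ins for morphology verification. A real analyzer would
# consult a dictionary instead of checking capitalization and length.
def is_surname(token):
    return token.isalpha() and token[:1].isupper() and len(token) > 2


def is_initial(token):
    return len(token) == 1 and token.isupper()


def is_dot(token):
    return token == "."


# Grammar rule (assumed shape): Surname Initial "." Initial "."
# Each element pairs an optional field name with a token predicate.
AUTHOR_RULE = [
    ("surname", is_surname),
    ("first", is_initial),
    (None, is_dot),
    ("middle", is_initial),
    (None, is_dot),
]


@dataclass
class Author:
    """Interpretation target: the record that matched tokens are bound to."""
    surname: str
    first: str
    middle: str


def match_rule(tokens, rule):
    """Slide the rule over the tokens; return the field bindings of the
    first window in which every predicate holds, or None if none does."""
    for start in range(len(tokens) - len(rule) + 1):
        window = tokens[start:start + len(rule)]
        if all(pred(tok) for (_, pred), tok in zip(rule, window)):
            return {name: tok for (name, _), tok in zip(rule, window) if name}
    return None


def extract_author(text):
    """Run all stages: prepare, parse, verify, and bind to an Author."""
    bindings = match_rule(tokenize(text), AUTHOR_RULE)
    return Author(**bindings) if bindings else None
```

Calling `extract_author(RECORD)` binds the surname and initials of the invented record to an `Author` object, and a text with no matching token sequence yields `None`. Keeping the rule as data separate from the matcher mirrors the division between writing grammar rules and analyzing parsing results.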

Problems in programming 2020; 1: 25-32


Keywords


weakly structured documents; information extraction; linguistic analyzer; syntactic analyzer; morphological analysis; context-free grammar

References


KUDIM K.A., PROSKUDINA G.YU. (2019). Methods and tools for extracting personal data from theses abstracts. Problems in programming. [online – pp.isofts.kiev.ua] (2). P. 38–46. (in Russian). Available from: http://pp.isofts.kiev.ua/ojs1/article/view/359 [Accessed 6/05/2019]. DOI: https://doi.org/10.15407/pp2019.02.038

RUBAILO A.V., KOSENKO M.Y. Software tools for information extraction from natural-language texts. Almanac of modern science and education. № 12 (114) 2016. P. 87-92. (in Russian). http://scjournal.ru/articles/issn_1993-5552_2016_12_23.pdf

KUKUSHKIN A. Natasha - a library for extracting structured information from texts in Russian. (in Russian). https://habr.com/ru/post/349864/

EARLEY J. An efficient context-free parsing algorithm. Communications of the Association for Computing Machinery, 13:2:94-102, 1970.

