Methods and software for significant indicators determination of the natural language texts author profile

V.I. Shynkarenko, I.M. Demydovych

Abstract


Methods for forming and optimizing author profiles are presented. An author profile is an image: a vector in a multidimensional space whose components are measurements of the author's texts obtained by several methods based on 4-grams, stemming, recurrence analysis, and formal stochastic grammars. The profile is a model of the author's language, covering vocabulary and sentence-syntax features. The effectiveness of each method is compared. A genetic algorithm is used to form a reduced author profile: insignificant indicators are excluded, which reduces their number by 20%. The reduced profile contains only the attributes that are significant for a given author and serves as an effective means of attributing texts to that author.
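For illustration only, the profile-and-reduction idea can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' software: it uses only character 4-gram relative frequencies as indicators (the stemming, recurrence-analysis and stochastic-grammar indicators are omitted), Euclidean distance for attribution, and a basic elitist genetic algorithm over a 0/1 mask. All function names, the fitness function and the parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def four_gram_profile(text, vocabulary):
    """Relative frequencies of the given character 4-grams in `text`."""
    grams = [text[i:i + 4] for i in range(len(text) - 3)]
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero on very short texts
    return [counts[g] / total for g in vocabulary]


def masked_distance(p, q, mask):
    """Euclidean distance over the indicators kept by the 0/1 mask."""
    return math.sqrt(sum(m * (a - b) ** 2 for a, b, m in zip(p, q, mask)))


def attribute(profile, author_profiles, mask):
    """Name of the author whose profile is nearest under the mask."""
    return min(author_profiles,
               key=lambda name: masked_distance(profile, author_profiles[name], mask))


def fitness(mask, samples, author_profiles):
    """Correct attributions, minus a small penalty per retained indicator (assumed form)."""
    correct = sum(attribute(p, author_profiles, mask) == label for p, label in samples)
    return correct - 0.01 * sum(mask)


def reduce_profile(samples, author_profiles, n, generations=30, pop_size=20, seed=0):
    """Elitist genetic algorithm over 0/1 masks of length n (n >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    population[0] = [1] * n  # the full, unreduced profile is always a candidate
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, samples, author_profiles), reverse=True)
        parents = population[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1       # single-bit mutation
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: fitness(m, samples, author_profiles))


if __name__ == "__main__":
    vocab = ["the ", " and", "tion", "ing "]   # toy indicator set
    authors = {
        "A": four_gram_profile("the cat and the dog and the bird", vocab),
        "B": four_gram_profile("running jumping station nation", vocab),
    }
    labelled = [
        (four_gram_profile("the fox and the hen", vocab), "A"),
        (four_gram_profile("walking motion singing ", vocab), "B"),
    ]
    print(reduce_profile(labelled, authors, len(vocab)))
```

Because the better half of each generation is carried over unchanged, the returned mask never scores worse than the full profile under this fitness function; the size of the reduction depends entirely on the assumed penalty weight and on the data.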

Problems in programming 2023; 3: 22-29





Keywords


natural language texts; authorship determination; genetic algorithm; recurrent analysis; statistical analysis; text classification; pattern recognition; formal grammars





DOI: https://doi.org/10.15407/pp2023.03.022
