Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach

O.V. Palagin; V.Yu. Velychko; K.S. Malakhov; O.S. Shchurov

doi:10.15407/pp2020.02-03.341

Abstract

We design a new technique for distributional semantic modeling that uses a neural network-based approach to learn distributed term representations (term embeddings) and yields term vector space models. The technique is inspired by the recent ontology-related approach to identifying terms (term extraction) and the relations between them (relation extraction), called semantic pre-processing technology (SPT), which draws on different types of contextual knowledge such as syntactic, terminological, and semantic knowledge. Our method relies on automatic term extraction from natural language texts and the subsequent formation of problem-oriented or application-oriented (and deeply annotated) text corpora in which the fundamental entity is the term, including both non-compositional and compositional terms. This lets us move from distributed word representations (word embeddings) to distributed term representations (term embeddings). The main practical result of our work is a development kit (a set of toolkits exposed as web service APIs and a web application) that provides all the routines needed for basic linguistic pre-processing and semantic pre-processing of natural language texts in Ukrainian, for subsequent training of term vector space models.
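The shift from word tokens to term tokens can be illustrated with a minimal sketch. It is not the SPT implementation: it assumes a term list has already been extracted, and the function name `merge_terms`, the underscore joiner, and the sample data are hypothetical choices made for illustration. Each multi-word (compositional) term is collapsed into a single token by greedy longest-match, so that a word2vec-style trainer would treat the term as one vocabulary entry.

```python
from typing import List, Set, Tuple


def merge_terms(
    tokens: List[str],
    terms: Set[Tuple[str, ...]],
    joiner: str = "_",
) -> List[str]:
    """Replace known multi-word terms with single tokens (greedy longest match).

    `terms` is a hypothetical, pre-extracted set of terms, each given as a
    tuple of its constituent tokens.
    """
    max_len = max((len(t) for t in terms), default=1)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in terms:
                merged.append(joiner.join(candidate))
                i += n
                break
        else:
            # No multi-word term starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Illustrative data only (English stand-ins for an extracted term list).
sample_terms = {
    ("vector", "space", "model"),
    ("vector", "space"),
    ("word", "embedding"),
}
sample_tokens = ["vector", "space", "model", "of", "word", "embedding"]
print(merge_terms(sample_tokens, sample_terms))
# ['vector_space_model', 'of', 'word_embedding']
```

Sentences rewritten this way form a corpus whose basic unit is the term, which is the property the abstract describes; the choice of trainer and its hyperparameters is left open here.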

Problems in programming 2020; 2-3: 341-351

Keywords

distributional semantics; vector space model; word embedding; term extraction; term embedding; ontology; ontology engineering
