Modelling videocard memory performance for LLM neural networks
Abstract
The paper analyzes the performance characteristics of the Generative Pre-trained Transformer (GPT) class of neural network algorithms, also known as Large Language Models, on contemporary videocards. The goal is to assess the limitations of applying this class of neural networks on mobile computing platforms. Networks of this class are attractive as a control-system tool, but on larger text corpora their performance degrades and the required computing resources grow quickly, so we need to determine whether a network of this type, trained on a smaller text corpus, is a feasible tool for devices with comparatively low computing capability. The performance investigation was carried out with the GPGPU-Sim simulator, which can be freely configured as a virtual videocard of any computing capability. Since the computations of this neural network reduce to a sequence of matrix multiplications whose performance is limited by memory bandwidth, we analyze the behavior statistics of the different cache memory levels of the videocard processor and the interaction of the cache with main memory. With GPGPU-Sim we gathered statistics for different GPT-2 configurations, from the small to the XL configuration. The level 2 cache access statistics, level 2 cache misses, and the number of main memory accesses show that even for the middle-level configurations the level 2 cache miss rate exceeds 7-8%. This figure is high and indicates that the cache memory is too small for executing these network configurations; there is also substantial traffic from the cache memory to main memory. However, the minimal (so-called small) configuration can be computed faster and with moderate resources, and can therefore be used as part of a decision-making system on computing platforms with moderate performance and resources, provided the text corpus is limited. This opens good prospects for using this type of neural network for autonomous decision making.
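To make the memory-bandwidth argument concrete, the sketch below estimates the per-token matrix-multiplication work and weight traffic for the four GPT-2 configurations mentioned in the abstract. The layer counts and model widths are the published GPT-2 hyperparameters; the traffic model (every weight matrix streamed from memory once per generated token, FP16 storage, batch size 1) is a simplifying assumption for illustration, not the paper's measured GPGPU-Sim statistics.

```python
# Rough per-token estimate of matmul FLOPs and weight traffic for GPT-2 sizes.
# Assumptions: single-token decoding, FP16 weights, each weight read once per token.

CONFIGS = {            # (layers, d_model) -- published GPT-2 hyperparameters
    "small":  (12,  768),
    "medium": (24, 1024),
    "large":  (36, 1280),
    "xl":     (48, 1600),
}

BYTES_PER_PARAM = 2    # FP16

def per_token_estimate(layers: int, d: int):
    # Weight matrices in the matmuls of one transformer block:
    # QKV projection (d x 3d), attention output (d x d), MLP (d x 4d and 4d x d).
    weights_per_layer = d * 3 * d + d * d + 2 * (d * 4 * d)   # = 12 * d^2
    params = layers * weights_per_layer
    flops = 2 * params                      # one multiply-add per weight per token
    bytes_moved = params * BYTES_PER_PARAM  # weights streamed from memory once
    return flops, bytes_moved

for name, (layers, d) in CONFIGS.items():
    flops, traffic = per_token_estimate(layers, d)
    # Arithmetic intensity in FLOPs per byte of weight traffic; values near 1
    # mean the kernels are bandwidth-bound rather than compute-bound.
    print(f"{name:>6}: {flops/1e9:6.2f} GFLOP/token, "
          f"{traffic/2**20:7.1f} MiB weights/token, "
          f"intensity ~ {flops/traffic:.1f} FLOP/byte")
```

Under these assumptions the arithmetic intensity stays around 1 FLOP per byte for every configuration, which is why the per-token matmuls stress the cache hierarchy and main-memory bandwidth rather than the arithmetic units, consistent with the L2 miss rates reported in the paper.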
Problems in programming 2024; 2-3: 37-44