This article examines the effectiveness of pre-training a generative model based on a vision transformer and subsequently fine-tuning it for image classification. The main problem addressed is the poor training efficiency of the vision transformer on a limited amount of data. The hypothesis is that the accuracy of an image classification model can be improved by transferring the knowledge obtained from previously training a generative model on the same data. To test the hypothesis, Tiny ImageNet, a subset of the standard ImageNet dataset, was used. It contains 200 categories of around 500 images each, and each image is 64x64 pixels. To pre-train the generative model, image segments are masked in patches. Learning to restore the masked pixels forces the model to attend to the context around the removed part, as well as to general visual patterns. This gives the model a better overall understanding of visual information and helps with subsequent fine-tuning for the classification task. In a series of experiments, image classification accuracy improved from 40% to 44.7%. The paper also analyzes how accuracy is affected by patch size (2x2, 4x4, and 8x8 pixels) and by the percentage of the input image that is masked (20, 40, and 60 percent).
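The patch-masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `mask_patches` and `masked_mse`, the choice to fill hidden patches with zeros, the rounding of the number of masked patches, and the restriction of the reconstruction error to masked pixels are all assumptions made for illustration.

```python
import numpy as np


def mask_patches(image, patch_size=4, mask_ratio=0.4, rng=None):
    """Hide a random subset of non-overlapping square patches.

    image: array of shape (H, W, C); H and W must be divisible by patch_size.
    Returns (masked_image, pixel_mask), where pixel_mask is a boolean (H, W)
    array that is True at every hidden pixel.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    n_patches = grid_h * grid_w
    # Assumption: the number of hidden patches is rounded to the nearest integer.
    n_masked = int(round(mask_ratio * n_patches))

    # Pick the patches to hide, without replacement.
    hidden = rng.choice(n_patches, size=n_masked, replace=False)
    patch_mask = np.zeros(n_patches, dtype=bool)
    patch_mask[hidden] = True

    # Expand the patch-level mask to pixel resolution.
    pixel_mask = (
        patch_mask.reshape(grid_h, grid_w)
        .repeat(patch_size, axis=0)
        .repeat(patch_size, axis=1)
    )

    masked = image.copy()
    masked[pixel_mask] = 0  # Assumption: hidden pixels are filled with zeros.
    return masked, pixel_mask


def masked_mse(pred, target, pixel_mask):
    """Mean squared error over hidden pixels only (an assumed training target)."""
    diff = (pred.astype(np.float64) - target.astype(np.float64))[pixel_mask]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64, 3))  # stand-in for one 64x64 RGB image
    for patch_size in (2, 4, 8):
        for mask_ratio in (0.2, 0.4, 0.6):
            _, mask = mask_patches(image, patch_size, mask_ratio, rng)
            print(patch_size, mask_ratio, int(mask.sum()))
```

With a 64x64 image and 4x4 patches there are 256 patches, so 40 percent masking hides 102 of them under the rounding assumed here. In pre-training, a generative model would receive the masked image and be scored on how well it restores the hidden pixels.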
