PyTorch ViT weights file: vit_huge_patch14_224_in21k.pth

ViT, ImageNet, CNN, CIFAR10

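ViT turns a 2-D image into a 1-D sequence of patch embeddings, analogous to word embeddings in NLP. The 2-D image is $x \in \mathbb{R}^{H \times W \times C}$, where $(H, W)$ is the image resolution and $C$ is the number of channels.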
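The 2-D to 1-D step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind any particular checkpoint: the helper name `image_to_patches` is hypothetical, the patch size 14 and resolution 224 are taken from the weights filename above, and the learned linear projection that turns each flattened patch into an embedding is left out.

```python
import numpy as np

def image_to_patches(x: np.ndarray, patch: int) -> np.ndarray:
    """Reshape an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = x.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # Split each spatial axis into (grid index, offset inside the patch).
    x = x.reshape(gh, patch, gw, patch, c)
    # Bring the two grid axes to the front: (gh, gw, patch, patch, c).
    x = x.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major patch order.
    return x.reshape(gh * gw, patch * patch * c)

# Assumed example input: a 224 x 224 RGB image split into 14 x 14 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image, patch=14)
print(patches.shape)  # (256, 588)
```

Each of the 256 rows is one patch of 14 * 14 * 3 = 588 values, so the image becomes a sequence that a Transformer can consume token by token.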

Understanding the Context

ViT attention cost depends on the image size H × W, which matters when ViT serves as the backbone for 1024 × 1024 inputs.
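As a rough illustration of how resolution drives attention cost, the sketch below counts patch tokens and the entries of one attention matrix. It assumes plain global self-attention over all patches, ignores the class token, and uses a patch size of 16, which is an assumption chosen because it divides 1024 evenly. The function name `attention_cost` is hypothetical.

```python
def attention_cost(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (number of patch tokens, entries in one attention matrix)."""
    tokens = (height // patch) * (width // patch)
    # Global self-attention compares every token with every other token.
    return tokens, tokens * tokens

for side in (224, 1024):
    tokens, entries = attention_cost(side, side, patch=16)
    print(f"{side}x{side}: {tokens} tokens, {entries} attention entries per head")
# 224x224: 196 tokens, 38416 attention entries per head
# 1024x1024: 4096 tokens, 16777216 attention entries per head
```

Going from 224 to 1024 pixels per side multiplies the token count by about 21 and the attention matrix by about 437, which is why high-resolution backbones often restrict or approximate attention.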

ViT survey, GitHub. In 2020 Google introduced ViT, building on the NLP Tra

3 tokenizers: patch, rand, conv. patch vit patch rand patch conv 2 3 CIFAR10 top-1 accuracy.

ViT CNNs 2. ViT Attention distance

Key Insights

ViT ViT Backbone ViT ZOMI SOTA ViT -> DeiT ->.

ViT: Transformer Encoder, ImageNet ILSVRC-2012, ViT.

ViT: a Transformer takes a sequence as input, so ViT splits the image into patches and flattens each patch into a vector.