Notes on the Full ViT Pipeline, with a Detailed Code Walkthrough

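1. Course Introduction

Vision Transformer is one of the most active and popular techniques in deep learning in recent years. This course is taught by Dr. Zhu Yu (朱欤), a researcher at the Deep Learning Lab of Baidu Research. Through illustrated theory, hand-derived formulas, and code typed line by line from scratch, it walks you through implementing state-of-the-art vision Transformer algorithms. Over the ten Vision Transformer lectures, you learn to turn the model diagrams in papers into code step by step, build your own deep learning model from scratch, and practice the latest techniques instead of stopping at git clone and calling libraries.

Learning Vision Transformer from Scratch

PaddleViT GitHub repository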

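2. Course Notes

2.1 Overall Structure of ViT

ViT is a sequential stack of Encoder modules, and the core of each Encoder module is Multi-Head Attention. The input has shape [N, C, H, W] and the output has shape [N, num_classes].

2.2 Building the ViT Network

Three classes need to be built: PatchEmbedding, Encoder, and Classify. The Encoder in turn relies on two more classes, Attention (Multi-Head Attention) and Mlp.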

2.3 Attention Formula
$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$
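
The formula above can be sketched for a single head in plain Python. This is a minimal illustration, not the course implementation: the helper names `softmax` and `attention` and the toy 2x2 inputs are assumptions made here, and nested lists stand in for tensors so that the sketch runs without Paddle.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head.

    q, k: lists of token vectors with d_k entries each.
    v:    list of value vectors, one per key.
    """
    d_k = len(k[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q_i in q:
        # One score per key: (q_i . k_j) / sqrt(d_k)
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Toy example: two tokens, d_k = 2 (hypothetical values)
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Each output row is a convex combination of the rows of `v`, weighted towards the value whose key best matches the query.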

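2.4 Feature Map

Patch Embedding is implemented with a convolution, and Self-Attention with fully connected layers. The resulting token sequence has shape [N, H * W + 1, embed_dim], where H and W are the height and width of the patch grid after the convolution; the +1 comes from the Class Token.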
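
The shape arithmetic can be checked with a small helper. This is a sketch under the default settings used later in these notes (224x224 input, 16x16 patches, embed_dim 768); the function name `vit_token_shape` is hypothetical.

```python
def vit_token_shape(batch, image_size=224, patch_size=16, embed_dim=768):
    """Shape of the token sequence after patch embedding."""
    grid = image_size // patch_size   # patches per side, e.g. 224 // 16 = 14
    n_patches = grid * grid           # 14 * 14 = 196
    return [batch, n_patches + 1, embed_dim]  # +1 for the class token

shape = vit_token_shape(1)
print(shape)  # [1, 197, 768]
```

The value 197 is the sequence length that appears throughout the model summary at the end of these notes.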

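2.5 BatchNorm and LayerNorm

ViT uses LayerNorm (LN) for normalization. It differs from BatchNorm (BN) in the axis over which the statistics are computed: LN normalizes each token over its feature dimension, while BN normalizes each feature over the samples in the batch.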
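
The difference in normalization axis can be illustrated on a tiny 2x3 batch. This is a simplified sketch: it omits the learnable scale and shift and BatchNorm's running statistics, and the function names are assumptions made for this example.

```python
import math

def normalize(values, eps=1e-6):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    # One mean/variance per sample: normalize along each row (feature axis).
    return [normalize(row) for row in batch]

def batch_norm(batch):
    # One mean/variance per feature: normalize along each column (batch axis).
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

# Two samples with three features each (hypothetical values)
x = [[1.0, 2.0, 3.0],
     [10.0, 20.0, 30.0]]
ln = layer_norm(x)   # both rows become roughly [-1.22, 0.0, 1.22]
bn = batch_norm(x)   # roughly [[-1, -1, -1], [1, 1, 1]]
```

LN gives the same result for a sample regardless of what else is in the batch, which is one reason it suits variable batch sizes and token sequences.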

3. Course Code

class ViT

import paddle
import paddle.nn as nn

class ViT(nn.Layer):
    def __init__(self,
                 image_size=224,
                 patch_size=16,
                 in_channels=3,
                 num_classes=1000,
                 embed_dim=768,
                 depth=12,
                 num_heads=12,
                 mlp_ratio=4,
                 qkv_bias=True,
                 dropout=0.,
                 attention_dropout=0.,):
        super(ViT, self).__init__()
        # create patch embedding with positional embedding
        self.patch_embedding = PatchEmbedding(image_size, 
                                              patch_size, 
                                              in_channels, 
                                              embed_dim, 
                                              dropout)

        # create the encoder: a stack of multi-head self-attention layers
        self.encoder = Encoder( embed_dim,
                                num_heads, 
                                qkv_bias,
                                mlp_ratio,
                                dropout, 
                                attention_dropout,
                                depth )

        #classifier head for num classes
        self.classifier = Classify(embed_dim, dropout, num_classes)

    def forward(self, x):
        # input [N, C, H', W']
        x = self.patch_embedding(x) #[N, H * W + 1, embed_dim]
        x = self.encoder(x)         #[N, H * W + 1, embed_dim]
        x = self.classifier(x[:, 0, :])      # classify from the class token: [N, num_classes]

        return x

3.1 class PatchEmbedding

class PatchEmbedding(nn.Layer):
    def __init__(self,
                image_size = 224,
                patch_size = 16,
                in_channels = 3,
                embed_dim = 768,
                dropout = 0.):
        super(PatchEmbedding, self).__init__()

        n_patches = (image_size // patch_size) * (image_size // patch_size) # 14 * 14 = 196 patches

        self.patch_embedding = nn.Conv2D(in_channels = in_channels,
                                         out_channels = embed_dim,
                                         kernel_size = patch_size,
                                         stride = patch_size)
        
        self.dropout=nn.Dropout(dropout)

        #add class token
        self.cls_token = paddle.create_parameter(
                                        shape = [1, 1, embed_dim],
                                        dtype = 'float32',
                                        default_initializer = paddle.nn.initializer.Constant(0)
                                        # constant initializer: value=0, shape=[1, 1, 768]
                                        )

        #add position embedding
        self.position_embeddings = paddle.create_parameter(
                                        shape = [1, n_patches + 1, embed_dim],
                                        dtype = 'float32',
                                        default_initializer = paddle.nn.initializer.TruncatedNormal(std = 0.02)
                                        # truncated normal (Gaussian) initializer
                                        )

    def forward(self, x):
        x = self.patch_embedding(x) # conv layer: [N, C, H', W'] -> [N, embed_dim, H, W]
        x = x.flatten(2)            #[N, embed_dim, H * W]
        x = x.transpose([0, 2, 1])  #[N, H * W, embed_dim]

        cls_token = self.cls_token.expand((x.shape[0], -1, -1)) #[N, 1, embed_dim]
        x = paddle.concat((cls_token, x), axis = 1)             #[N, H * W + 1, embed_dim]
        x = x + self.position_embeddings                        #[N, H * W + 1, embed_dim]
        x = self.dropout(x)

        return x
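
A convolution whose kernel_size equals its stride, as in the PatchEmbedding class above, cuts the image into non-overlapping patches and applies the same linear projection to each one. The following plain-Python sketch illustrates this on a single-channel 4x4 image; the helper names, the toy weights, and the use of nested lists instead of tensors are assumptions for illustration only.

```python
def split_into_patches(image, patch_size):
    """image: H x W nested list (single channel). Returns flattened patches in row-major order."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

def project(patches, weight, bias):
    """Shared linear projection: weight is embed_dim x patch_len, bias has embed_dim entries."""
    return [[sum(w * p for w, p in zip(row, patch)) + b
             for row, b in zip(weight, bias)]
            for patch in patches]

# 4x4 image with values 0..15, split into four 2x2 patches
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
# Two toy "kernels": pick the top-left pixel, and average the patch
weight = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]]
tokens = project(patches, weight, [0.0, 0.0])  # 4 tokens, embed_dim = 2
```

Each row of `weight` plays the role of one convolution kernel, so `tokens` corresponds to the flattened and transposed convolution output of shape [H * W, embed_dim].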

3.2 class Encoder

class Encoder(nn.Layer):
    def __init__(self,
                 embed_dim,
                 num_heads, 
                 qkv_bias,
                 mlp_ratio,
                 dropout, 
                 attention_dropout,
                 depth):
        super(Encoder, self).__init__()
        layer_list = []
        for i in range(depth):
            encoder_layer = EncoderLayer(embed_dim,
                                        num_heads, 
                                        qkv_bias,
                                        mlp_ratio,
                                        dropout, 
                                        attention_dropout)
            layer_list.append(encoder_layer)
        self.layers = nn.LayerList(layer_list)# or nn.Sequential(*layer_list)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        x = self.norm(x)
        return x

class EncoderLayer(nn.Layer):
    def __init__(self, 
                 embed_dim,
                 num_heads, 
                 qkv_bias,
                 mlp_ratio,
                 dropout, 
                 attention_dropout
                 ):
        super(EncoderLayer, self).__init__()
        #Multi Head Attention & LayerNorm
        w_attr_1, b_attr_1 = self._init_weights()
        self.attn_norm = nn.LayerNorm(embed_dim, 
                                      weight_attr = w_attr_1,
                                      bias_attr = b_attr_1,
                                      epsilon = 1e-6)
        self.attn = Attention(embed_dim,
                              num_heads,
                              qkv_bias,
                              dropout,
                              attention_dropout)

        #MLP & LayerNorm
        w_attr_2, b_attr_2 = self._init_weights()
        self.mlp_norm = nn.LayerNorm(embed_dim,
                                     weight_attr = w_attr_2,
                                     bias_attr = b_attr_2,
                                     epsilon = 1e-6)
        self.mlp = Mlp(embed_dim, mlp_ratio, dropout)

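    def _init_weights(self):
        # LayerNorm should start close to an identity transform: scale (weight) = 1, shift (bias) = 0.
        # A zero scale would discard the normalized input and output only the bias.
        weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0))
        bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0))
        return weight_attr, bias_attr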

    def forward(self, x):
        h = x                   #[N, H * W + 1, embed_dim]
        x = self.attn_norm(x)   #Attention LayerNorm
        x = self.attn(x)        #[N, H * W + 1, embed_dim]
        x = h + x               #Add

        h = x                   #[N, H * W + 1, embed_dim]
        x = self.mlp_norm(x)    #MLP LayerNorm
        x = self.mlp(x)         #[N, H * W + 1, embed_dim]
        x = h + x               #[Add]
        return x

3.2.1 class Attention

class Attention(nn.Layer):
    def __init__(self,
                 embed_dim, 
                 num_heads, 
                 qkv_bias, 
                 dropout, 
                 attention_dropout):
        super(Attention, self).__init__()
        self.num_heads = num_heads
        self.attn_head_size = int(embed_dim / self.num_heads)
        self.all_head_size = self.attn_head_size * self.num_heads
        self.scales = self.attn_head_size ** -0.5

        #calculate qkv
        w_attr_1, b_attr_1 = self._init_weights()
        self.qkv = nn.Linear(embed_dim, 
                             self.all_head_size * 3, # weight for Q K V
                             weight_attr = w_attr_1,
                             bias_attr = b_attr_1 if qkv_bias else False)

        #calculate proj
        w_attr_2, b_attr_2 = self._init_weights()
        self.proj = nn.Linear(embed_dim,
                              embed_dim, 
                              weight_attr=w_attr_2,
                              bias_attr=b_attr_2)

        self.attn_dropout = nn.Dropout(attention_dropout)
        self.proj_dropout = nn.Dropout(dropout)
        self.softmax = nn.Softmax(axis=-1)

    def _init_weights(self):
        weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
        bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
        return weight_attr, bias_attr

    def transpose_multihead(self, x):
        #input size  [N, ~, embed_dim]
        new_shape = x.shape[0:2] + [self.num_heads, self.attn_head_size]
        #reshape size[N, ~, head, head_size]
        x = x.reshape(new_shape)
        x = x.transpose([0, 2, 1, 3])
        #transpose   [N, head, ~, head_size]
        return x

    def forward(self, x):
        #input x = [N, H * W + 1, embed_dim]
        qkv = self.qkv(x).chunk(3, axis = -1)           # list of 3 tensors, each [N, ~, embed_dim]
        q, k, v = map(self.transpose_multihead, qkv)    #[N, head, ~, head_size]
        
        attn = paddle.matmul(q, k, transpose_y = True)  #[N, head, ~, ~]
        attn = self.softmax(attn * self.scales)         # softmax(Q K^T / sqrt(d_k))
        attn = self.attn_dropout(attn)                  #[N, head, ~, ~]
        
        z = paddle.matmul(attn, v)                      #[N, head, ~, head_size]
        z = z.transpose([0, 2, 1, 3])                   #[N, ~, head, head_size]
        new_shape = z.shape[0:2] + [self.all_head_size]
        z = z.reshape(new_shape)                        #[N, ~, embed_dim]
        z = self.proj(z)                                #[N, ~, embed_dim]
        z = self.proj_dropout(z)                        #[N, ~, embed_dim]

        return z
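
The reshape and transpose in transpose_multihead amount to giving each head a contiguous slice of every token's embedding. A minimal sketch for a single sample, with the batch dimension dropped and nested lists instead of tensors (the names `split_heads` and `merge_heads` are hypothetical):

```python
def split_heads(x, num_heads):
    """x: list of token vectors of length embed_dim. Returns [head][token][head_size]."""
    embed_dim = len(x[0])
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_size = embed_dim // num_heads
    return [[tok[h * head_size:(h + 1) * head_size] for tok in x]
            for h in range(num_heads)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate the head slices of each token."""
    n_tokens = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(n_tokens)]

# Two tokens with embed_dim = 6, split across 3 heads of size 2
x = [[0, 1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10, 11]]
heads = split_heads(x, 3)
restored = merge_heads(heads)
```

With the defaults in these notes (embed_dim 768, num_heads 12) each head works on 64 features, and merging the heads back corresponds to the final transpose and reshape before the output projection.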

3.2.2 class Mlp

class Mlp(nn.Layer):
    def __init__(self,
                 embed_dim,
                 mlp_ratio,
                 dropout):
        super(Mlp, self).__init__()
        #fc1
        w_attr_1, b_attr_1 = self._init_weights()
        self.fc1 = nn.Linear(embed_dim, 
                            int(embed_dim * mlp_ratio), 
                            weight_attr = w_attr_1, 
                            bias_attr = b_attr_1)
        #fc2
        w_attr_2, b_attr_2 = self._init_weights()
        self.fc2 = nn.Linear(int(embed_dim * mlp_ratio),
                            embed_dim, 
                            weight_attr = w_attr_2, 
                            bias_attr = b_attr_2)

        self.act = nn.GELU()  # activation; the course notes rank GELU > ELU > ReLU > sigmoid
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def _init_weights(self):
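        # Xavier initialization aims to keep the gradient scale consistent across layers;
        # XavierNormal samples from a normal distribution, XavierUniform from a uniform one.
        weight_attr = paddle.ParamAttr(
            initializer=paddle.nn.initializer.XavierUniform())
        bias_attr = paddle.ParamAttr(
            initializer=paddle.nn.initializer.Normal(std=1e-6)) # bias drawn from a normal distribution with std=1e-6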
        return weight_attr, bias_attr

    def forward(self, x):
        x = self.fc1(x)         #[N, ~, embed_dim * mlp_ratio]
        x = self.act(x)
        x = self.dropout1(x)
        x = self.fc2(x)         #[N, ~, embed_dim]
        x = self.dropout2(x)
        return x

3.3 class Classify

class Classify(nn.Layer):
    def __init__(self, embed_dim, dropout, num_classes):
        super(Classify, self).__init__()
        #fc1
        w_attr_1, b_attr_1 = self._init_weights()
        self.fc1 = nn.Linear(embed_dim, 
                            embed_dim,
                            weight_attr = w_attr_1,
                            bias_attr = b_attr_1)
        #fc2
        w_attr_2, b_attr_2 = self._init_weights()
        self.fc2 = nn.Linear(embed_dim, 
                            num_classes,
                            weight_attr = w_attr_2,
                            bias_attr = b_attr_2)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.relu = nn.ReLU()  

    def _init_weights(self):
        weight_attr = paddle.ParamAttr(
            initializer=paddle.nn.initializer.KaimingUniform())
        bias_attr = paddle.ParamAttr(
            initializer=paddle.nn.initializer.KaimingUniform())
        return weight_attr, bias_attr

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.dropout2(x)
        return x

def main():
        ins = paddle.randn([1, 3, 224, 224])
        model = ViT()
        out = model(ins)
        print(out.shape)
        paddle.summary(model, (1, 3, 224, 224))

if __name__ == "__main__":
    main()

[1, 1000]
----------------------------------------------------------------------------
  Layer (type)       Input Shape          Output Shape         Param #    
============================================================================
    Conv2D-1      [[1, 3, 224, 224]]    [1, 768, 14, 14]       590,592    
   Dropout-1       [[1, 197, 768]]       [1, 197, 768]            0       
PatchEmbedding-1  [[1, 3, 224, 224]]     [1, 197, 768]         152,064    
  LayerNorm-1      [[1, 197, 768]]       [1, 197, 768]          1,536     
    Linear-1       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-1     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-2     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
    Linear-2       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-3       [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-1      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-2      [[1, 197, 768]]       [1, 197, 768]          1,536     
    Linear-3       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-1        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-4       [[1, 197, 3072]]      [1, 197, 3072]           0       
    Linear-4       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-5       [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-1         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-1    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-3      [[1, 197, 768]]       [1, 197, 768]          1,536     
    Linear-5       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-2     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-6     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
    Linear-6       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-7       [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-2      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-4      [[1, 197, 768]]       [1, 197, 768]          1,536     
    Linear-7       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-2        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-8       [[1, 197, 3072]]      [1, 197, 3072]           0       
    Linear-8       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-9       [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-2         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-2    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-5      [[1, 197, 768]]       [1, 197, 768]          1,536     
    Linear-9       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-3     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-10    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-10       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-11      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-3      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-6      [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-11       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-3        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-12      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-12       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-13      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-3         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-3    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-7      [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-13       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-4     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-14    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-14       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-15      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-4      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-8      [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-15       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-4        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-16      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-16       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-17      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-4         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-4    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-9      [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-17       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-5     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-18    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-18       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-19      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-5      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-10     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-19       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-5        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-20      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-20       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-21      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-5         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-5    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-11     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-21       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-6     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-22    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-22       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-23      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-6      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-12     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-23       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-6        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-24      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-24       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-25      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-6         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-6    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-13     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-25       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-7     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-26    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-26       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-27      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-7      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-14     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-27       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-7        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-28      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-28       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-29      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-7         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-7    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-15     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-29       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-8     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-30    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-30       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-31      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-8      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-16     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-31       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-8        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-32      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-32       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-33      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-8         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-8    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-17     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-33       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-9     [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-34    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-34       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-35      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-9      [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-18     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-35       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
     GELU-9        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-36      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-36       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-37      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-9         [[1, 197, 768]]       [1, 197, 768]            0       
 EncoderLayer-9    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-19     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-37       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-10    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-38    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-38       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-39      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-10     [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-20     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-39       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
    GELU-10        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-40      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-40       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-41      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-10        [[1, 197, 768]]       [1, 197, 768]            0       
EncoderLayer-10    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-21     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-41       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-11    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-42    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-42       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-43      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-11     [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-22     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-43       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
    GELU-11        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-44      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-44       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-45      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-11        [[1, 197, 768]]       [1, 197, 768]            0       
EncoderLayer-11    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-23     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-45       [[1, 197, 768]]       [1, 197, 2304]       1,771,776   
   Softmax-12    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Dropout-46    [[1, 12, 197, 197]]   [1, 12, 197, 197]          0       
   Linear-46       [[1, 197, 768]]       [1, 197, 768]         590,592    
   Dropout-47      [[1, 197, 768]]       [1, 197, 768]            0       
  Attention-12     [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-24     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Linear-47       [[1, 197, 768]]       [1, 197, 3072]       2,362,368   
    GELU-12        [[1, 197, 3072]]      [1, 197, 3072]           0       
   Dropout-48      [[1, 197, 3072]]      [1, 197, 3072]           0       
   Linear-48       [[1, 197, 3072]]      [1, 197, 768]        2,360,064   
   Dropout-49      [[1, 197, 768]]       [1, 197, 768]            0       
     Mlp-12        [[1, 197, 768]]       [1, 197, 768]            0       
EncoderLayer-12    [[1, 197, 768]]       [1, 197, 768]            0       
  LayerNorm-25     [[1, 197, 768]]       [1, 197, 768]          1,536     
   Encoder-1       [[1, 197, 768]]       [1, 197, 768]            0       
   Linear-49          [[1, 768]]            [1, 768]           590,592    
     ReLU-1           [[1, 768]]            [1, 768]              0       
   Dropout-50         [[1, 768]]            [1, 768]              0       
   Linear-50          [[1, 768]]           [1, 1000]           769,000    
   Dropout-51        [[1, 1000]]           [1, 1000]              0       
   Classify-1         [[1, 768]]           [1, 1000]              0       
============================================================================
Total params: 87,158,248
Trainable params: 87,158,248
Non-trainable params: 0
----------------------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 423.52
Params size (MB): 332.48
Estimated Total Size (MB): 756.57
----------------------------------------------------------------------------
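
The parameter counts in the summary can be reproduced by hand from the layer shapes. The sketch below assumes the default hyperparameters of the ViT class above and biases on every Linear layer; the helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def vit_param_count(image_size=224, patch_size=16, in_channels=3,
                    num_classes=1000, embed_dim=768, depth=12, mlp_ratio=4):
    n_tokens = (image_size // patch_size) ** 2 + 1              # 197
    # Patch embedding: conv kernel + bias, class token, position embeddings
    conv = in_channels * patch_size * patch_size * embed_dim + embed_dim
    cls_and_pos = embed_dim + n_tokens * embed_dim
    layer_norm = 2 * embed_dim                                  # scale + shift
    hidden = int(embed_dim * mlp_ratio)
    encoder_layer = (2 * layer_norm
                     + linear_params(embed_dim, 3 * embed_dim)  # qkv
                     + linear_params(embed_dim, embed_dim)      # proj
                     + linear_params(embed_dim, hidden)         # fc1
                     + linear_params(hidden, embed_dim))        # fc2
    head = (linear_params(embed_dim, embed_dim)
            + linear_params(embed_dim, num_classes))
    # Final term order: embedding, encoder stack, final LayerNorm, classifier
    return conv + cls_and_pos + depth * encoder_layer + layer_norm + head

total = vit_param_count()
print(total)                      # 87158248
print(round(total * 4 / 2**20, 2))  # 332.48 (MB at 4 bytes per float32 parameter)
```

Both values agree with the "Total params" and "Params size (MB)" lines of the summary.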

Hi, I’m chenpan
I’m currently studying Deep Learning
I’m interested in Computer Vision
Huazhong University of Science and Technology
Studying with me:
AI Studio and CSDN

Article link: https://blog.csdn.net/m0_63642362/article/details/121848326
