Tokens-to-Token (T2T)
torchmil.nn.transformers.T2TLayer
Bases: Module
Tokens-to-Token (T2T) Transformer layer from "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet".
__init__(in_dim, out_dim=None, att_dim=512, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(1, 1), n_heads=4, use_mlp=True, dropout=0.0)
Parameters:
-
in_dim(int) –Input dimension.
-
out_dim(int, default:None) –Output dimension. If None, the output dimension will be kernel_size[0] * kernel_size[1] * att_dim.
-
att_dim(int, default:512) –Attention dimension.
-
kernel_size(tuple[int, int], default:(3, 3)) –Kernel size.
-
stride(tuple[int, int], default:(1, 1)) –Stride.
-
padding(tuple[int, int], default:(2, 2)) –Padding.
-
dilation(tuple[int, int], default:(1, 1)) –Dilation.
-
n_heads(int, default:4) –Number of heads.
-
use_mlp(bool, default:True) –Whether to use feedforward layer.
-
dropout(float, default:0.0) –Dropout rate.
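The default for out_dim depends on two other parameters, so it can help to work it out once. The following is a minimal sketch of that rule only; `resolve_out_dim` is a hypothetical helper written for illustration and is not part of torchmil.

```python
def resolve_out_dim(out_dim=None, att_dim=512, kernel_size=(3, 3)):
    """Mirror the documented rule: if out_dim is None, use
    kernel_size[0] * kernel_size[1] * att_dim.

    Hypothetical helper for illustration; not a torchmil function.
    """
    if out_dim is None:
        return kernel_size[0] * kernel_size[1] * att_dim
    return out_dim


# With the documented defaults: 3 * 3 * 512
print(resolve_out_dim())  # 4608

# An explicit out_dim is returned unchanged
print(resolve_out_dim(out_dim=256))  # 256
```

With the default arguments the output features are nine times wider than att_dim, so passing an explicit out_dim may be worth considering when memory is tight.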
forward(X)
Parameters:
-
X(Tensor) –Input tensor of shape (batch_size, seq_len, in_dim).
Returns:
Y: Output tensor of shape (batch_size, new_seq_len, out_dim). If out_dim is None, the last dimension is att_dim * kernel_size[0] * kernel_size[1].
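This page does not state how new_seq_len is derived. The sketch below estimates it under two assumptions that are not confirmed here: that the seq_len tokens are arranged on a square grid, and that the grid is split into overlapping patches following the output-size formula of `torch.nn.Unfold`. Both function names are hypothetical; check the result against the actual output shape of the layer before relying on it.

```python
import math


def unfold_out_size(size, kernel, stride, padding, dilation):
    """Number of sliding-window positions along one spatial dimension,
    using the output-size formula documented for torch.nn.Unfold."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def estimate_new_seq_len(seq_len, kernel_size=(3, 3), stride=(1, 1),
                         padding=(2, 2), dilation=(1, 1)):
    """Estimate new_seq_len, ASSUMING a square token grid and an
    Unfold-style split (an assumption, not confirmed by this page)."""
    side = math.isqrt(seq_len)
    if side * side != seq_len:
        raise ValueError("this sketch assumes seq_len is a perfect square")
    h = unfold_out_size(side, kernel_size[0], stride[0], padding[0], dilation[0])
    w = unfold_out_size(side, kernel_size[1], stride[1], padding[1], dilation[1])
    return h * w


# 196 tokens -> 14 x 14 grid; the defaults give a 16 x 16 grid of patches
print(estimate_new_seq_len(196))  # 256

# A stride of 2 with padding 1 halves each side: 14 -> 7
print(estimate_new_seq_len(196, stride=(2, 2), padding=(1, 1)))  # 49
```

Under these assumptions the default stride of (1, 1) with padding (2, 2) lengthens the sequence rather than shortening it, so a larger stride would be needed to reduce the token count.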