10.7 Transformer.ipynb

Two kinds of decoders

  • Autoregressive (AT): decodes step by step; each time step needs the output of the previous time step as input
  • Non-autoregressive (NAT): decodes in parallel; all decoder inputs are fed in at once
  • NAT generally performs worse than AT (see the toy sketch after this list)
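A minimal toy sketch contrasting the two schemes (the logits table and toy_step below are made up purely for illustration and are not part of the book's code): the autoregressive loop feeds each prediction back in, while the non-autoregressive variant predicts every position in one parallel pass.

import torch

vocab_size, num_steps = 10, 5
torch.manual_seed(0)
logits_table = torch.randn(vocab_size, vocab_size)  # fake next-token scores

def toy_step(prev_token):
    """Toy stand-in for one decoder step: scores for the next token."""
    return logits_table[prev_token]

# Autoregressive: one token at a time, each step conditioned on the last output
tokens = [0]  # <bos>
for _ in range(num_steps):
    tokens.append(int(toy_step(tokens[-1]).argmax()))
print('AT :', tokens[1:])

# Non-autoregressive: all positions predicted in parallel (here each position is
# conditioned only on the same <bos> token), so no step waits for another
parallel = logits_table[torch.zeros(num_steps, dtype=torch.long)].argmax(dim=1)
print('NAT:', parallel.tolist())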

Why use multi-head attention?

  • We hope that each head may attend to different parts of the input and learn a different behavior (a different perspective); combining these behaviors lets the model capture dependencies of various ranges within the sequence (e.g., short-range and long-range dependencies). The sketch below inspects the per-head attention weights.
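A minimal sketch, using PyTorch's own nn.MultiheadAttention rather than d2l's class (and assuming a PyTorch version recent enough to accept average_attn_weights), just to show that each head carries its own attention matrix:

import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
X = torch.randn(1, 6, 16)  # (batch_size, seq_len, embed_dim)
_, weights = mha(X, X, X, need_weights=True, average_attn_weights=False)
print(weights.shape)  # torch.Size([1, 4, 6, 6]): one 6x6 attention map per head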

The decoder's output is a vector; how does it become a word?

  • Through the decoder's final fully connected layer, which maps the vector to vocabulary-sized logits from which the next token is chosen (see the sketch below)
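A minimal sketch of that final projection with toy sizes, assuming greedy (argmax) selection:

import torch
from torch import nn

num_hiddens, vocab_size = 32, 200           # toy sizes
dense = nn.Linear(num_hiddens, vocab_size)  # plays the role of TransformerDecoder.dense below
h = torch.randn(1, num_hiddens)             # one decoder output vector
probs = dense(h).softmax(dim=-1)            # distribution over the vocabulary
print(probs.argmax(dim=-1))                 # greedy choice of the next token id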

Positionwise feed-forward network

  • It is essentially an MLP
  • How does it differ from an ordinary MLP?
    • There is no real difference: the same two dense layers are applied to every position (i.e., along the last dimension) independently, as the sketch after this list checks
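A minimal sketch (toy sizes) checking that applying a shared MLP to a whole (batch, steps, features) tensor gives the same result as applying it to each position separately:

import torch
from torch import nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))
X = torch.randn(2, 3, 4)                       # (batch_size, num_steps, features)
whole = mlp(X)                                 # all positions at once
per_pos = torch.stack([mlp(X[:, t]) for t in range(X.shape[1])], dim=1)
print(torch.allclose(whole, per_pos))          # True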

Since the fixed positional encodings take values between -1 and 1, the learned input embeddings are first rescaled by multiplying them by the square root of the embedding dimension, and only then added to the positional encodings.

import math
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l
#@save
class PositionWiseFFN(nn.Module):
    """基于位置的前馈网络"""
    def __init__(self, ffn_num_input, ffn_num_hiddens, ffn_num_outputs,
                 **kwargs):
        super(PositionWiseFFN, self).__init__(**kwargs)
        self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_outputs)

    def forward(self, X):
        # X shape: (batch_size, num_steps, ffn_num_input)
        return self.dense2(self.relu(self.dense1(X)))
ffn = PositionWiseFFN(4, 4, 8)
ffn.eval()
ffn(torch.ones((2, 3, 4))).shape
torch.Size([2, 3, 8])
ln = nn.LayerNorm(2)
bn = nn.BatchNorm1d(2)
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# Compute the mean and variance of X in training mode
print('layer norm:', ln(X), '\nbatch norm:', bn(X))
layer norm: tensor([[-1.0000,  1.0000],
        [-1.0000,  1.0000]], grad_fn=<NativeLayerNormBackward0>) 
batch norm: tensor([[-1.0000, -1.0000],
        [ 1.0000,  1.0000]], grad_fn=<NativeBatchNormBackward0>)
#@save
class AddNorm(nn.Module):
    """残差连接后进行层规范化"""
    def __init__(self, normalized_shape, dropout, **kwargs):
        super(AddNorm, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)
add_norm = AddNorm([3, 4], 0.5)
add_norm.eval()
add_norm(torch.ones((2, 3, 4)), torch.ones((2, 3, 4)))
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]], grad_fn=<NativeLayerNormBackward0>)
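A quick sketch of why the output above is all zeros: in eval mode dropout is the identity, so the residual sum is a constant tensor, and layer normalization maps any constant input to (nearly exactly) zero.

import torch
from torch import nn

ln = nn.LayerNorm([3, 4])
print(ln(torch.full((2, 3, 4), 2.0)))  # all zeros (up to eps)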
#@save
class EncoderBlock(nn.Module):
    """transformer编码器块"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, use_bias=False, **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        self.attention = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout,
            use_bias)
        self.addnorm1 = AddNorm(norm_shape, dropout)
        self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens, num_hiddens)
        self.addnorm2 = AddNorm(norm_shape, dropout)

    def forward(self, X, valid_lens):
        Y = self.addnorm1(X, self.attention(X, X, X, valid_lens))
        return self.addnorm2(Y, self.ffn(Y))
X = torch.ones((2, 100, 24))
valid_lens = torch.tensor([3, 2])
encoder_blk = EncoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5)
encoder_blk.eval()
encoder_blk(X, valid_lens).shape
torch.Size([2, 100, 24])
#@save
class TransformerEncoder(d2l.Encoder):
    """transformer编码器"""
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, use_bias=False, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module("block"+str(i),
                EncoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, use_bias))

    def forward(self, X, valid_lens, *args):
        # Since the positional encoding values are between -1 and 1,
        # the embedding values are multiplied by the square root of the
        # embedding dimension to rescale, and then added to the positional encoding
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self.attention_weights = [None] * len(self.blks)
        for i, blk in enumerate(self.blks):
            X = blk(X, valid_lens)
            self.attention_weights[i] = blk.attention.attention.attention_weights
        return X
encoder = TransformerEncoder(
    200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5)
encoder.eval()
encoder(torch.ones((2, 100), dtype=torch.long), valid_lens).shape
torch.Size([2, 100, 24])
class DecoderBlock(nn.Module):
    """解码器中第i个块"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, i, **kwargs):
        super(DecoderBlock, self).__init__(**kwargs)
        self.i = i
        self.attention1 = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout)
        self.addnorm1 = AddNorm(norm_shape, dropout)
        self.attention2 = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout)
        self.addnorm2 = AddNorm(norm_shape, dropout)
        self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,
                                   num_hiddens)
        self.addnorm3 = AddNorm(norm_shape, dropout)

    def forward(self, X, state):
        enc_outputs, enc_valid_lens = state[0], state[1]
        # During training, all the tokens of the output sequence are processed
        # at the same time, so state[2][self.i] is initialized as None.
        # During prediction, the output sequence is decoded token by token,
        # so state[2][self.i] contains the representations decoded by the
        # i-th block up to the current time step
        if state[2][self.i] is None:
            key_values = X
        else:
            key_values = torch.cat((state[2][self.i], X), axis=1)
        state[2][self.i] = key_values
        if self.training:
            batch_size, num_steps, _ = X.shape
            # Shape of dec_valid_lens: (batch_size, num_steps), where every
            # row is [1, 2, ..., num_steps], so each query position attends
            # only to positions up to and including itself (causal masking)
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
        else:
            dec_valid_lens = None

        # Self-attention
        X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
        Y = self.addnorm1(X, X2)
        # Encoder-decoder attention.
        # Shape of enc_outputs: (batch_size, num_steps, num_hiddens)
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
        Z = self.addnorm2(Y, Y2)
        return self.addnorm3(Z, self.ffn(Z)), state
decoder_blk = DecoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5, 0)
decoder_blk.eval()
X = torch.ones((2, 100, 24))
state = [encoder_blk(X, valid_lens), valid_lens, [None]]
decoder_blk(X, state)[0].shape
torch.Size([2, 100, 24])
class TransformerDecoder(d2l.AttentionDecoder):
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module("block"+str(i),
                DecoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, i))
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, enc_valid_lens, *args):
        return [enc_outputs, enc_valid_lens, [None] * self.num_layers]

    def forward(self, X, state):
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self._attention_weights = [[None] * len(self.blks) for _ in range(2)]
        for i, blk in enumerate(self.blks):
            X, state = blk(X, state)
            # Decoder self-attention weights
            self._attention_weights[0][
                i] = blk.attention1.attention.attention_weights
            # Encoder-decoder attention weights
            self._attention_weights[1][
                i] = blk.attention2.attention.attention_weights
        return self.dense(X), state

    @property
    def attention_weights(self):
        return self._attention_weights
num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10
lr, num_epochs, device = 0.005, 200, d2l.try_gpu()
ffn_num_input, ffn_num_hiddens, num_heads = 32, 64, 4
key_size, query_size, value_size = 32, 32, 32
norm_shape = [32]

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)

encoder = TransformerEncoder(
    len(src_vocab), key_size, query_size, value_size, num_hiddens,
    norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
    num_layers, dropout)
decoder = TransformerDecoder(
    len(tgt_vocab), key_size, query_size, value_size, num_hiddens,
    norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
    num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)
d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)
loss 0.032, 5238.0 tokens/sec on cpu
[Training loss curve plotted by d2l.train_seq2seq (Matplotlib figure)]
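The notebook stops after training. As a sketch of a quick evaluation (reusing the d2l.predict_seq2seq and d2l.bleu helpers from the seq2seq chapter, assuming this d2l version provides them), a few English sentences could be translated into French with the trained model:

engs = ['go .', "i lost .", "he's calm .", "i'm home ."]
fras = ['va !', "j'ai perdu .", 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    # Greedy, token-by-token decoding with the trained encoder-decoder
    translation, dec_attention_weight_seq = d2l.predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device, True)
    print(f'{eng} => {translation}, ',
          f'bleu {d2l.bleu(translation, fra, k=2):.3f}')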