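An Introduction to BERT and First Steps - FlyAI

I. Introduction to BERT

BERT has drawn growing attention in the deep learning community, mainly because it set new state-of-the-art results on 11 NLP tasks. Its core idea is to learn good feature representations for words by running self-supervised learning on a massive corpus. Self-supervised learning here means supervised learning on data that has no human annotation. Once pre-training is done, the representations learned by BERT can be used directly as the word-embedding features for a specific NLP task. In this sense, BERT gives us a model for transfer learning to other tasks: it can be fine-tuned for the task at hand, or have part of its weights frozen and serve as a feature extractor for the downstream task.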

Paper and project page: the BERT project page (see [3]).

PyTorch version: installing and using the PyTorch version (see [4]).

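II. BERT's Pre-training Process and Network Structure

1) BERT stands for Bidirectional Encoder Representations from Transformers; that is, the main network structure in BERT is the Transformer.

A brief overview of the Transformer: it is an encoder-decoder architecture made up of a stack of encoders and a stack of decoders. In the original Transformer architecture figure, the left part is the encoder, built mainly from Multi-Head Attention and fully connected layers; its job is to turn the input sentence into feature vectors. The right part is the decoder, whose inputs are the encoder output and the expected prediction (the target sequence); it is built mainly from Masked Multi-Head Attention, Multi-Head Attention, and fully connected layers. As its name suggests, BERT itself uses only the encoder part.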

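2) BERT pre-training task 1: MLM (Masked Language Model)

MLM means that, during pre-training, some words in the input are randomly masked out, i.e. replaced with the [MASK] token, and the model is trained to predict each masked word from its context in the sentence. The paper notes that MLM training closely resembles the cloze (fill-in-the-blank) exercises we do when learning sentences and language. It lets the model learn word-level features from a large corpus. Concretely, once a word has been chosen for masking (the paper chooses 15% of the tokens), it is replaced with [MASK] with 80% probability, replaced with another random word with 10% probability, and left unchanged with 10% probability. For example: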

80%: my dog is hairy -> my dog is [MASK]

10%: my dog is hairy -> my dog is apple

10%: my dog is hairy -> my dog is hairy
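
The 80/10/10 replacement rule can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used to pre-train BERT: the function name mask_tokens, the toy vocabulary, and the select_prob parameter (15% is the figure from the BERT paper) are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (masked_tokens, labels) following the 80/10/10 rule.

    labels[i] is the original token at positions chosen for prediction
    and None everywhere else. Hypothetical helper, for illustration only.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never mask the special markers; skip positions that are not selected.
        if tok in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original word
    return masked, labels

masked, labels = mask_tokens(["my", "dog", "is", "hairy"], ["apple", "cat", "runs"])
print(masked, labels)
```

Because the choice is random, the printed result differs from run to run; passing a seeded random.Random as rng makes it reproducible.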

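3) BERT pre-training task 2: NSP (Next Sentence Prediction)

In Next Sentence Prediction (NSP), the model is given sentences A and B and must decide whether B is the sentence that follows A. If so it outputs 'IsNext', otherwise 'NotNext'. Training on the corpus in this way lets the model learn semantic information at the paragraph level, much like the paragraph-ordering exercises in language classes.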
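
To make the task concrete, here is a minimal sketch of how NSP training pairs could be built from an ordered list of sentences. The helper name make_nsp_pairs and the 50/50 split between positive and negative pairs are assumptions for the example (the paper uses a 50/50 split and draws negatives from the whole corpus; this sketch simply draws them from the same list).

```python
import random

def make_nsp_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, label) triples from ordered sentences.

    Hypothetical helper: about half of the pairs use the true next sentence
    ('IsNext'), the rest use some other sentence from the list ('NotNext').
    """
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to draw a negative example")
    if rng is None:
        rng = random.Random()
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Any sentence except A itself and the true next sentence.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "NotNext"))
    return pairs

doc = ["I went to the store.", "I bought some milk.", "Penguins cannot fly.", "The milk was cold."]
for a, b, label in make_nsp_pairs(doc):
    print(label, "|", a, "|", b)
```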

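III. Usage Examples for the PyTorch Version of BERT

1. Training a BERT model yourself on a large corpus is very difficult, so Google has open-sourced pre-trained models; we only need to fine-tune them for our specific task.

The pre-trained models provided are: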

BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters (Not available yet. Needs to be re-generated)

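All of these models can be loaded directly; for details on using each model, see Google's documentation.

2. The main classes of pytorch_pretrained_bert and how they are organized (the library also ships with many examples)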
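
As a sanity check on the sizes listed above, the parameter counts can be estimated from the layer count and hidden size. The sketch below is a back-of-the-envelope calculation, not something taken from the BERT code: it assumes the uncased vocabulary of 30522 tokens, 512 position embeddings, 2 segment types, a feed-forward width of 4x the hidden size, and a pooler layer on top. The function name bert_param_count is made up for the example.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Rough parameter count of a BERT encoder (all defaults are assumptions)."""
    ffn = 4 * hidden            # assumed feed-forward width
    layer_norm = 2 * hidden     # scale and bias
    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two fully connected layers (weights + biases), then LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(bert_param_count(12, 768))    # 109482240
print(bert_param_count(24, 1024))   # 335141888
```

Under these assumptions the estimates come to roughly 109.5M and 335.1M, close to the 110M and 340M quoted in the list. The number of attention heads does not change the count, because the heads split the same projection matrices. The Cased models use a different vocabulary, so their counts differ slightly.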

BertModel - raw BERT Transformer model (fully pre-trained),

BertForMaskedLM - BERT Transformer with the pre-trained masked language modeling head on top (fully pre-trained),

BertForNextSentencePrediction - BERT Transformer with the pre-trained next sentence prediction classifier on top (fully pre-trained),

BertForPreTraining - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (fully pre-trained),

BertForSequenceClassification - BERT Transformer with a sequence classification head on top (BERT Transformer is pre-trained, the sequence classification head is only initialized and has to be trained),

BertForMultipleChoice - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is pre-trained, the multiple choice classification head is only initialized and has to be trained),

BertForTokenClassification - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained),

BertForQuestionAnswering - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained).

3. Code examples

1) Tokenize with BertTokenizer and convert the tokens to vocabulary ids

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# Load the vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize (the text already contains the special [CLS] and [SEP] markers)
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
# Convert tokens to vocabulary indices
ids = tokenizer.convert_tokens_to_ids(tokenized_text)
# Segment ids: 0 for the first sentence (through its [SEP]), 1 for the second
seg = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Convert the inputs to tensors (the outer list adds a batch dimension)
ids = torch.tensor([ids])
seg = torch.tensor([seg])
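
The segment ids above are written out by hand, which is easy to get wrong for other inputs. As a sketch, they can be derived from the token list by switching segments after the first [SEP]. The helper build_segment_ids is hypothetical and not part of pytorch_pretrained_bert; the token list is what the uncased tokenizer is expected to produce for the sentence above.

```python
def build_segment_ids(tokens, sep_token="[SEP]"):
    """Assign 0 to tokens up to and including the first [SEP], 1 afterwards."""
    segment_ids = []
    current = 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == sep_token:
            # BERT only distinguishes two segments, so cap the id at 1.
            current = min(current + 1, 1)
    return segment_ids

tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']
print(build_segment_ids(tokens))
```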

2) Use BertModel with pre-trained weights to get the output of each layer

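# Load the pre-trained weights
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Use the GPU when one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ids = ids.to(device)
seg = seg.to(device)
model.to(device)

# Get the hidden-layer outputs (the second return value is the pooled [CLS] output)
with torch.no_grad():
    encoded_layers, _ = model(ids, seg)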

3) We can then attach our own structure to the final hidden layer to complete the downstream task.

References:
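
As one possible illustration of attaching your own structure, the sketch below puts a small sentence-classification head on top of the last hidden layer. The class name SentenceClassifierHead, the two output labels, and the dropout rate are assumptions for the example. To keep it runnable without downloading weights, a random tensor stands in for the real BERT output; with the code above one would typically pass encoded_layers[-1] instead, since BertModel in pytorch_pretrained_bert returns a list with one tensor per layer by default.

```python
import torch
import torch.nn as nn

class SentenceClassifierHead(nn.Module):
    """Hypothetical head: classify a sentence from the [CLS] position."""

    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden):
        # last_hidden: (batch, seq_len, hidden); position 0 holds [CLS].
        cls_vector = last_hidden[:, 0]
        return self.classifier(self.dropout(cls_vector))

# Stand-in for the last hidden layer of BERT-Base: batch 1, 14 tokens, 768 features.
last_hidden = torch.randn(1, 14, 768)

head = SentenceClassifierHead()
head.eval()
with torch.no_grad():
    logits = head(last_hidden)
print(logits.shape)  # torch.Size([1, 2])
```

During fine-tuning the head would be trained together with BERT, or on top of frozen BERT weights when BERT is used purely as a feature extractor.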

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

[2] Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

[3] https://github.com/google-research/bert

[4] https://github.com/huggingface/pytorch-pretrained-BERT