Last updated: 2020-04-28 15:38 | Views: 9654
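1) BERT (Bidirectional Encoder Representations from Transformers): as the name indicates, the main network structure in BERT is the Transformer, and BERT uses only its encoder stack.
A brief overview of the Transformer: it is an encoder-decoder architecture made up of several stacked encoders and decoders. In the usual architecture diagram, the left side is the encoder, which consists mainly of Multi-Head Attention and fully connected (feed-forward) layers; its job is to turn the input sentence into feature vectors. The right side is the decoder, whose inputs are the encoder output and the expected prediction target; it consists mainly of Masked Multi-Head Attention, Multi-Head Attention, and fully connected layers.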
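To make the attention building block more concrete, here is a minimal sketch of scaled dot-product attention for a single head, written in plain Python. It is only an illustration: Multi-Head Attention runs several such heads in parallel on learned linear projections of the input, which are left out here, and the function names are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors (hypothetical helper).

    Each query is compared with every key; the resulting weights are used
    to take a weighted average of the value vectors.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of the query with each key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three positions with two-dimensional vectors (self-attention)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors.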
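2) BERT pre-training task 1: MLM (Masked Language Model)
During pre-training, 15% of the tokens in each sequence are selected at random, and the model has to predict the original token at each selected position. A selected token is handled in one of three ways:
80%: my dog is hairy -> my dog is [MASK] (replaced with the [MASK] token)
10%: my dog is hairy -> my dog is apple (replaced with a random token)
10%: my dog is hairy -> my dog is hairy (left unchanged)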
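The 80% / 10% / 10% rule can be sketched in a few lines of Python. This is a simplified illustration, not the original implementation: it selects each token independently with probability 0.15 rather than picking a fixed number of positions, and the function name, the toy vocabulary, and the `rng` parameter are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}  # never selected for prediction


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (masked_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = token  # the model must predict this original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
toy_vocab = ["apple", "dog", "hairy", "is", "my", "run"]
masked, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
print(masked)
print(labels)
```

Keeping some selected tokens unchanged or replacing them with random tokens means the model cannot rely on seeing `[MASK]`, which never appears during fine-tuning.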
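3) BERT pre-training task 2: NSP (Next Sentence Prediction)
In Next Sentence Prediction (NSP), the model is given a sentence pair (A, B) and has to decide whether B is the sentence that actually follows A. If it is, the output is 'IsNext'; otherwise it is 'NotNext'. Learning from the corpus in this way lets the model pick up information about how sentences relate to one another within a passage, much like the paragraph-ordering exercises in a language class.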
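A minimal sketch of how NSP training pairs could be built from a corpus is shown below. In the BERT paper, B is the true next sentence half of the time and a random sentence from the corpus otherwise; the function name, the data layout (a list of documents, each a list of sentences), and the choice to draw the random sentence from a different document are assumptions made for this example.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples for NSP.

    documents: list of documents, each a list of sentences (at least two documents).
    """
    rng = rng or random.Random()
    examples = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if rng.random() < 0.5:
                # Positive example: B really is the next sentence
                examples.append((sentence_a, doc[i + 1], "IsNext"))
            else:
                # Negative example: B is drawn from another document
                other_docs = [d for j, d in enumerate(documents) if j != doc_index]
                random_doc = rng.choice(other_docs)
                examples.append((sentence_a, rng.choice(random_doc), "NotNext"))
    return examples


documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
for example in make_nsp_examples(documents, rng=random.Random(0)):
    print(example)
```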
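1. Training a BERT model on a large corpus yourself is very difficult, so Google has open-sourced pre-trained models; we only need to fine-tune one of them on our specific task. The released models include: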
BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters (Not available yet. Needs to be re-generated)
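As a rough sanity check on the parameter counts above, the sizes can be reproduced with some arithmetic. The sketch below counts the weights of the encoder as it is usually implemented; the vocabulary size of 30522 (the uncased WordPiece vocabulary), the 512 position embeddings, the 2 segment types, and the feed-forward size of 4 x hidden are assumptions not stated in the list, and the function name is made up for this example.

```python
def bert_parameter_count(num_layers, hidden_size, vocab_size=30522,
                         max_positions=512, type_vocab_size=2):
    """Approximate parameter count of a BERT encoder (hypothetical helper)."""
    intermediate_size = 4 * hidden_size
    layer_norm = 2 * hidden_size  # scale and bias

    # Token, position and segment embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm

    # Self-attention: query, key, value and output projections, plus a LayerNorm
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Feed-forward block: two linear layers, plus a LayerNorm
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size
                    + layer_norm)
    encoder = num_layers * (attention + feed_forward)

    # Pooler: one linear layer applied to the [CLS] vector
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + encoder + pooler


base = bert_parameter_count(num_layers=12, hidden_size=768)
large = bert_parameter_count(num_layers=24, hidden_size=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the formula, because the heads only split the hidden dimension and do not add weights. The results land in the same range as the rounded 110M and 340M figures quoted in the list.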
2. The pytorch_pretrained_bert package provides the following PyTorch model classes:
BertModel - raw BERT Transformer model (fully pre-trained),
BertForMaskedLM - BERT Transformer with the pre-trained masked language modeling head on top (fully pre-trained),
BertForNextSentencePrediction - BERT Transformer with the pre-trained next sentence prediction classifier on top (fully pre-trained),
BertForPreTraining - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (fully pre-trained),
BertForSequenceClassification - BERT Transformer with a sequence classification head on top (BERT Transformer is pre-trained, the sequence classification head is only initialized and has to be trained),
BertForMultipleChoice - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is pre-trained, the multiple choice classification head is only initialized and has to be trained),
BertForTokenClassification - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained),
BertForQuestionAnswering - BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained).
3. Code examples
1) Use BertTokenizer to tokenize the text and convert the tokens to vocabulary ids

    import torch
    from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

    # Load the vocabulary
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Tokenize (the text already contains the special [CLS] and [SEP] markers)
    text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
    tokenized_text = tokenizer.tokenize(text)

    # Convert the tokens to vocabulary indices
    ids = tokenizer.convert_tokens_to_ids(tokenized_text)

    # Segment ids: 0 for the first sentence, 1 for the second
    seg = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

    # Convert the inputs to tensors with a batch dimension
    ids = torch.tensor([ids])
    seg = torch.tensor([seg])
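The segment ids in the example are written out by hand. A small helper can derive them from the token list by switching segments after each `[SEP]`. This is only a sketch: the function name is hypothetical, and the token list is typed in manually as the uncased tokenizer is expected to split the example sentence, so that the snippet runs without downloading the vocabulary.

```python
def build_segment_ids(tokens, sep_token="[SEP]"):
    """Assign segment 0 up to and including the first [SEP], then segment 1, and so on."""
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == sep_token:
            current += 1  # tokens after this separator belong to the next segment
    return segment_ids


tokens = ["[CLS]", "who", "was", "jim", "henson", "?", "[SEP]",
          "jim", "henson", "was", "a", "puppet", "##eer", "[SEP]"]
segment_ids = build_segment_ids(tokens)
print(segment_ids)
```

For this token list the helper gives seven 0s followed by seven 1s, matching the hand-written list.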
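2) Use the pre-trained BertModel weights to get the output of each layer of the network

    # Load the pre-trained model weights
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    # Move the inputs and the model to the GPU (requires a CUDA device)
    ids = ids.to('cuda')
    seg = seg.to('cuda')
    model.to('cuda')

    # Get the hidden-layer outputs (one tensor per encoder layer)
    with torch.no_grad():
        encoded_layers, _ = model(ids, seg)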
3) With the output of the final hidden layer in hand, we can attach our own layers on top of it to handle the downstream task.
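As one possible illustration of attaching your own structure, the sketch below puts a small classification head on the vector at the `[CLS]` position. The class name, the number of labels, and the dropout rate are assumptions for the example, and a random tensor stands in for `encoded_layers[-1]` so that the snippet runs without downloading the pre-trained weights.

```python
import torch
import torch.nn as nn


class SentenceClassifierHead(nn.Module):
    """Hypothetical classification head placed on BERT's last hidden layer."""

    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_layer):
        # last_hidden_layer has shape (batch, seq_len, hidden_size)
        cls_vector = last_hidden_layer[:, 0]  # vector at the [CLS] position
        return self.classifier(self.dropout(cls_vector))


# Stand-in for encoded_layers[-1]: batch of 1, 14 tokens, 768 hidden units
fake_last_layer = torch.randn(1, 14, 768)
head = SentenceClassifierHead()
head.eval()  # disable dropout for inference
with torch.no_grad():
    logits = head(fake_last_layer)
print(logits.shape)
```

In practice the head is trained together with BERT during fine-tuning; only the head starts from random weights.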