class BertModel(BertPreTrainedModel):
    """BERT model ("Bidirectional Encoder Representations from Transformers").

    Params:
        config: a BertConfig class instance with the configuration to build a new model

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary (see the token preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` token and type 1 corresponds to
            a `sentence B` token (see the BERT paper for more details). Defaults to all zeros.
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with values
            selected in [0, 1]: 1 for real tokens and 0 for padding. Use it when a sequence is shorter than the
            longest sequence in the current batch. It is the mask typically used for attention when
            a batch has sentences of varying length. Defaults to all ones.
        `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

    Outputs: Tuple of (encoded_layers, pooled_output)
        `encoded_layers`: controlled by the `output_all_encoded_layers` argument:
            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded hidden states at the end
                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large); each
                encoded hidden state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size].
            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden states corresponding
                to the last attention block, of shape [batch_size, sequence_length, hidden_size].
        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
            classifier pretrained on top of the hidden state associated with the first token of the
            input (`[CLS]`) to train on the Next-Sentence task (see the BERT paper).

    Example usage:
    ```python
    import torch

    # Inputs have already been converted into WordPiece token ids.
    # The second sequence is one token shorter, so it is padded with id 0.
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    # 1 marks a real token, 0 marks padding.
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    # 0 marks a `sentence A` token, 1 marks a `sentence B` token.
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    # `modeling` is the module that defines BertConfig and BertModel.
    config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = modeling.BertModel(config=config)
    # With the default output_all_encoded_layers=True, the first output is a list with one tensor per layer.
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
    ```
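
Building the padded inputs: the sketch below shows one way the `input_ids`, `token_type_ids` and `attention_mask` values described above could be produced from sequences of unequal length. It is an illustration under stated assumptions, not part of this module: `pad_batch` is a hypothetical helper, padding id 0 and padding type 0 are assumed, and plain Python lists are used so the snippet runs without torch.
```python
def pad_batch(token_ids, segment_ids, pad_id=0):
    # Hypothetical helper: pad every sequence to the longest one in the batch.
    max_len = max(len(seq) for seq in token_ids)
    input_ids, token_type_ids, attention_mask = [], [], []
    for seq, seg in zip(token_ids, segment_ids):
        n_pad = max_len - len(seq)
        # Padded positions get the padding id (assumed to be 0 here).
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # Padded positions are assigned type 0 (an assumption of this sketch).
        token_type_ids.append(list(seg) + [0] * n_pad)
        # 1 for real tokens, 0 for padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, token_type_ids, attention_mask


input_ids, token_type_ids, input_mask = pad_batch(
    token_ids=[[31, 51, 99], [15, 5]],
    segment_ids=[[0, 0, 1], [0, 1]],
)
```
Each returned list can then be wrapped in `torch.LongTensor` before being passed to the model.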
    """
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=True, checkpoint_activations=False):
        # No mask given: treat every position as a real token.
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        # No token types given: treat every token as `sentence A`.
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a 3D attention mask from a 2D tensor mask.