hub / github.com/deepspeedai/DeepSpeedExamples / BertModel

Class BertModel

bing_bert/nvidia/modeling.py:753–833

BERT model ("Bidirectional Encoder Representations from Transformers"). Params: `config`: a BertConfig class instance with the configuration to build a new model. Inputs: `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary.
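The [batch_size, sequence_length] shape means every sequence in a batch has to be padded to a common length, with a 0/1 mask marking which positions are real tokens. The following is a minimal, dependency-free sketch of that preparation step. The helper name `pad_batch` and the padding id `0` are assumptions made for illustration; they are not part of `modeling.py`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build the matching 0/1 mask.

    `pad_batch` and `pad_id=0` are hypothetical; adjust them to your tokenizer.
    """
    max_len = max(len(seq) for seq in sequences)
    # Right-pad every sequence with pad_id up to the longest sequence.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Two sequences of different lengths, already converted to WordPiece ids.
input_ids, attention_mask = pad_batch([[31, 51, 99], [15, 5]])
```

The resulting nested lists can then be wrapped with `torch.LongTensor(...)` before being passed to the model.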

Source from the content-addressed store, hash-verified

class BertModel(BertPreTrainedModel):
    """BERT model ("Bidirectional Encoder Representations from Transformers").

    Params:
        config: a BertConfig class instance with the configuration to build a new model

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary (see the token preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` token and type 1 corresponds to
            a `sentence B` token (see the BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with values
            selected in [0, 1]. It is the mask to use when an input sequence is shorter than the longest
            input sequence in the current batch, i.e. the mask typically used for attention when
            a batch has sentences of varying length.
        `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

    Outputs: Tuple of (encoded_layers, pooled_output)
        `encoded_layers`: controlled by the `output_all_encoded_layers` argument:
            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded hidden states at the end
                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large); each
                encoded hidden state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden states corresponding
                to the last attention block, of shape [batch_size, sequence_length, hidden_size],
        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
            classifier pretrained on top of the hidden state associated with the first token of the
            input (`[CLS]`) to train on the Next-Sentence task (see the BERT paper).

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = modeling.BertModel(config=config)
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=True, checkpoint_activations=False):
        # Default to attending to every position when no mask is given.
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        # Default to treating every token as part of sentence A.
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a 3D attention mask from a 2D tensor mask.
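The listing stops at the comment about building the attention mask, so the rest of `forward` is not shown here. One common way to apply a 0/1 padding mask in attention, assumed here rather than confirmed by the listing, is to turn it into additive biases that are added to the raw attention scores before the softmax. The sketch below illustrates that idea in plain Python for a single query position. The helper names, the penalty value `-10000.0`, and the example scores are all hypothetical.

```python
import math


def additive_mask(mask_row, penalty=-10000.0):
    """Turn a 0/1 padding mask into additive attention biases.

    1 (real token) maps to 0.0 and 0 (padding) maps to `penalty`.
    The penalty value is an assumption, not taken from the listing.
    """
    return [(1.0 - m) * penalty for m in mask_row]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Mask for a sequence with two real tokens followed by one padding slot.
mask_row = [1, 1, 0]
raw_scores = [0.5, 1.5, 3.0]  # hypothetical attention scores for one query

bias = additive_mask(mask_row)
# Adding the bias pushes the padded position far below the real tokens,
# so it receives a negligible share of the attention weights.
weights = softmax([s + b for s, b in zip(raw_scores, bias)])
```

In a tensor implementation the same biases would typically be reshaped so that they broadcast across attention heads and query positions; that detail is omitted here to keep the sketch free of dependencies.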

Callers 7

__init__ · Method · 0.70
__init__ · Method · 0.70
__init__ · Method · 0.70
__init__ · Method · 0.70
__init__ · Method · 0.70
__init__ · Method · 0.70
__init__ · Method · 0.70

Calls

no outgoing calls

Tested by

no test coverage detected