hub / github.com/deepspeedai/DeepSpeedExamples / BertModel

Class BertModel

bing_bert/nvidia/modeling.py:753–833

BERT model ("Bidirectional Encoder Representations from Transformers"). Params: `config`: a BertConfig class instance with the configuration to build a new model. Inputs: `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary.

Source from the content-addressed store, hash-verified

class BertModel(BertPreTrainedModel):
    """BERT model ("Bidirectional Encoder Representations from Transformers").

    Params:
        config: a BertConfig class instance with the configuration to build a new model

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary (see the token preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has sentences of varying length.
        `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

    Outputs: Tuple of (encoded_layers, pooled_output)
        `encoded_layers`: controlled by the `output_all_encoded_layers` argument:
            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
                to the last attention block of shape [batch_size, sequence_length, hidden_size],
        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
            classifier pretrained on top of the hidden state associated to the first token of the
            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).

    Example usage:
    ```python
    # Inputs have already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = modeling.BertModel(config=config)
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=True, checkpoint_activations=False):
        # Default to attending to every position when no mask is given.
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        # Default to treating every token as part of `sentence A`.
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a 3D attention mask from a 2D tensor mask.
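The listing stops at the comment about building the attention mask, so the exact steps used in this file are not shown here. As a rough illustration only, the sketch below shows one common way BERT implementations turn a 2D `[batch_size, sequence_length]` mask of 0s and 1s into an additive mask that broadcasts over attention scores. The helper name `extend_attention_mask`, the `-10000.0` fill value, and the dummy score tensor are assumptions made for this example, not code taken from `modeling.py`.

```python
import torch


def extend_attention_mask(attention_mask, dtype=torch.float32):
    """Hypothetical helper: 0/1 mask of shape [batch, seq] -> additive mask.

    The result has shape [batch, 1, 1, seq] so that it broadcasts over
    attention heads and query positions when added to raw attention scores.
    """
    extended = attention_mask[:, None, None, :].to(dtype)
    # Kept positions (1) become 0.0; padded positions (0) become a large
    # negative value, which softmax maps to (almost) zero probability.
    return (1.0 - extended) * -10000.0


# Same toy batch as in the docstring example: the second row has one pad token.
input_ids = torch.tensor([[31, 51, 99], [15, 5, 0]])
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

# Defaults mirrored from forward(): all-ones mask, all-zeros token types.
default_mask = torch.ones_like(input_ids)
default_token_type_ids = torch.zeros_like(input_ids)

extended = extend_attention_mask(attention_mask)

# Dummy attention scores with shape [batch, heads, query, key].
scores = torch.zeros(2, 1, 3, 3)
probs = torch.softmax(scores + extended, dim=-1)
```

With this construction, the padded key position in the second sequence receives essentially no attention weight, while the remaining positions in each row still sum to 1.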

Callers 7

__init__ · Method · 0.70

Calls

no outgoing calls

Tested by

no test coverage detected