hub / github.com/tensorlayer/TensorLayer / sentence_to_token_ids

Function sentence_to_token_ids

tensorlayer/nlp.py:1016–1049

Convert a string to a list of integers representing token-ids. For example, the sentence "I have a dog" may be tokenized into ["I", "have", "a", "dog"], and with the vocabulary {"I": 1, "have": 2, "a": 4, "dog": 7} this function will return [1, 2, 4, 7].

(
    sentence, vocabulary, tokenizer=None, normalize_digits=True, UNK_ID=3, _DIGIT_RE=re.compile(br"\d")
)
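The lookup described above can be sketched without installing TensorLayer. This is a minimal illustration, not the library's code: `whitespace_tokenizer` and `lookup_token_ids` are hypothetical names, and the whitespace split is an assumed stand-in for `basic_tokenizer`, whose exact splitting rules are not shown on this page. The sentence and vocabulary keys are bytes, matching the "bytes format" noted in the docstring.

```python
UNK_ID = 3  # same default as the UNK_ID parameter in the signature above


def whitespace_tokenizer(sentence):
    """Hypothetical stand-in tokenizer: split a bytes sentence on whitespace."""
    return sentence.split()


def lookup_token_ids(sentence, vocabulary, tokenizer=whitespace_tokenizer, unk_id=UNK_ID):
    """Map each token to its id, falling back to unk_id for unknown tokens."""
    words = tokenizer(sentence)
    return [vocabulary.get(w, unk_id) for w in words]


vocab = {b"I": 1, b"have": 2, b"a": 4, b"dog": 7}

known = lookup_token_ids(b"I have a dog", vocab)
unknown = lookup_token_ids(b"I have a cat", vocab)

print(known)    # [1, 2, 4, 7]
print(unknown)  # [1, 2, 4, 3]  (b"cat" is not in the vocabulary, so it maps to UNK_ID)
```

Any token missing from the vocabulary maps to `UNK_ID`, so the output list always has one id per token.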

Source from the content-addressed store, hash-verified

def sentence_to_token_ids(
    sentence, vocabulary, tokenizer=None, normalize_digits=True, UNK_ID=3, _DIGIT_RE=re.compile(br"\d")
):
    """Convert a string to list of integers representing token-ids.

    For example, a sentence "I have a dog" may become tokenized into
    ["I", "have", "a", "dog"] and with vocabulary {"I": 1, "have": 2,
    "a": 4, "dog": 7} this function will return [1, 2, 4, 7].

    Parameters
    -----------
    sentence : tensorflow.python.platform.gfile.GFile Object
        The sentence in bytes format to convert to token-ids, see ``basic_tokenizer()`` and ``data_to_token_ids()``.
    vocabulary : dictionary
        Mapping tokens to integers.
    tokenizer : function
        A function to use to tokenize each sentence. If None, ``basic_tokenizer`` will be used.
    normalize_digits : boolean
        If true, all digits are replaced by 0.

    Returns
    --------
    list of int
        The token-ids for the sentence.

    """
    if tokenizer:
        words = tokenizer(sentence)
    else:
        words = basic_tokenizer(sentence)
    if not normalize_digits:
        return [vocabulary.get(w, UNK_ID) for w in words]
    # Normalize digits by 0 before looking words up in the vocabulary.
    return [vocabulary.get(re.sub(_DIGIT_RE, b"0", w), UNK_ID) for w in words]
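The digit-normalization branch in the listing above substitutes every digit with `0` using a bytes pattern before the vocabulary lookup. The sketch below isolates that step under stated assumptions: `normalize_token` is a hypothetical helper, and the vocabulary and tokens are made up for illustration. It also shows that a bytes pattern rejects `str` tokens, which suggests a custom tokenizer would need to return bytes when `normalize_digits=True`.

```python
import re

DIGIT_RE = re.compile(br"\d")  # same bytes pattern as the _DIGIT_RE default
UNK_ID = 3


def normalize_token(token):
    """Hypothetical helper: replace every digit in a bytes token with b'0'."""
    return DIGIT_RE.sub(b"0", token)


# Illustrative vocabulary: only the normalized four-digit form is present.
vocab = {b"in": 5, b"0000": 6}
tokens = [b"in", b"1999", b"2024", b"42"]

ids_normalized = [vocab.get(normalize_token(t), UNK_ID) for t in tokens]
ids_raw = [vocab.get(t, UNK_ID) for t in tokens]

print(ids_normalized)  # [5, 6, 6, 3]  (b"42" becomes b"00", which is not in vocab)
print(ids_raw)         # [5, 3, 3, 3]

# A bytes pattern cannot be applied to a str token.
try:
    DIGIT_RE.sub(b"0", "1999")
    str_token_rejected = False
except TypeError:
    str_token_rejected = True

print(str_token_rejected)  # True
```

Normalization lets one vocabulary entry per digit count cover all numbers of that length, at the cost of losing the actual value.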

Callers 1

data_to_token_ids · Function · 0.85

Calls 2

basic_tokenizer · Function · 0.85

get · Method · 0.80

Tested by

no test coverage detected
