hub / github.com/tensorlayer/TensorLayer / sentence_to_token_ids

Function sentence_to_token_ids

tensorlayer/nlp.py:1016–1049

Convert a string to a list of integers representing token IDs. For example, the sentence "I have a dog" may be tokenized into ["I", "have", "a", "dog"], and with the vocabulary {"I": 1, "have": 2, "a": 4, "dog": 7} this function returns [1, 2, 4, 7].

(
    sentence, vocabulary, tokenizer=None, normalize_digits=True, UNK_ID=3, _DIGIT_RE=re.compile(br"\d")
)
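The call pattern implied by this signature can be sketched as follows. This is a minimal, self-contained illustration rather than the library's own code: the function body is a local copy of the logic shown in the source listing, `whitespace_tokenizer` is a hypothetical stand-in for `basic_tokenizer`, and the vocabulary is made up. Note that the sentence and the vocabulary keys are bytes here, since the digit normalization substitutes `b"0"` using a bytes pattern.

```python
import re

_DIGIT_RE = re.compile(br"\d")


def whitespace_tokenizer(sentence):
    """Hypothetical stand-in tokenizer: split a bytes sentence on whitespace."""
    return sentence.split()


def sentence_to_token_ids(sentence, vocabulary, tokenizer, normalize_digits=True, UNK_ID=3):
    """Local copy of the listed logic; the tokenizer is required in this sketch."""
    words = tokenizer(sentence)
    if not normalize_digits:
        return [vocabulary.get(w, UNK_ID) for w in words]
    # Replace every digit with 0 before the vocabulary lookup.
    return [vocabulary.get(re.sub(_DIGIT_RE, b"0", w), UNK_ID) for w in words]


# Assumed example vocabulary with bytes keys.
vocabulary = {b"I": 1, b"have": 2, b"a": 4, b"dog": 7}

ids = sentence_to_token_ids(b"I have a dog", vocabulary, tokenizer=whitespace_tokenizer)
print(ids)  # [1, 2, 4, 7]

# b"cat" is not in the vocabulary, so it maps to UNK_ID (3 by default).
unknown = sentence_to_token_ids(b"I have a cat", vocabulary, tokenizer=whitespace_tokenizer)
print(unknown)  # [1, 2, 4, 3]
```

Passing a custom `tokenizer` is optional in the real function; it is mandatory in this sketch only because `basic_tokenizer` is not redefined here.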

Source from the content-addressed store, hash-verified

def sentence_to_token_ids(
    sentence, vocabulary, tokenizer=None, normalize_digits=True, UNK_ID=3, _DIGIT_RE=re.compile(br"\d")
):
    """Convert a string to list of integers representing token-ids.

    For example, a sentence "I have a dog" may become tokenized into
    ["I", "have", "a", "dog"] and with vocabulary {"I": 1, "have": 2,
    "a": 4, "dog": 7} this function will return [1, 2, 4, 7].

    Parameters
    -----------
    sentence : tensorflow.python.platform.gfile.GFile Object
        The sentence in bytes format to convert to token-ids, see ``basic_tokenizer()`` and ``data_to_token_ids()``.
    vocabulary : dictionary
        Mapping tokens to integers.
    tokenizer : function
        A function to use to tokenize each sentence. If None, ``basic_tokenizer`` will be used.
    normalize_digits : boolean
        If true, all digits are replaced by 0.

    Returns
    --------
    list of int
        The token-ids for the sentence.

    """
    if tokenizer:
        words = tokenizer(sentence)
    else:
        words = basic_tokenizer(sentence)
    if not normalize_digits:
        return [vocabulary.get(w, UNK_ID) for w in words]
    # Replace every digit with 0 before looking words up in the vocabulary.
    return [vocabulary.get(re.sub(_DIGIT_RE, b"0", w), UNK_ID) for w in words]
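The effect of `normalize_digits` on the final lookup can be illustrated in isolation. The sketch below reuses only the two expressions from the return statements in the listing; the tokens and the vocabulary are assumptions chosen for the illustration, with the vocabulary imagined as having been built from digit-normalized text.

```python
import re

_DIGIT_RE = re.compile(br"\d")
UNK_ID = 3

# Hypothetical vocabulary built with digits already normalized to 0.
vocabulary = {b"in": 5, b"0000": 6}
tokens = [b"in", b"1999", b"2024", b"42"]

# Each digit is replaced individually, so token length is preserved.
normalized = [re.sub(_DIGIT_RE, b"0", w) for w in tokens]
print(normalized)  # [b'in', b'0000', b'0000', b'00']

# With normalization, both four-digit tokens share one ID; b"00" is unknown.
ids_normalized = [vocabulary.get(w, UNK_ID) for w in normalized]
print(ids_normalized)  # [5, 6, 6, 3]

# Without normalization, every numeric token falls back to UNK_ID.
ids_raw = [vocabulary.get(w, UNK_ID) for w in tokens]
print(ids_raw)  # [5, 3, 3, 3]
```

Because the substitution is per digit, numbers of different lengths still map to different normalized tokens, so the vocabulary needs an entry for each length it should recognize.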

Callers 1

data_to_token_ids · Function · 0.85

Calls 2

basic_tokenizer · Function · 0.85
get · Method · 0.80

Tested by

no test coverage detected
