Convert a string to a list of integers representing token-ids. For example, the sentence "I have a dog" may be tokenized into ["I", "have", "a", "dog"], and with the vocabulary {"I": 1, "have": 2, "a": 4, "dog": 7} this function will return [1, 2, 4, 7].
(
sentence, vocabulary, tokenizer=None, normalize_digits=True, UNK_ID=3, _DIGIT_RE=re.compile(br"\d")
)
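The lookup described above can be sketched in a few self-contained lines. This is a minimal illustration, not the library's code: `whitespace_tokenizer` and `to_token_ids` are hypothetical stand-ins (the real default tokenizer is `basic_tokenizer`, which is not reproduced here), and the vocabulary is made up for the example. Because the sentence is handled as bytes, the sketch assumes the vocabulary keys are bytes as well.

```python
import re

UNK_ID = 3                      # same default as in the signature above
DIGIT_RE = re.compile(br"\d")   # bytes pattern, since sentences are bytes


def whitespace_tokenizer(sentence):
    """Hypothetical stand-in tokenizer: split a bytes sentence on whitespace."""
    return sentence.split()


def to_token_ids(sentence, vocabulary, tokenizer=whitespace_tokenizer,
                 normalize_digits=True):
    """Illustrative re-statement of the lookup: tokenize, optionally
    replace every digit with 0, then map each token to its id,
    falling back to UNK_ID for tokens missing from the vocabulary."""
    words = tokenizer(sentence)
    if normalize_digits:
        words = [DIGIT_RE.sub(b"0", w) for w in words]
    return [vocabulary.get(w, UNK_ID) for w in words]


# Made-up vocabulary with bytes keys; b"00" is the normalized form of any
# two-digit number.
vocab = {b"I": 1, b"have": 2, b"a": 4, b"dog": 7, b"00": 9}

ids_plain = to_token_ids(b"I have a dog", vocab)
ids_unknown = to_token_ids(b"I have a cat", vocab)
ids_digits = to_token_ids(b"I have 42 dog", vocab)
ids_raw_digits = to_token_ids(b"I have 42 dog", vocab, normalize_digits=False)

print(ids_plain)       # [1, 2, 4, 7]
print(ids_unknown)     # [1, 2, 4, 3]  (b"cat" is unknown)
print(ids_digits)      # [1, 2, 9, 7]  (b"42" becomes b"00")
print(ids_raw_digits)  # [1, 2, 3, 7]  (b"42" itself is not in the vocabulary)
```

With digit normalization on, a vocabulary only needs one entry per digit pattern (such as b"00") rather than one per number, which is presumably why it is the default.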
def sentence_to_token_ids(
    sentence, vocabulary, tokenizer=None, normalize_digits=True, UNK_ID=3, _DIGIT_RE=re.compile(br"\d")
):
    """Convert a string to a list of integers representing token-ids.

    For example, the sentence "I have a dog" may be tokenized into
    ["I", "have", "a", "dog"], and with the vocabulary {"I": 1, "have": 2,
    "a": 4, "dog": 7} this function will return [1, 2, 4, 7].

    Parameters
    ----------
    sentence : tensorflow.python.platform.gfile.GFile Object
        The sentence in bytes format to convert to token-ids, see ``basic_tokenizer()`` and ``data_to_token_ids()``.
    vocabulary : dictionary
        Mapping of tokens to integers.
    tokenizer : function
        A function to use to tokenize each sentence. If None, ``basic_tokenizer`` will be used.
    normalize_digits : boolean
        If true, all digits are replaced by 0.
    UNK_ID : int
        The id returned for tokens that are not in the vocabulary.
    _DIGIT_RE : compiled regular expression
        The bytes pattern used to match digits.

    Returns
    -------
    list of int
        The token-ids for the sentence.

    """
    if tokenizer:
        words = tokenizer(sentence)
    else:
        words = basic_tokenizer(sentence)
    if not normalize_digits:
        return [vocabulary.get(w, UNK_ID) for w in words]
    # Replace every digit with 0 before looking words up in the vocabulary.
    return [vocabulary.get(re.sub(_DIGIT_RE, b"0", w), UNK_ID) for w in words]


def data_to_token_ids(