Function data_to_token_ids

tensorlayer/nlp.py:1052–1095

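Tokenize a data file and convert it into token-ids using the given vocabulary file. This function reads data line by line from data_path, calls sentence_to_token_ids on each line, and writes the result to target_path, one line of space-separated ids per input line. If target_path already exists, the function only logs a message and leaves the file untouched. See the comment for sentence_to_token_ids for details of the token-ids format.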

(
    data_path, target_path, vocabulary_path, tokenizer=None, normalize_digits=True, UNK_ID=3,
    _DIGIT_RE=re.compile(br"\d")
)
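
The per-line conversion described above can be illustrated with a minimal sketch. This is not TensorLayer's implementation: `sketch_sentence_to_ids` and the toy `vocab` are hypothetical stand-ins, and the sketch assumes plain whitespace tokenization on `str` input, whereas the real function reads bytes and delegates to `sentence_to_token_ids`. Only the default `UNK_ID=3` and the digit-to-0 normalization are taken from the signature and docstring.

```python
import re

# Hypothetical stand-in for the digit pattern; the real default is a bytes pattern.
_DIGIT_RE = re.compile(r"\d")
UNK_ID = 3  # same default as in the signature above


def sketch_sentence_to_ids(sentence, vocab, normalize_digits=True, unk_id=UNK_ID):
    """Map whitespace-separated words to ids, falling back to unk_id."""
    words = sentence.strip().split()
    if normalize_digits:
        # Replace every digit with "0" before the vocabulary lookup.
        words = [_DIGIT_RE.sub("0", w) for w in words]
    return [vocab.get(w, unk_id) for w in words]


# Toy vocabulary, made up for illustration; id 3 is reserved for unknown words.
vocab = {"_PAD": 0, "_GO": 1, "_EOS": 2, "_UNK": 3, "i": 4, "have": 5, "00": 6, "apples": 7}

ids = sketch_sentence_to_ids("i have 42 apples today", vocab)
line_out = " ".join(str(tok) for tok in ids)
print(line_out)  # prints "4 5 6 7 3"
```

Here "42" becomes "00" before lookup, and "today" is not in the vocabulary, so it maps to the unknown id.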

Source from the content-addressed store, hash-verified

def data_to_token_ids(
    data_path, target_path, vocabulary_path, tokenizer=None, normalize_digits=True, UNK_ID=3,
    _DIGIT_RE=re.compile(br"\d")
):
    """Tokenize data file and turn into token-ids using given vocabulary file.

    This function loads data line-by-line from data_path, calls the above
    sentence_to_token_ids, and saves the result to target_path. See comment
    for sentence_to_token_ids on the details of token-ids format.

    Parameters
    -----------
    data_path : str
        Path to the data file in one-sentence-per-line format.
    target_path : str
        Path where the file with token-ids will be created.
    vocabulary_path : str
        Path to the vocabulary file.
    tokenizer : function
        A function to use to tokenize each sentence. If None, ``basic_tokenizer`` will be used.
    normalize_digits : boolean
        If true, all digits are replaced by 0.

    References
    ----------
    - Code from ``/tensorflow/models/rnn/translation/data_utils.py``

    """
    if not gfile.Exists(target_path):
        tl.logging.info("Tokenizing data in %s" % data_path)
        vocab, _ = initialize_vocabulary(vocabulary_path)
        with gfile.GFile(data_path, mode="rb") as data_file:
            with gfile.GFile(target_path, mode="w") as tokens_file:
                counter = 0
                for line in data_file:
                    counter += 1
                    if counter % 100000 == 0:
                        tl.logging.info(" tokenizing line %d" % counter)
                    token_ids = sentence_to_token_ids(
                        line, vocab, tokenizer, normalize_digits, UNK_ID=UNK_ID, _DIGIT_RE=_DIGIT_RE
                    )
                    tokens_file.write(" ".join([str(tok) for tok in token_ids]) + "\n")
    else:
        tl.logging.info("Target path %s exists" % target_path)
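
One consequence of the `gfile.Exists(target_path)` guard in the source is that a second call with a different vocabulary does not regenerate the output. The sketch below illustrates that behavior with the standard library only. `sketch_data_to_token_ids` is a hypothetical stand-in that takes an in-memory `vocab` dict instead of a vocabulary path and returns a boolean purely for illustration; the real function returns nothing and logs instead.

```python
import os
import tempfile


def sketch_data_to_token_ids(data_path, target_path, vocab, unk_id=3):
    """Hypothetical stand-in: write one line of space-separated ids per input line."""
    if os.path.exists(target_path):
        # Mirrors the guard in the source: an existing target is left untouched.
        return False
    with open(data_path, encoding="utf-8") as data_file, \
            open(target_path, "w", encoding="utf-8") as tokens_file:
        for line in data_file:
            token_ids = [vocab.get(w, unk_id) for w in line.split()]
            tokens_file.write(" ".join(str(t) for t in token_ids) + "\n")
    return True


workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "data.txt")
target_path = os.path.join(workdir, "data.ids")
with open(data_path, "w", encoding="utf-8") as f:
    f.write("hello world\nhello there\n")

first_run = sketch_data_to_token_ids(data_path, target_path, {"hello": 4, "world": 5})

# A changed vocabulary has no effect here because the target already exists.
second_run = sketch_data_to_token_ids(data_path, target_path, {"hello": 9})

with open(target_path, encoding="utf-8") as f:
    written = f.read()
```

After both calls, `written` still holds the ids from the first vocabulary. When the vocabulary or tokenizer changes, deleting the stale target file first is one way to force regeneration.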

Callers

nothing calls this directly

Calls 2

initialize_vocabulary · Function · 0.85
sentence_to_token_ids · Function · 0.85

Tested by

no test coverage detected
