Reads through an analogy question file, return its id format.

Parameters
----------
eval_file : str
    The file name.
word2id : dictionary
    a dictionary that maps word to ID.

Returns
--------
numpy.array
    A ``[n_examples, 4]`` numpy array containing the analogy question's word IDs.
(eval_file='questions-words.txt', word2id=None)
def read_analogies_file(eval_file='questions-words.txt', word2id=None):
    """Reads an analogy question file and returns the questions as word IDs.

    Parameters
    ----------
    eval_file : str
        The file name.
    word2id : dictionary
        A dictionary that maps each word to its ID.

    Returns
    -------
    numpy.array
        A ``[n_examples, 4]`` numpy array containing the analogy questions' word IDs.

    Examples
    --------
    The file should be in this format

    >>> : capital-common-countries
    >>> Athens Greece Baghdad Iraq
    >>> Athens Greece Bangkok Thailand
    >>> Athens Greece Beijing China
    >>> Athens Greece Berlin Germany
    >>> Athens Greece Bern Switzerland
    >>> Athens Greece Cairo Egypt
    >>> Athens Greece Canberra Australia
    >>> Athens Greece Hanoi Vietnam
    >>> Athens Greece Havana Cuba

    Get the tokenized analogy question data

    >>> words = tl.files.load_matt_mahoney_text8_dataset()
    >>> data, count, dictionary, reverse_dictionary = tl.nlp.build_words_dataset(words, vocabulary_size, True)
    >>> analogy_questions = tl.nlp.read_analogies_file(eval_file='questions-words.txt', word2id=dictionary)
    >>> print(analogy_questions)
    [[ 3068 1248 7161 1581]
    [ 3068 1248 28683 5642]
    [ 3068 1248 3878 486]
    ...,
    [ 1216 4309 19982 25506]
    [ 1216 4309 3194 8650]
    [ 1216 4309 140 312]]

    """
    if word2id is None:
        word2id = {}

    questions = []
    questions_skipped = 0

    with open(eval_file, "rb") as analogy_f:
        for line in analogy_f:
            if line.startswith(b":"):  # Skip section headers such as ": capital-common-countries".
                continue
            words = line.strip().lower().split(b" ")  # Lowercase, then split on single spaces.
            ids = [word2id.get(w.strip().decode()) for w in words]
            if None in ids or len(ids) != 4:
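The listing above stops at the `if None in ids or len(ids) != 4:` check, so the rest of the loop is not shown. The following is a minimal, standard-library sketch of how such a parser could finish, assuming that a failing question is counted and dropped and a passing one is kept. The name `read_analogies_sketch`, the tuple return value, the toy vocabulary, and the temporary file are all hypothetical and exist only for illustration; this is not the library's code. It returns plain lists, which would need converting to get the documented `[n_examples, 4]` numpy array.

```python
import os
import tempfile


def read_analogies_sketch(eval_file, word2id):
    """Hypothetical completion of the truncated loop (not the library's code)."""
    questions = []
    questions_skipped = 0
    with open(eval_file, "rb") as analogy_f:
        for line in analogy_f:
            if line.startswith(b":"):  # section header line
                continue
            words = line.strip().lower().split(b" ")
            ids = [word2id.get(w.strip().decode()) for w in words]
            if None in ids or len(ids) != 4:
                # Assumed behaviour: count and drop questions that contain
                # an out-of-vocabulary word or do not have exactly 4 words.
                questions_skipped += 1
            else:
                questions.append(ids)
    return questions, questions_skipped


# Toy input: one header, two answerable questions, one with unknown words.
sample = (
    ": capital-common-countries\n"
    "Athens Greece Baghdad Iraq\n"
    "Athens Greece Bangkok Thailand\n"
    "Athens Greece Atlantis Nowhere\n"
)
toy_vocab = {"athens": 0, "greece": 1, "baghdad": 2,
             "iraq": 3, "bangkok": 4, "thailand": 5}

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
path = tmp.name
try:
    rows, skipped = read_analogies_sketch(path, toy_vocab)
finally:
    os.remove(path)

print(rows, skipped)
```

Because each line is lowercased before lookup, the vocabulary keys must be lowercase too. A question containing any word missing from `word2id` is dropped as a whole in this sketch.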