Build a Huffman tree for the tokens in `text` and compute each token's binary encoding. Notes ----- In a Huffman code, tokens that occur more frequently are (generally) represented using fewer bits. Huffman codes produce the minimum expected
(self, text)
| 430 | |
| 431 | class HuffmanEncoder(object): |
| 432 | def fit(self, text): |
| 433 | """ |
| 434 | Build a Huffman tree for the tokens in `text` and compute each token's |
| 435 | binary encoding. |
| 436 | |
| 437 | Notes |
| 438 | ----- |
| 439 | In a Huffman code, tokens that occur more frequently are (generally) |
| 440 | represented using fewer bits. Huffman codes produce the minimum expected |
| 441 | codeword length among all methods for encoding tokens individually. |
| 442 | |
| 443 | Huffman codes correspond to paths through a binary tree, with 1 |
| 444 | corresponding to "move right" and 0 corresponding to "move left". In |
| 445 | contrast to standard binary trees, the Huffman tree is constructed from the |
| 446 | bottom up. Construction begins by initializing a min-heap priority queue |
| 447 | consisting of each token in the corpus, with priority corresponding to the |
| 448 | token frequency. At each step, the two most infrequent tokens in the corpus |
| 449 | are removed and become the children of a parent pseudotoken whose |
| 450 | "frequency" is the sum of the frequencies of its children. This new parent |
| 451 | pseudotoken is added to the priority queue and the process is repeated |
| 452 | recursively until no tokens remain. |
| 453 | |
| 454 | Parameters |
| 455 | ---------- |
| 456 | text: list of strs or :class:`Vocabulary` instance |
| 457 | The tokenized text or a pretrained :class:`Vocabulary` object to use for |
| 458 | building the Huffman code. |
| 459 | """ |
| 460 | self._build_tree(text) |
| 461 | self._generate_codes() |
| 462 | |
| 463 | def transform(self, text): |
| 464 | """ |