This function is adapted from dolma: https://github.com/allenai/dolma

Split a string into paragraphs. A paragraph is defined as a sequence of zero or more characters followed by a newline character, or a sequence of one or more characters followed by the end of the string.
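To make the paragraph definition concrete, here is a minimal, self-contained sketch of the same regex-based split applied to a short string. The `TextSlice` named tuple, its field names, the `demo_split` helper, and the `str.lower` normalizer are all assumptions made for illustration; the real `TextSlice` is defined elsewhere and may differ.

```python
import re
from typing import Callable, NamedTuple, Tuple


class TextSlice(NamedTuple):
    # Hypothetical stand-in for the real TextSlice, which is not shown here.
    text: str
    start: int
    end: int


# Either "zero or more non-newline characters plus a newline",
# or "one or more non-newline characters up to the end of the string".
PARAGRAPH_RE = re.compile(r"([^\n]*\n|[^\n]+$)")


def demo_split(
    text: str,
    normalizer: Callable[[str], str] = str.lower,  # assumed normalizer
    remove_empty: bool = True,
) -> Tuple[TextSlice, ...]:
    # Offsets refer to the original text; only the stored text is normalized.
    slices = tuple(
        TextSlice(normalizer(m.group(0)), m.start(), m.end())
        for m in PARAGRAPH_RE.finditer(text)
    )
    if remove_empty:
        # Drop paragraphs that contain only whitespace.
        slices = tuple(s for s in slices if s.text.strip())
    return slices


sample = "First line\n\nLast line"
result = demo_split(sample)
for s in result:
    print(repr(s.text), s.start, s.end)
# 'first line\n' 0 11
# 'last line' 12 21
```

The blank middle line is matched as its own paragraph (`"\n"` at offsets 11 to 12) and is then dropped because `remove_empty` is true. Each paragraph keeps its trailing newline, and the offsets always index into the original, un-normalized string.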
def split_paragraphs(
    text: str, normalizer: Callable[[str], str], remove_empty: bool = True
) -> Tuple[TextSlice, ...]:
    """
    This function is adapted from dolma: https://github.com/allenai/dolma

    Split a string into paragraphs. A paragraph is defined as a sequence of
    zero or more characters, followed by a newline character, or a sequence
    of one or more characters, followed by the end of the string.
    """
    # Each match is one paragraph; start/end are offsets into the original
    # text, while the stored text is passed through the normalizer.
    text_slices = tuple(
        TextSlice(normalizer(text[match.start():match.end()]), match.start(),
                  match.end())
        for match in re.finditer(r"([^\n]*\n|[^\n]+$)", text)
    )

    if remove_empty:
        # Drop paragraphs whose normalized text is empty or whitespace-only.
        text_slices = tuple(
            text_slice for text_slice in text_slices if text_slice[0].strip()
        )

    return text_slices


class Document: