hub / github.com/togethercomputer/RedPajama-Data / split_paragraphs

Function split_paragraphs

app/src/core/document.py:16–37

This function is adapted from dolma: https://github.com/allenai/dolma

Split a string into paragraphs. A paragraph is defined as a sequence of zero or more characters, followed by a newline character, or a sequence of one or more characters, followed by the end of the string.

(text: str, normalizer: Callable[[str], str], remove_empty: bool = True)

Source from the content-addressed store, hash-verified

def split_paragraphs(
        text: str, normalizer: Callable[[str], str], remove_empty: bool = True
) -> Tuple[TextSlice]:
    """
    This function is adapted from dolma: https://github.com/allenai/dolma

    Split a string into paragraphs. A paragraph is defined as a sequence of
    zero or more characters, followed by a newline character, or a sequence
    of one or more characters, followed by the end of the string.
    """
    text_slices = tuple(
        TextSlice(normalizer(text[match.start():match.end()]), match.start(),
                  match.end())
        for match in re.finditer(r"([^\n]*\n|[^\n]+$)", text)
    )

    if remove_empty is True:
        text_slices = tuple(
            text_slice for text_slice in text_slices if text_slice[0].strip()
        )

    return text_slices
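The splitting behaviour can be tried in isolation with the minimal sketch below. It mirrors the logic of split_paragraphs, but the TextSlice defined here is a hypothetical stand-in (a NamedTuple, chosen so that the `[0]` indexing in the filter works); the repository's real TextSlice class is not shown on this page and may differ. The return annotation is written as `Tuple[TextSlice, ...]` for the stand-in only.

```python
import re
from typing import Callable, NamedTuple, Tuple


class TextSlice(NamedTuple):
    """Hypothetical stand-in for the repository's TextSlice class."""
    text: str
    start: int
    end: int


def split_paragraphs(
        text: str, normalizer: Callable[[str], str], remove_empty: bool = True
) -> Tuple[TextSlice, ...]:
    # Same regex as the source: either a (possibly empty) line that ends in
    # "\n", or a non-empty run of characters that reaches the end of the text.
    text_slices = tuple(
        TextSlice(normalizer(text[m.start():m.end()]), m.start(), m.end())
        for m in re.finditer(r"([^\n]*\n|[^\n]+$)", text)
    )
    if remove_empty:
        # Drop slices whose normalized text is only whitespace.
        text_slices = tuple(s for s in text_slices if s[0].strip())
    return text_slices


sample = "First line\n\nSecond line"

# Default behaviour: the blank line between the two paragraphs is dropped.
kept = split_paragraphs(sample, normalizer=str.lower)

# With remove_empty=False the blank line is kept as its own slice.
all_slices = split_paragraphs(sample, normalizer=str.lower, remove_empty=False)

for s in all_slices:
    print(repr(s.text), s.start, s.end)
```

Note that each slice stores the normalized text, while start and end are offsets into the original, unnormalized string, so `sample[s.start:s.end]` recovers the raw paragraph.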

Callers 1

__init__ · Method · 0.85

Calls 1

TextSlice · Class · 0.90

Tested by

no test coverage detected