hub / github.com/togethercomputer/RedPajama-Data / split_paragraphs

Function split_paragraphs

app/src/core/document.py:16–37

This function is adapted from dolma: https://github.com/allenai/dolma

Split a string into paragraphs. A paragraph is defined as a sequence of zero or more characters, followed by a newline character, or a sequence of one or more characters, followed by the end of the string.

(text: str, normalizer: Callable[[str], str], remove_empty: bool = True)
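The paragraph definition above can be checked with a small sketch. It assumes the intended pattern is `([^\n]*\n|[^\n]+$)`, with a plain `|` alternation. The helper name `paragraph_spans` and the sample string are illustrative and not part of the repository.

```python
import re

# Assumed pattern: either a (possibly empty) line terminated by "\n",
# or a final non-empty run of characters that reaches the end of the string.
PARAGRAPH_RE = re.compile(r"([^\n]*\n|[^\n]+$)")


def paragraph_spans(text: str):
    """Return (start, end, raw_text) for every paragraph match (hypothetical helper)."""
    return [(m.start(), m.end(), m.group(0)) for m in PARAGRAPH_RE.finditer(text)]


sample = "first\n\nlast"
spans = paragraph_spans(sample)
# A blank line is its own paragraph ("\n"), and the unterminated tail
# "last" is captured by the second alternative.
for start, end, raw in spans:
    print(start, end, repr(raw))
```

Under this pattern, a string that already ends in a newline produces no extra empty paragraph at the end, and an empty string produces no paragraphs at all.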

Source from the content-addressed store, hash-verified

def split_paragraphs(
        text: str, normalizer: Callable[[str], str], remove_empty: bool = True
) -> Tuple[TextSlice]:
    """
    This function is adapted from dolma: https://github.com/allenai/dolma

    Split a string into paragraphs. A paragraph is defined as a sequence of
    zero or more characters, followed by a newline character, or a sequence
    of one or more characters, followed by the end of the string.
    """
    text_slices = tuple(
        TextSlice(normalizer(text[match.start():match.end()]), match.start(),
                  match.end())
        for match in re.finditer(r"([^\n]*\n|[^\n]+$)", text)
    )

    if remove_empty is True:
        text_slices = tuple(
            text_slice for text_slice in text_slices if text_slice[0].strip()
        )

    return text_slices
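To see the function end to end, here is a minimal, self-contained sketch. It assumes `TextSlice` is a tuple-like record of `(text, start, end)`, which matches how the listing constructs and indexes it. The real class is defined elsewhere in the repository and may differ. The `str.lower` normalizer and the sample text are illustrative choices only.

```python
import re
from typing import Callable, NamedTuple, Tuple


class TextSlice(NamedTuple):
    # Hypothetical stand-in for the repository's TextSlice class.
    text: str   # normalized paragraph text
    start: int  # start offset in the original string
    end: int    # end offset (exclusive) in the original string


def split_paragraphs(
    text: str, normalizer: Callable[[str], str], remove_empty: bool = True
) -> Tuple[TextSlice, ...]:
    # One slice per regex match; offsets always refer to the un-normalized input.
    text_slices = tuple(
        TextSlice(normalizer(text[m.start():m.end()]), m.start(), m.end())
        for m in re.finditer(r"([^\n]*\n|[^\n]+$)", text)
    )
    if remove_empty is True:
        # Drop slices whose normalized text is empty or whitespace-only.
        text_slices = tuple(s for s in text_slices if s[0].strip())
    return text_slices


doc = "Hello World\n\n  \nBye"
kept = split_paragraphs(doc, normalizer=str.lower)
everything = split_paragraphs(doc, normalizer=str.lower, remove_empty=False)

for s in kept:
    # The stored offsets can be used to recover the original, un-normalized text.
    print(repr(s.text), s.start, s.end, repr(doc[s.start:s.end]))
```

Note that `remove_empty` filters on the normalized text, so a normalizer that strips content can cause a paragraph to be dropped even when the raw paragraph was not blank.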

Callers 1

__init__ · Method · 0.85

Calls 1

TextSlice · Class · 0.90

Tested by

no test coverage detected