
Method add_context_window

langroid/vector_store/base.py:274–344

In each doc's metadata, there may be a window_ids field indicating the ids of the chunks around the current chunk. These window_ids may overlap, so we
- coalesce each overlapping group into a single window (maintaining ordering),
- create a new document for each part, preserving metadata.
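The coalescing step described above can be sketched in isolation. This is a minimal illustration, not langroid's `remove_overlaps`: the function name `coalesce_windows` is hypothetical, and the sketch assumes that ids are integers whose numeric order matches their order in the original text, so a merged window can simply be sorted.

```python
from typing import List, Set


def coalesce_windows(windows: List[List[int]]) -> List[List[int]]:
    """Merge windows that share at least one id.

    Assumption for this sketch only: ids are integers whose numeric order
    matches text order, so sorting a merged group restores the ordering.
    """
    merged: List[Set[int]] = []
    for window in windows:
        current = set(window)
        remaining: List[Set[int]] = []
        for group in merged:
            if group & current:
                # This group overlaps the new window: absorb it.
                current |= group
            else:
                remaining.append(group)
        remaining.append(current)
        merged = remaining
    return [sorted(group) for group in merged]


print(coalesce_windows([[1, 2, 3], [3, 4, 5], [8, 9]]))
```

Windows that overlap only transitively (for example `[1, 2]` and `[3, 4]` linked by `[2, 3]`) also end up in one group, because each new window absorbs every existing group it touches.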

(self, docs_scores: List[Tuple[Document, float]], neighbors: int = 0)

Source from the content-addressed store, hash-verified

    def add_context_window(
        self, docs_scores: List[Tuple[Document, float]], neighbors: int = 0
    ) -> List[Tuple[Document, float]]:
        """
        In each doc's metadata, there may be a window_ids field indicating
        the ids of the chunks around the current chunk.
        These window_ids may overlap, so we
        - coalesce each overlapping group into a single window (maintaining ordering),
        - create a new document for each part, preserving metadata.

        We may have stored a longer set of window_ids than we need during chunking.
        Now, we just want `neighbors` on each side of the center of the window_ids list.

        Args:
            docs_scores (List[Tuple[Document, float]]): List of pairs of documents
                to add context windows to together with their match scores.
            neighbors (int, optional): Number of neighbors on "each side" of match to
                retrieve. Defaults to 0.
                "Each side" here means before and after the match,
                in the original text.

        Returns:
            List[Tuple[Document, float]]: List of (Document, score) tuples.
        """
        # We return a larger context around each match, i.e.
        # a window of `neighbors` on each side of the match.
        docs = [d for d, s in docs_scores]
        scores = [s for d, s in docs_scores]
        if neighbors == 0:
            return docs_scores
        doc_chunks = [d for d in docs if d.metadata.is_chunk]
        if len(doc_chunks) == 0:
            return docs_scores
        window_ids_list = []
        id2metadata = {}
        # id -> highest score of a doc it appears in
        id2max_score: Dict[int | str, float] = {}
        for i, d in enumerate(docs):
            window_ids = d.metadata.window_ids
            if len(window_ids) == 0:
                window_ids = [d.id()]
            id2metadata.update({id: d.metadata for id in window_ids})

            id2max_score.update(
                {id: max(id2max_score.get(id, 0), scores[i]) for id in window_ids}
            )
            n = len(window_ids)
            chunk_idx = window_ids.index(d.id())
            neighbor_ids = window_ids[
                max(0, chunk_idx - neighbors) : min(n, chunk_idx + neighbors + 1)
            ]
            window_ids_list += [neighbor_ids]

        # window_ids could be from different docs,
        # and they may overlap, so we coalesce overlapping groups into
        # separate windows.
        window_ids_list = self.remove_overlaps(window_ids_list)
        final_docs = []
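The per-document loop in the excerpt does two things: it records the highest match score seen for every id in a stored window, and it trims that window to `neighbors` ids on each side of the matched chunk. The following standalone sketch mirrors that logic on plain tuples. The function name `trim_windows` and the `(chunk id, window_ids, score)` input shape are hypothetical stand-ins for langroid's `Document` objects, used here only for illustration.

```python
from typing import Dict, List, Tuple


def trim_windows(
    matches: List[Tuple[str, List[str], float]], neighbors: int
) -> Tuple[List[List[str]], Dict[str, float]]:
    """Trim each stored window to `neighbors` ids on each side of its match.

    `matches` is a hypothetical stand-in for retrieved documents:
    (chunk id, stored window_ids, match score).
    """
    trimmed: List[List[str]] = []
    id2max_score: Dict[str, float] = {}
    for chunk_id, window_ids, score in matches:
        if not window_ids:
            # A chunk with no stored window acts as its own window.
            window_ids = [chunk_id]
        # Track the best score over the full stored window, as the excerpt does.
        for wid in window_ids:
            id2max_score[wid] = max(id2max_score.get(wid, 0), score)
        idx = window_ids.index(chunk_id)
        lo = max(0, idx - neighbors)
        hi = min(len(window_ids), idx + neighbors + 1)
        trimmed.append(window_ids[lo:hi])
    return trimmed, id2max_score


windows, best = trim_windows(
    [
        ("c", ["a", "b", "c", "d", "e"], 0.9),
        ("d", ["b", "c", "d", "e", "f"], 0.7),
        ("a", ["a", "b", "c"], 0.4),
    ],
    neighbors=1,
)
print(windows)
print(best)
```

Two details carry over from the excerpt: the slice bounds are clamped, so a match at the start or end of its stored window yields a shorter window rather than an error, and the default of `0` in the score lookup means an id's recorded score never drops below zero.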

Callers 2

test_vector_stores_context_window · Function · 0.45

test_vector_stores_overlapping_matches · Function · 0.45

Calls 8

remove_overlaps · Method · 0.95

get_documents_by_ids · Method · 0.95

Document · Class · 0.90

id · Method · 0.80

get · Method · 0.80

deepcopy · Method · 0.80

new_id · Method · 0.80

update · Method · 0.45

Tested by 2

test_vector_stores_context_window · Function · 0.36

test_vector_stores_overlapping_matches · Function · 0.36