hub / github.com/langroid/langroid / add_context_window

Method add_context_window

langroid/vector_store/base.py:274–344

Each doc's metadata may contain a window_ids field listing the ids of the chunks around the current chunk. These windows may overlap, so the method coalesces each overlapping group into a single window (maintaining ordering) and creates a new document for each part, preserving metadata.
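The coalescing step described above can be sketched as follows. This is a minimal illustration, not langroid's `remove_overlaps` implementation: `coalesce_windows` is a hypothetical helper, and it assumes chunk ids are integers that increase in document order, so sorting each merged group restores the original ordering.

```python
from typing import List, Set


def coalesce_windows(windows: List[List[int]]) -> List[List[int]]:
    """Merge windows that share at least one id (hypothetical helper).

    Assumption: ids are integers that increase in document order.
    """
    merged: List[Set[int]] = []
    for window in windows:
        if not window:
            continue  # nothing to merge for an empty window
        current = set(window)
        disjoint: List[Set[int]] = []
        for group in merged:
            if group & current:
                # Shares an id with the current window: absorb the group.
                current |= group
            else:
                disjoint.append(group)
        merged = disjoint + [current]
    # Sort ids inside each group, then order groups by their first id.
    return sorted((sorted(group) for group in merged), key=lambda ids: ids[0])


print(coalesce_windows([[1, 2, 3], [3, 4, 5], [8, 9]]))
# [[1, 2, 3, 4, 5], [8, 9]]
```

If ids are not ordered (for example, opaque strings), the ordering has to be recovered from the windows themselves rather than by sorting.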

(self, docs_scores: List[Tuple[Document, float]], neighbors: int = 0)
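To illustrate what `neighbors` controls, here is a small standalone sketch of the trimming step, mirroring the slice used in the source listing. The function name `trim_window` and the `ChunkId` alias are assumptions made for this example and are not part of langroid.

```python
from typing import List, Sequence, Union

ChunkId = Union[int, str]  # hypothetical alias for this sketch


def trim_window(
    window_ids: Sequence[ChunkId], center_id: ChunkId, neighbors: int
) -> List[ChunkId]:
    """Keep at most `neighbors` ids on each side of `center_id`."""
    ids = list(window_ids)
    idx = ids.index(center_id)  # raises ValueError if center_id is absent
    start = max(0, idx - neighbors)  # clamp at the start of the window
    end = min(len(ids), idx + neighbors + 1)  # clamp at the end
    return ids[start:end]


print(trim_window(["a", "b", "c", "d", "e"], "c", 1))  # ['b', 'c', 'd']
print(trim_window(["a", "b", "c", "d", "e"], "a", 2))  # ['a', 'b', 'c']
```

When the match sits near the edge of the stored window, the result is simply shorter on that side; it is not padded from the other side.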

Source from the content-addressed store, hash-verified

        pass

    def add_context_window(
        self, docs_scores: List[Tuple[Document, float]], neighbors: int = 0
    ) -> List[Tuple[Document, float]]:
        """
        In each doc's metadata, there may be a window_ids field indicating
        the ids of the chunks around the current chunk.
        These window_ids may overlap, so we
        - coalesce each overlapping groups into a single window (maintaining ordering),
        - create a new document for each part, preserving metadata,

        We may have stored a longer set of window_ids than we need during chunking.
        Now, we just want `neighbors` on each side of the center of the window_ids list.

        Args:
            docs_scores (List[Tuple[Document, float]]): List of pairs of documents
                to add context windows to together with their match scores.
            neighbors (int, optional): Number of neighbors on "each side" of match to
                retrieve. Defaults to 0.
                "Each side" here means before and after the match,
                in the original text.

        Returns:
            List[Tuple[Document, float]]: List of (Document, score) tuples.
        """
        # We return a larger context around each match, i.e.
        # a window of `neighbors` on each side of the match.
        docs = [d for d, s in docs_scores]
        scores = [s for d, s in docs_scores]
        if neighbors == 0:
            return docs_scores
        doc_chunks = [d for d in docs if d.metadata.is_chunk]
        if len(doc_chunks) == 0:
            return docs_scores
        window_ids_list = []
        id2metadata = {}
        # id -> highest score of a doc it appears in
        id2max_score: Dict[int | str, float] = {}
        for i, d in enumerate(docs):
            window_ids = d.metadata.window_ids
            if len(window_ids) == 0:
                window_ids = [d.id()]
            id2metadata.update({id: d.metadata for id in window_ids})

            id2max_score.update(
                {id: max(id2max_score.get(id, 0), scores[i]) for id in window_ids}
            )
            n = len(window_ids)
            chunk_idx = window_ids.index(d.id())
            neighbor_ids = window_ids[
                max(0, chunk_idx - neighbors) : min(n, chunk_idx + neighbors + 1)
            ]
            window_ids_list += [neighbor_ids]

        # window_ids could be from different docs,
        # and they may overlap, so we coalesce overlapping groups into
        # separate windows.
        window_ids_list = self.remove_overlaps(window_ids_list)
        final_docs = []
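
The `id2max_score` bookkeeping in the loop above can be shown in isolation. The sketch below is an illustration only: `max_score_per_id` and the `ChunkId` alias are hypothetical names, and the input is simplified to (window ids, score) pairs instead of `Document` objects.

```python
from typing import Dict, List, Tuple, Union

ChunkId = Union[int, str]  # hypothetical alias for this sketch


def max_score_per_id(
    windows_scores: List[Tuple[List[ChunkId], float]],
) -> Dict[ChunkId, float]:
    """Map each chunk id to the highest score of any window containing it."""
    best: Dict[ChunkId, float] = {}
    for window_ids, score in windows_scores:
        for chunk_id in window_ids:
            # Same default of 0 as in the listing above.
            best[chunk_id] = max(best.get(chunk_id, 0), score)
    return best


print(max_score_per_id([(["a", "b"], 0.4), (["b", "c"], 0.9)]))
# {'a': 0.4, 'b': 0.9, 'c': 0.9}
```

Because the default is 0, a negative score is recorded as 0 for an id seen for the first time. That is harmless when scores are non-negative similarities, but worth keeping in mind for score types that can go below zero.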

Calls 8

remove_overlaps · Method · 0.95
get_documents_by_ids · Method · 0.95
Document · Class · 0.90
id · Method · 0.80
get · Method · 0.80
deepcopy · Method · 0.80
new_id · Method · 0.80
update · Method · 0.45