```python
        pass

    def add_context_window(
        self, docs_scores: List[Tuple[Document, float]], neighbors: int = 0
    ) -> List[Tuple[Document, float]]:
        """
        In each doc's metadata, there may be a window_ids field indicating
        the ids of the chunks around the current chunk.
        These window_ids may overlap, so we
        - coalesce each overlapping group into a single window (maintaining ordering),
        - create a new document for each part, preserving metadata.

        We may have stored a longer set of window_ids than we need during chunking.
        Now, we just want `neighbors` on each side of the center of the window_ids list.

        Args:
            docs_scores (List[Tuple[Document, float]]): List of pairs of documents
                to add context windows to, together with their match scores.
            neighbors (int, optional): Number of neighbors on "each side" of match to
                retrieve. Defaults to 0.
                "Each side" here means before and after the match,
                in the original text.

        Returns:
            List[Tuple[Document, float]]: List of (Document, score) tuples.
        """
        # We return a larger context around each match, i.e.
        # a window of `neighbors` on each side of the match.
        docs = [d for d, _ in docs_scores]
        scores = [s for _, s in docs_scores]
        if neighbors == 0:
            return docs_scores
        doc_chunks = [d for d in docs if d.metadata.is_chunk]
        if len(doc_chunks) == 0:
            return docs_scores
        window_ids_list = []
        id2metadata = {}
        # id -> highest score of a doc it appears in
        id2max_score: Dict[int | str, float] = {}
        for i, d in enumerate(docs):
            window_ids = d.metadata.window_ids
            # A doc with no stored window is treated as a one-chunk window.
            if len(window_ids) == 0:
                window_ids = [d.id()]
            id2metadata.update({id: d.metadata for id in window_ids})

            id2max_score.update(
                {id: max(id2max_score.get(id, 0), scores[i]) for id in window_ids}
            )
            # Keep at most `neighbors` ids on each side of the matched chunk.
            n = len(window_ids)
            chunk_idx = window_ids.index(d.id())
            neighbor_ids = window_ids[
                max(0, chunk_idx - neighbors) : min(n, chunk_idx + neighbors + 1)
            ]
            window_ids_list += [neighbor_ids]

        # window_ids could be from different docs,
        # and they may overlap, so we coalesce overlapping groups into
        # separate windows.
        window_ids_list = self.remove_overlaps(window_ids_list)
        final_docs = []
```
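
The code above trims each stored `window_ids` list to `neighbors` ids on either side of the match and then hands the result to `self.remove_overlaps`, whose body is not shown here. As a rough illustration of those two steps, here is a minimal standalone sketch. The helper names `trim_window`, `merge_ordered`, and `coalesce_windows` are hypothetical and are not part of the original class. The sketch also assumes that overlapping windows are contiguous slices of the same chunk sequence, which the actual implementation may or may not rely on.

```python
from typing import List, Sequence


def trim_window(window_ids: Sequence[str], center_id: str, neighbors: int) -> List[str]:
    """Keep at most `neighbors` ids on each side of `center_id`."""
    idx = list(window_ids).index(center_id)
    # Slicing past the end is safe in Python, so only the start needs clamping.
    return list(window_ids[max(0, idx - neighbors) : idx + neighbors + 1])


def merge_ordered(a: List[str], b: List[str]) -> List[str]:
    """Merge two overlapping id lists, preserving sequence order.

    Assumes `a` and `b` are contiguous slices of the same underlying
    sequence and share at least one id (an assumption of this sketch).
    """
    in_a = set(a)
    # Position in `b` of the first id that also occurs in `a`.
    cut = next(i for i, x in enumerate(b) if x in in_a)
    head = b[:cut]  # ids that precede everything in `a`
    tail = [x for x in b[cut:] if x not in in_a]  # ids that follow `a`
    return head + a + tail


def coalesce_windows(windows: List[List[str]]) -> List[List[str]]:
    """Hypothetical stand-in for `remove_overlaps`: merge windows sharing ids.

    The returned groups are pairwise disjoint; the order of the groups
    themselves is not guaranteed to follow the input order.
    """
    groups: List[List[str]] = []
    for window in windows:
        merged = list(window)
        untouched: List[List[str]] = []
        for group in groups:
            if set(group) & set(merged):
                merged = merge_ordered(merged, group)
            else:
                untouched.append(group)
        groups = untouched + [merged]
    return groups


if __name__ == "__main__":
    stored = ["c1", "c2", "c3", "c4", "c5"]
    print(trim_window(stored, "c3", 1))
    print(coalesce_windows([["c1", "c2", "c3"], ["c7", "c8"], ["c3", "c4"]]))
```

Because the existing groups are kept pairwise disjoint, a single pass over them is enough for each new window: a group that does not overlap the incoming window cannot start overlapping it after other groups have been merged in.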