class PlasmateLoader(BaseLoader):
    """Fetches pages using Plasmate, a lightweight Rust browser engine that outputs
    Structured Object Model (SOM) instead of raw HTML.

    Advantages over ChromiumLoader for static / server-rendered pages:
    - No Chrome/Playwright required: single binary, installs via pip
    - ~64 MB RAM per session vs ~300 MB for Chromium
    - 10-100x fewer tokens per page (SOM strips nav, ads, boilerplate)
    - Drops into existing ScrapeGraphAI workflows with minimal config changes

    For SPAs or pages that require JavaScript rendering, set ``fallback_to_chrome=True``
    to automatically retry with ChromiumLoader on empty or error responses.

    Attributes:
        urls: List of URLs to fetch.
        output_format: Plasmate output format: ``"text"`` (default, most compatible),
            ``"som"`` (full JSON), ``"markdown"``, or ``"links"``.
        timeout: Per-request timeout in seconds. Defaults to 30.
        selector: Optional ARIA role or CSS id selector to scope extraction
            (e.g. ``"main"`` or ``"#content"``).
        extra_headers: Optional dict of HTTP headers to pass to each request.
        fallback_to_chrome: If True, retry with ChromiumLoader when Plasmate
            returns empty content (useful for JS-heavy SPAs). Defaults to False.
        chrome_kwargs: Extra kwargs forwarded to ChromiumLoader when fallback is used.

    Example::

        from scrapegraphai.docloaders import PlasmateLoader

        loader = PlasmateLoader(
            urls=["https://docs.python.org/3/library/json.html"],
            output_format="text",
            timeout=30,
        )
        docs = loader.load()
        print(docs[0].page_content[:500])
    """

    def __init__(
        self,
        urls: List[str],
        *,
        output_format: str = "text",
        timeout: int = 30,
        selector: Optional[str] = None,
        extra_headers: Optional[dict] = None,
        fallback_to_chrome: bool = False,
        **chrome_kwargs,
    ):
        # Reject unsupported formats early, before any page is fetched.
        if output_format not in ("som", "text", "markdown", "links"):
            raise ValueError(
                f"output_format must be one of 'som', 'text', 'markdown', 'links'; got {output_format!r}"
            )
        self.urls = urls
        self.output_format = output_format
        self.timeout = timeout
        self.selector = selector
        # Normalise None to an empty dict so later code can iterate safely.
        self.extra_headers = extra_headers or {}
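The fallback behaviour described in the docstring (retry with a heavier loader when the lightweight one returns nothing) can be sketched independently of either library. This is a minimal illustration, not the loader's actual implementation: `fetch_with_fallback`, `fake_plasmate`, and `fake_chrome` are hypothetical names, and the two stub fetchers simply stand in for Plasmate and ChromiumLoader. Only the format check is taken from the `__init__` shown above.

```python
from typing import Callable, Optional

# Same set of formats that PlasmateLoader.__init__ accepts.
VALID_FORMATS = ("som", "text", "markdown", "links")


def validate_output_format(output_format: str) -> str:
    """Mirror the constructor's check and return the validated value."""
    if output_format not in VALID_FORMATS:
        raise ValueError(
            f"output_format must be one of 'som', 'text', 'markdown', 'links'; got {output_format!r}"
        )
    return output_format


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Hypothetical helper: try `primary`, use `fallback` on error or empty content."""
    try:
        content = primary(url)
    except Exception:
        if fallback is None:
            raise  # nothing to fall back to, so surface the original error
        content = ""
    if not content.strip() and fallback is not None:
        return fallback(url)
    return content


# Stub fetchers standing in for Plasmate and ChromiumLoader (assumptions).
def fake_plasmate(url: str) -> str:
    return "" if "spa" in url else "static text"


def fake_chrome(url: str) -> str:
    return "rendered text"


if __name__ == "__main__":
    print(fetch_with_fallback("https://example.com/static", fake_plasmate, fake_chrome))
    print(fetch_with_fallback("https://example.com/spa", fake_plasmate, fake_chrome))
```

Passing the fallback as an optional callable keeps the default path free of any Chromium dependency, which matches the `fallback_to_chrome=False` default above.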