hub / github.com/ScrapeGraphAI/Scrapegraph-ai / PlasmateLoader

Class PlasmateLoader

scrapegraphai/docloaders/plasmate.py:46–203



class PlasmateLoader(BaseLoader):
    """Fetches pages using Plasmate, a lightweight Rust browser engine that outputs
    Structured Object Model (SOM) instead of raw HTML.

    Advantages over ChromiumLoader for static / server-rendered pages:
    - No Chrome/Playwright required: single binary, installs via pip
    - ~64MB RAM per session vs ~300MB for Chromium
    - 10-100x fewer tokens per page (SOM strips nav, ads, boilerplate)
    - Drops into existing ScrapeGraphAI workflows with minimal config changes

    For SPAs or pages that require JavaScript rendering, set ``fallback_to_chrome=True``
    to automatically retry with ChromiumLoader on empty or error responses.

    Attributes:
        urls: List of URLs to fetch.
        output_format: Plasmate output format: ``"text"`` (default, most compatible),
            ``"som"`` (full JSON), ``"markdown"``, or ``"links"``. Any other value
            raises ``ValueError``.
        timeout: Per-request timeout in seconds. Defaults to 30.
        selector: Optional ARIA role or CSS id selector to scope extraction
            (e.g. ``"main"`` or ``"#content"``).
        extra_headers: Optional dict of HTTP headers to pass to each request.
        fallback_to_chrome: If True, retry with ChromiumLoader when Plasmate
            returns empty content (useful for JS-heavy SPAs). Defaults to False.
        chrome_kwargs: Extra kwargs forwarded to ChromiumLoader when fallback is used.

    Example::

        from scrapegraphai.docloaders import PlasmateLoader

        loader = PlasmateLoader(
            urls=["https://docs.python.org/3/library/json.html"],
            output_format="text",
            timeout=30,
        )
        docs = loader.load()
        print(docs[0].page_content[:500])
    """

    def __init__(
        self,
        urls: List[str],
        *,  # all remaining parameters are keyword-only
        output_format: str = "text",
        timeout: int = 30,
        selector: Optional[str] = None,
        extra_headers: Optional[dict] = None,
        fallback_to_chrome: bool = False,
        **chrome_kwargs,
    ):
        # Reject unsupported formats before any request is made.
        if output_format not in ("som", "text", "markdown", "links"):
            raise ValueError(
                f"output_format must be one of 'som', 'text', 'markdown', 'links'; got {output_format!r}"
            )
        self.urls = urls
        self.output_format = output_format
        self.timeout = timeout
        self.selector = selector
        self.extra_headers = extra_headers or {}
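
The excerpt stops partway through the constructor, so the following is only a minimal sketch of the behaviour it describes, not the library's implementation. `LoaderConfigSketch` and `fetch_with_fallback` are hypothetical names introduced for illustration; the first copies the argument handling shown above, and the second shows one plausible reading of "retry with ChromiumLoader on empty or error responses". How the real class stores `fallback_to_chrome` and `chrome_kwargs`, and whether it re-raises when the fallback is disabled, are assumptions.

```python
from typing import Callable, List, Optional

ALLOWED_FORMATS = ("som", "text", "markdown", "links")


class LoaderConfigSketch:
    """Hypothetical stand-in that copies only the constructor logic shown above."""

    def __init__(
        self,
        urls: List[str],
        *,  # everything after this marker must be passed by keyword
        output_format: str = "text",
        timeout: int = 30,
        selector: Optional[str] = None,
        extra_headers: Optional[dict] = None,
        fallback_to_chrome: bool = False,
        **chrome_kwargs,
    ):
        if output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported output_format: {output_format!r}")
        self.urls = urls
        self.output_format = output_format
        self.timeout = timeout
        self.selector = selector
        self.extra_headers = extra_headers or {}  # None becomes a fresh dict
        # Assumption: the real class keeps these two as well; that part of
        # the source is not visible in the excerpt.
        self.fallback_to_chrome = fallback_to_chrome
        self.chrome_kwargs = chrome_kwargs


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    use_fallback: bool,
) -> str:
    """Try `primary`; on an exception or empty content, use `fallback` if enabled."""
    try:
        content = primary(url)
    except Exception:
        if not use_fallback:
            raise  # assumption: errors propagate when the fallback is off
        return fallback(url)
    if not content.strip() and use_fallback:
        return fallback(url)
    return content


# Unknown keyword arguments are collected into chrome_kwargs.
# `headless` is an illustrative key, not a documented option.
cfg = LoaderConfigSketch(["https://example.com"], output_format="markdown", headless=True)
print(cfg.chrome_kwargs)  # {'headless': True}

# An empty primary result triggers the fallback only when it is enabled.
print(fetch_with_fallback("https://example.com", lambda u: "", lambda u: "rendered", True))  # rendered
```

Because of the bare `*` in the signature, a call such as `LoaderConfigSketch(urls, "som")` raises `TypeError`; options have to be named, which keeps call sites readable when several are set at once.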

Callers 2

_make_loader · Function · 0.90
handle_web_source · Method · 0.85

Calls

no outgoing calls

Tested by 1

_make_loader · Function · 0.72