Class ChromiumLoader

scrapegraphai/docloaders/chromium.py:13–490  ·  github.com/ScrapeGraphAI/Scrapegraph-ai

Scrapes HTML pages from URLs using a (headless) instance of the Chromium web driver with proxy protection.

Source

class ChromiumLoader:
    """Scrapes HTML pages from URLs using a (headless) instance of the
    Chromium web driver with proxy protection.

    Attributes:
        backend: The web driver backend library; defaults to 'playwright'.
        browser_config: A dictionary containing additional browser kwargs.
        headless: Whether to run browser in headless mode.
        proxy: A dictionary containing proxy settings; None disables protection.
        urls: A list of URLs to scrape content from.
        requires_js_support: Flag to determine if JS rendering is required.
    """

    def __init__(
        self,
        urls: List[str],
        *,
        backend: str = "playwright",
        headless: bool = True,
        proxy: Optional[Proxy] = None,
        load_state: str = "domcontentloaded",
        requires_js_support: bool = False,
        storage_state: Optional[str] = None,
        browser_name: str = "chromium",  # default chromium
        retry_limit: int = 1,
        timeout: int = 60,
        **kwargs: Any,
    ):
        """Initialize the loader with a list of URL paths.

        Args:
            urls: A list of URLs to scrape content from.
            backend: The web driver backend library; defaults to 'playwright'.
            headless: Whether to run browser in headless mode.
            proxy: A dictionary containing proxy information; None disables protection.
            requires_js_support: Whether to use JS rendering for scraping.
            retry_limit: Maximum number of retry attempts for scraping. Defaults to 1.
            timeout: Maximum time in seconds to wait for scraping. Defaults to 60.
            kwargs: A dictionary containing additional browser kwargs.

        Raises:
            ImportError: If the required backend package is not installed.
        """
        message = (
            f"{backend} is required for ChromiumLoader. "
            f"Please install it with `pip install {backend}`."
        )

        # Fail early with an install hint if the backend package is missing.
        dynamic_import(backend, message)

        # Extra keyword arguments are kept as browser options.
        self.browser_config = kwargs
        self.headless = headless
        self.proxy = parse_or_search_proxy(proxy) if proxy else None
        self.urls = urls
        self.load_state = load_state
        self.requires_js_support = requires_js_support
        self.storage_state = storage_state
        # `backend` is a named parameter, so it can never appear in kwargs;
        # this lookup always resolves to the `backend` argument.
        self.backend = kwargs.get("backend", backend)
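
The constructor builds an install hint and hands it to `dynamic_import` before touching any other state. The body of `dynamic_import` is not part of this listing, so the following is only a sketch of how such a guard could work, assuming it wraps `importlib.import_module` and re-raises with the supplied message. The name `require_backend` is hypothetical and does not exist in the library.

```python
import importlib


def require_backend(modname: str) -> None:
    """Hypothetical stand-in for the import guard at the top of __init__.

    Tries to import `modname` and, if that fails, re-raises ImportError
    with an install hint worded like the one the constructor builds.
    """
    message = (
        f"{modname} is required for ChromiumLoader. "
        f"Please install it with `pip install {modname}`."
    )
    try:
        importlib.import_module(modname)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here as well.
        raise ImportError(message) from exc


# A standard-library module imports cleanly, so nothing is raised.
require_backend("json")

# A module that does not exist triggers the install hint.
try:
    require_backend("no_such_backend_pkg")
except ImportError as err:
    hint = str(err)
    print(hint)
```

Running the guard first means a missing backend surfaces at construction time with an actionable message, rather than later when scraping starts.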

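
The signature puts a bare `*` after `urls`, so every other option is keyword-only, and any keyword the signature does not name is collected into `browser_config`. The sketch below copies only that argument handling into a hypothetical stand-in class, `LoaderArgsSketch`. It starts no browser and stores the proxy unparsed. The `"selenium"` value and the `slow_mo` option are illustrative assumptions, not confirmed settings of the real loader.

```python
from typing import Any, Dict, List, Optional


class LoaderArgsSketch:
    """Hypothetical stand-in that copies only the argument handling.

    It is not the real ChromiumLoader: no browser is started and the
    proxy is stored unparsed.
    """

    def __init__(
        self,
        urls: List[str],
        *,
        backend: str = "playwright",
        headless: bool = True,
        proxy: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> None:
        self.urls = urls
        self.headless = headless
        self.proxy = proxy
        # Keywords the signature does not name are kept as browser options.
        self.browser_config = kwargs
        # `backend` is a named parameter, so kwargs can never hold it and
        # this lookup always falls back to the named value.
        self.backend = kwargs.get("backend", backend)


loader = LoaderArgsSketch(
    ["https://example.com"],
    backend="selenium",  # illustrative value
    headless=False,
    slow_mo=50,  # hypothetical extra browser option
)

print(loader.backend)         # selenium
print(loader.browser_config)  # {'slow_mo': 50}

# Options after `urls` are keyword-only, so passing one positionally fails.
try:
    LoaderArgsSketch(["https://example.com"], "selenium")
except TypeError as err:
    positional_error = type(err).__name__
    print(positional_error)  # TypeError
```

Because `backend` is consumed by the named parameter, it never reaches `browser_config`, which is why the `kwargs.get("backend", backend)` lookup in the listing is equivalent to using `backend` directly.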