
Class ChromiumLoader

scrapegraphai/docloaders/chromium.py:13–490

Scrapes HTML pages from URLs using a (headless) instance of the Chromium web driver with proxy protection.


from typing import Any, List, Optional

# Proxy, dynamic_import and parse_or_search_proxy are project helpers that are
# defined or imported elsewhere in this module (not shown in this excerpt).


class ChromiumLoader:
    """Scrapes HTML pages from URLs using a (headless) instance of the
    Chromium web driver with proxy protection.

    Attributes:
        backend: The web driver backend library; defaults to 'playwright'.
        browser_config: A dictionary containing additional browser kwargs.
        headless: Whether to run browser in headless mode.
        proxy: A dictionary containing proxy settings; None disables protection.
        urls: A list of URLs to scrape content from.
        requires_js_support: Flag to determine if JS rendering is required.
    """

    def __init__(
        self,
        urls: List[str],
        *,
        backend: str = "playwright",
        headless: bool = True,
        proxy: Optional[Proxy] = None,
        load_state: str = "domcontentloaded",
        requires_js_support: bool = False,
        storage_state: Optional[str] = None,
        browser_name: str = "chromium",  # default chromium
        retry_limit: int = 1,
        timeout: int = 60,
        **kwargs: Any,
    ):
        """Initialize the loader with a list of URLs.

        Args:
            urls: A list of URLs to scrape content from.
            backend: The web driver backend library; defaults to 'playwright'.
            headless: Whether to run browser in headless mode.
            proxy: A dictionary containing proxy information; None disables protection.
            load_state: The page load state to wait for; defaults to 'domcontentloaded'.
            requires_js_support: Whether to use JS rendering for scraping.
            storage_state: Optional stored browser state to reuse; defaults to None.
            browser_name: The browser to launch; defaults to 'chromium'.
            retry_limit: Maximum number of retry attempts for scraping. Defaults to 1.
            timeout: Maximum time in seconds to wait for scraping. Defaults to 60.
            kwargs: A dictionary containing additional browser kwargs.

        Raises:
            ImportError: If the required backend package is not installed.
        """
        message = (
            f"{backend} is required for ChromiumLoader. "
            f"Please install it with `pip install {backend}`."
        )

        # Fail early, with an install hint, if the backend package is missing.
        dynamic_import(backend, message)

        # Any extra keyword arguments are kept as browser options.
        self.browser_config = kwargs
        self.headless = headless
        self.proxy = parse_or_search_proxy(proxy) if proxy else None
        self.urls = urls
        self.load_state = load_state
        self.requires_js_support = requires_js_support
        self.storage_state = storage_state
        # `backend` is a named keyword-only parameter, so it can never appear
        # in kwargs; this lookup always resolves to the `backend` argument.
        self.backend = kwargs.get("backend", backend)
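
The constructor flow above can be sketched with a self-contained stand-in. `LoaderSketch` and `require_backend` are hypothetical names that are not part of Scrapegraph-ai. `require_backend` only approximates what `dynamic_import` appears to do (the real helper is not shown in this excerpt), and the default backend is set to the standard-library module `json` purely so that the sketch runs without Playwright installed.

```python
import importlib
from typing import Any, Dict, List, Optional


def require_backend(name: str, message: str) -> None:
    """Hypothetical stand-in for dynamic_import: fail early if `name` is missing."""
    try:
        importlib.import_module(name)
    except ImportError as exc:
        # Re-raise with the caller's install hint instead of the bare error.
        raise ImportError(message) from exc


class LoaderSketch:
    """Hypothetical stand-in that mirrors the constructor flow shown above."""

    def __init__(
        self,
        urls: List[str],
        *,  # everything after this marker must be passed by keyword
        backend: str = "json",  # assumption: a stdlib module, so the sketch runs anywhere
        headless: bool = True,
        proxy: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> None:
        require_backend(
            backend,
            f"{backend} is required for LoaderSketch. "
            f"Please install it with `pip install {backend}`.",
        )
        self.urls = urls
        self.backend = backend
        self.headless = headless
        self.proxy = proxy if proxy else None
        # Keyword arguments that match no named parameter are kept as
        # extra browser options.
        self.browser_config = kwargs


loader = LoaderSketch(["https://example.com"], headless=False, slow_mo=50)
print(loader.browser_config)
```

In this sketch, `slow_mo` matches no named parameter, so it is collected into `browser_config`, while a missing backend package surfaces as an `ImportError` carrying the install hint. Because of the bare `*` in the signature, passing the backend positionally raises a `TypeError`.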

Callers 15

loader_with_dummy · Function · 0.90

test_scrape_method_unsupported_backend · Function · 0.90

test_scrape_method_selenium · Function · 0.90

test_ascrape_playwright_scroll · Function · 0.90

test_ascrape_with_js_support · Function · 0.90

test_scrape_method_playwright · Function · 0.90

test_scrape_method_retry_logic · Function · 0.90

test_ascrape_playwright_scroll_invalid_params · Function · 0.90

test_ascrape_with_js_support_retry_failure · Function · 0.90

test_ascrape_undetected_chromedriver_success · Function · 0.90

test_ascrape_undetected_chromedriver_unsupported_browser · Function · 0.90

test_alazy_load_partial_failure · Function · 0.90

Calls

no outgoing calls

Tested by 15

loader_with_dummy · Function · 0.72

test_scrape_method_unsupported_backend · Function · 0.72

test_scrape_method_selenium · Function · 0.72

test_ascrape_playwright_scroll · Function · 0.72

test_ascrape_with_js_support · Function · 0.72

test_scrape_method_playwright · Function · 0.72

test_scrape_method_retry_logic · Function · 0.72

test_ascrape_playwright_scroll_invalid_params · Function · 0.72

test_ascrape_with_js_support_retry_failure · Function · 0.72

test_ascrape_undetected_chromedriver_success · Function · 0.72

test_ascrape_undetected_chromedriver_unsupported_browser · Function · 0.72

test_alazy_load_partial_failure · Function · 0.72