class ChromiumLoader:
    """Scrapes HTML pages from URLs using a (headless) instance of the
    Chromium web driver with proxy protection.

    Attributes:
        backend: The web driver backend library; defaults to 'playwright'.
        browser_config: A dictionary containing additional browser kwargs.
        headless: Whether to run the browser in headless mode.
        proxy: A dictionary containing proxy settings; None disables protection.
        urls: A list of URLs to scrape content from.
        requires_js_support: Flag to determine if JS rendering is required.
    """

    def __init__(
        self,
        urls: List[str],
        *,
        backend: str = "playwright",
        headless: bool = True,
        proxy: Optional[Proxy] = None,
        load_state: str = "domcontentloaded",
        requires_js_support: bool = False,
        storage_state: Optional[str] = None,
        browser_name: str = "chromium",
        retry_limit: int = 1,
        timeout: int = 60,
        **kwargs: Any,
    ):
        """Initialize the loader with a list of URLs.

        Args:
            urls: A list of URLs to scrape content from.
            backend: The web driver backend library; defaults to 'playwright'.
            headless: Whether to run the browser in headless mode.
            proxy: A dictionary containing proxy information; None disables protection.
            load_state: The page load state to wait for. Defaults to 'domcontentloaded'.
            requires_js_support: Whether to use JS rendering for scraping.
            storage_state: Optional saved browser storage state. Defaults to None.
            browser_name: The browser to use. Defaults to 'chromium'.
            retry_limit: Maximum number of retry attempts for scraping. Defaults to 1.
            timeout: Maximum time in seconds to wait for scraping. Defaults to 60.
            kwargs: A dictionary containing additional browser kwargs.

        Raises:
            ImportError: If the required backend package is not installed.
        """
        message = (
            f"{backend} is required for ChromiumLoader. "
            f"Please install it with `pip install {backend}`."
        )

        # Fail early with an install hint if the backend package is missing.
        dynamic_import(backend, message)

        self.browser_config = kwargs
        self.headless = headless
        self.proxy = parse_or_search_proxy(proxy) if proxy else None
        self.urls = urls
        self.load_state = load_state
        self.requires_js_support = requires_js_support
        self.storage_state = storage_state
        # `backend` is a named keyword-only parameter, so it can never end up
        # in **kwargs; assigning it directly is equivalent to the old lookup.
        self.backend = backend
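
The import guard at the top of `__init__` can be sketched as follows. The `dynamic_import` helper itself is not shown in this listing, so `require_module` and `install_hint` below are hypothetical stand-ins built on the standard library's `importlib`, not the project's actual implementation. The sketch only illustrates the general pattern of importing a backend by name and replacing a bare import failure with an install hint.

```python
import importlib
from types import ModuleType


def install_hint(backend: str) -> str:
    """Build the same install hint that ChromiumLoader.__init__ assembles."""
    return (
        f"{backend} is required for ChromiumLoader. "
        f"Please install it with `pip install {backend}`."
    )


def require_module(name: str, message: str) -> ModuleType:
    """Import `name`, raising ImportError with `message` if it is missing.

    Hypothetical stand-in for the `dynamic_import` helper (an assumption).
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so it lands here.
        raise ImportError(message) from exc


# A module that ships with Python imports cleanly.
json_module = require_module("json", install_hint("json"))

# A missing package surfaces the install hint instead of a bare error.
try:
    require_module("no_such_backend_pkg", install_hint("no_such_backend_pkg"))
except ImportError as err:
    error_text = str(err)
```

Running the guard before any attribute is assigned means a missing backend fails at construction time rather than on the first scrape.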