hub / github.com/ScrapeGraphAI/Scrapegraph-ai / execute

Method execute

scrapegraphai/nodes/robots_node.py:57–131

Checks whether a website may be scraped according to its robots.txt file and updates the state with the scrapeability status. The method constructs a prompt for the language model, submits it, and parses the output to determine if scraping is allowed. Args: state (dict), the current state of the graph.
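The final "parse the output and decide" step described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the function name `check_verdict`, the `force_scraping` flag, and the assumption that the model answers with a leading "yes" or "no" are all hypothetical, inferred only from the docstring's note that a ValueError is raised when the site is not scrapeable and scraping is not enforced.

```python
def check_verdict(llm_reply: str, force_scraping: bool = False) -> str:
    """Return the first comma-separated item of the reply, or raise if disallowed.

    Hypothetical helper: the reply format ("yes"/"no") is an assumption.
    """
    # Split the reply the way a comma-separated list parser would,
    # trimming whitespace around each item.
    items = [part.strip() for part in llm_reply.split(",")]
    verdict = items[0].lower()

    # Refuse to continue only when the verdict is negative and the
    # caller has not chosen to force scraping.
    if verdict == "no" and not force_scraping:
        raise ValueError("The website is not scrapeable according to robots.txt")
    return verdict
```

With `force_scraping=True` the negative verdict is returned instead of raising, which lets the caller log a warning and carry on.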

(self, state: dict)

Source from the content-addressed store, hash-verified

        )

    def execute(self, state: dict) -> dict:
        """
        Checks if a website is scrapeable based on the robots.txt file and updates the state
        with the scrapeability status. The method constructs a prompt for the language model,
        submits it, and parses the output to determine if scraping is allowed.

        Args:
            state (dict): The current state of the graph. The input keys will be used to fetch the

        Returns:
            dict: The updated state with the output key containing the scrapeability status.

        Raises:
            KeyError: If the input keys are not found in the state, indicating that the
                necessary information for checking scrapeability is missing.
            KeyError: If the large language model is not found in the robots_dictionary.
            ValueError: If the website is not scrapeable based on the robots.txt file and
                scraping is not enforced.
        """

        self.logger.info(f"--- Executing {self.node_name} Node ---")

        input_keys = self.get_input_keys(state)

        input_data = [state[key] for key in input_keys]

        source = input_data[0]
        output_parser = CommaSeparatedListOutputParser()

        if not source.startswith("http"):
            raise ValueError("Operation not allowed")

        else:
            parsed_url = urlparse(source)
            base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
            from langchain_community.document_loaders import AsyncChromiumLoader
            loader = AsyncChromiumLoader(f"{base_url}/robots.txt")
            document = loader.load()
            if "ollama" in self.llm_model.model:
                self.llm_model.model = self.llm_model.model.split("/")[-1]
                model = self.llm_model.model.split("/")[-1]
            else:
                model = self.llm_model.model
            try:
                agent = robots_dictionary[model]

            except KeyError:
                agent = model

            prompt = PromptTemplate(
                template=TEMPLATE_ROBOT,
                input_variables=["path"],
                partial_variables={"context": document, "agent": agent},
            )

            chain = prompt | self.llm_model | output_parser
            is_scrapable = chain.invoke({"path": source})[0]

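Two small steps in the listing above can be tried in isolation: deriving the robots.txt URL from the source URL, and resolving the agent name with a fallback to the model name. The sketch below uses only the standard library. The helper names `robots_txt_url` and `resolve_agent` and the `EXAMPLE_AGENTS` mapping are hypothetical stand-ins and do not reflect the actual contents of `robots_dictionary`.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for robots_dictionary; the entry is illustrative only.
EXAMPLE_AGENTS = {"gpt-4o": "GPTBot"}


def robots_txt_url(source: str) -> str:
    """Build the robots.txt URL for the site that hosts `source`."""
    # Same guard as the listing: only http(s) sources are accepted.
    if not source.startswith("http"):
        raise ValueError("Operation not allowed")
    parsed = urlparse(source)
    # Keep scheme and host, drop path, query, and fragment.
    return f"{parsed.scheme}://{parsed.netloc}/robots.txt"


def resolve_agent(model: str, agents: dict) -> str:
    """Map a model name to a crawler agent, falling back to the model name."""
    # Mirror the listing: strip the provider prefix only for ollama models.
    if "ollama" in model:
        model = model.split("/")[-1]
    # dict.get with a default replaces the try/except KeyError pattern.
    return agents.get(model, model)
```

Using `dict.get` with a default behaves like the `try`/`except KeyError` block in the listing: an unknown model simply becomes its own agent name.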

Callers

nothing calls this directly

Calls 6

get_input_keys · Method · 0.80
invoke · Method · 0.80
warning · Method · 0.80
update · Method · 0.80
info · Method · 0.45
load · Method · 0.45

Tested by

no test coverage detected