hub / github.com/ArchiveBox/ArchiveBox / wget_output_path

Function wget_output_path

archivebox/extractors/wget.py:129–205 · view source on GitHub ↗

calculate the path to the wgetted .html file, since wget may adjust some paths to be different than the base_url path. See docs on wget --adjust-extension (-E)

(link: Link)

Source from the content-addressed store, hash-verified

127
128	@enforce_types
129	def wget_output_path(link: Link) -> Optional[str]:
130	"""calculate the path to the wgetted .html file, since wget may
131	adjust some paths to be different than the base_url path.
132
133	See docs on wget --adjust-extension (-E)
134	"""
135
136	# Wget downloads can save in a number of different ways depending on the url:
137	# https://example.com
138	# > example.com/index.html
139	# https://example.com?v=zzVa_tX1OiI
140	# > example.com/index.html?v=zzVa_tX1OiI.html
141	# https://www.example.com/?v=zzVa_tX1OiI
142	# > example.com/index.html?v=zzVa_tX1OiI.html
143
144	# https://example.com/abc
145	# > example.com/abc.html
146	# https://example.com/abc/
147	# > example.com/abc/index.html
148	# https://example.com/abc?v=zzVa_tX1OiI.html
149	# > example.com/abc?v=zzVa_tX1OiI.html
150	# https://example.com/abc/?v=zzVa_tX1OiI.html
151	# > example.com/abc/index.html?v=zzVa_tX1OiI.html
152
153	# https://example.com/abc/test.html
154	# > example.com/abc/test.html
155	# https://example.com/abc/test?v=zzVa_tX1OiI
156	# > example.com/abc/test?v=zzVa_tX1OiI.html
157	# https://example.com/abc/test/?v=zzVa_tX1OiI
158	# > example.com/abc/test/index.html?v=zzVa_tX1OiI.html
159
160	# There's also lots of complexity around how the urlencoding and renaming
161	# is done for pages with query and hash fragments or extensions like shtml / htm / php / etc
162
163	# Since the wget algorithm for -E (appending .html) is incredibly complex
164	# and there's no way to get the computed output path from wget
165	# in order to avoid having to reverse-engineer how they calculate it,
166	# we just look in the output folder read the filename wget used from the filesystem
167	full_path = without_fragment(without_query(path(link.url))).strip('/')
168	search_dir = Path(link.link_dir) / domain(link.url).replace(":", "+") / urldecode(full_path)
169	for _ in range(4):
170	if search_dir.exists():
171	if search_dir.is_dir():
172	html_files = [
173	f for f in search_dir.iterdir()
174	if re.search(".+\\.[Ss]?[Hh][Tt][Mm][Ll]?$", str(f), re.I \| re.M)
175	]
176	if html_files:
177	return str(html_files[0].relative_to(link.link_dir))
178
179	# sometimes wget'd URLs have no ext and return non-html
180	# e.g. /some/example/rss/all -> some RSS XML content)
181	# /some/other/url.o4g -> some binary unrecognized ext)
182	# test this with archivebox add --depth=1 https://getpocket.com/users/nikisweeting/feed/all
183	last_part_of_url = urldecode(full_path.rsplit('/', 1)[-1])
184	for file_present in search_dir.iterdir():
185	if file_present == last_part_of_url:
186	return str((search_dir / file_present).relative_to(link.link_dir))

Callers 4

canonical_outputsMethod · 0.85

link_details_templateFunction · 0.85

should_save_wgetFunction · 0.85

save_wgetFunction · 0.85

Calls

no outgoing calls

Tested by

no test coverage detected