hub / github.com/huggingface/datasets / resolve_pattern

Function resolve_pattern

src/datasets/data_files.py:301–404  ·  view source on GitHub ↗

Resolve the paths and URLs of the data files from the pattern passed by the user. You can use patterns to resolve multiple local files, for example *.csv to match all the CSV files at the first level, or **.csv to match all the CSV files at any level.
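As a rough illustration of the pattern examples in the summary above, the sketch below uses Python's standard glob module as a stand-in for the fsspec glob that resolve_pattern relies on. The temporary directory tree, the file names, and the match helper are all hypothetical. The recursive case is written as **/*.csv because that is the form Python's glob supports.

```python
import glob
import os
import tempfile


def make_tree(root):
    """Create a small, made-up directory tree to match against."""
    for rel in ["a.csv", "data/b.csv", "data/sub/c.csv", "data/notes.txt"]:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write("x\n")


def match(root, pattern):
    """Return the files (not directories) under root that match pattern."""
    hits = glob.glob(os.path.join(glob.escape(root), pattern), recursive=True)
    return sorted(
        os.path.relpath(h, root).replace(os.sep, "/")
        for h in hits
        if os.path.isfile(h)
    )


with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    first_level = match(root, "*.csv")       # CSV files at the first level only
    any_level = match(root, "**/*.csv")      # CSV files at any depth
    inside_data = match(root, "data/*")      # files directly inside "data"
    data_recursive = match(root, "data/**")  # files in "data" and its subdirectories

print(first_level)
print(any_level)
print(inside_data)
print(data_recursive)
```

This only approximates the matching rules. resolve_pattern itself also handles remote filesystems and additionally filters hidden and special directories.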

(
    pattern: str,
    base_path: str,
    allowed_extensions: Optional[list[str]] = None,
    download_config: Optional[DownloadConfig] = None,
)
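The allowed_extensions parameter in the signature is described in the docstring as a white-list of file extensions. A minimal sketch of such a filter follows. The helper name filter_by_extension and the sample paths are hypothetical, and checking every dot-separated suffix (so that train.json.gz counts as ".json") is an assumption about the behavior, not something the excerpt shown here confirms.

```python
import posixpath


def filter_by_extension(paths, allowed_extensions=None):
    """Keep paths whose file name carries at least one allowed extension.

    Hypothetical helper: None means "accept every extension".
    """
    if allowed_extensions is None:
        return list(paths)
    kept = []
    for path in paths:
        # "test.json.gz" -> ["json", "gz"]; each suffix is checked in turn.
        suffixes = posixpath.basename(path).split(".")[1:]
        if any("." + suffix in allowed_extensions for suffix in suffixes):
            kept.append(path)
    return kept


paths = ["data/train.csv", "data/test.json.gz", "data/README.md", "data/archive.tar"]
kept = filter_by_extension(paths, [".csv", ".json"])
unfiltered = filter_by_extension(paths)
print(kept)
```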

Source from the content-addressed store, hash-verified

def resolve_pattern(
    pattern: str,
    base_path: str,
    allowed_extensions: Optional[list[str]] = None,
    download_config: Optional[DownloadConfig] = None,
) -> list[str]:
    """
    Resolve the paths and URLs of the data files from the pattern passed by the user.

    You can use patterns to resolve multiple local files. Here are a few examples:
    - *.csv to match all the CSV files at the first level
    - **.csv to match all the CSV files at any level
    - data/* to match all the files inside "data"
    - data/** to match all the files inside "data" and its subdirectories

    The patterns are resolved using the fsspec glob. In fsspec>=2023.12.0 this is equivalent to
    Python's glob.glob, Path.glob, Path.match and fnmatch where ** is unsupported with a prefix/suffix
    other than a forward slash /.

    More generally:
    - '*' matches any character except a forward-slash (to match just the file or directory name)
    - '**' matches any character including a forward-slash /

    Hidden files and directories (i.e. whose names start with a dot) are ignored, unless they are explicitly requested.
    The same applies to special directories that start with a double underscore like "__pycache__".
    You can still include one if the pattern explicitly mentions it:
    - to include a hidden file: "*/.hidden.txt" or "*/.*"
    - to include a hidden directory: ".hidden/*" or ".*/*"
    - to include a special directory: "__special__/*" or "__*/*"

    Example::

        >>> from datasets.data_files import resolve_pattern
        >>> base_path = "."
        >>> resolve_pattern("docs/**/*.py", base_path)
        ['/Users/mariosasko/Desktop/projects/datasets/docs/source/_config.py']

    Args:
        pattern (str): Unix pattern or paths or URLs of the data files to resolve.
            The paths can be absolute or relative to base_path.
            Remote filesystems using fsspec are supported, e.g. with the hf:// protocol.
        base_path (str): Base path to use when resolving relative paths.
        allowed_extensions (Optional[list], optional): White-list of file extensions to use. Defaults to None (all extensions).
            For example: allowed_extensions=[".csv", ".json", ".txt", ".parquet"]
        download_config ([`DownloadConfig`], *optional*): Specific download configuration parameters.
    Returns:
        List[str]: List of paths or URLs to the local or remote files that match the patterns.
    """
    if is_relative_path(pattern):
        pattern = xjoin(base_path, pattern)
    elif is_local_path(pattern):
        base_path = os.path.splitdrive(pattern)[0] + os.sep
    else:
        base_path = ""
    pattern, storage_options = _prepare_path_and_storage_options(pattern, download_config=download_config)
    fs, fs_pattern = url_to_fs(pattern, **storage_options)
    files_to_ignore = set(FILES_TO_IGNORE) - {xbasename(pattern)}
    protocol = (
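
The if/elif/else at the start of the function body decides how pattern and base_path relate: a relative pattern is joined onto base_path, an absolute local pattern resets base_path to its drive root, and anything else (a URL) clears base_path. The sketch below mimics that branching with the standard library only. looks_relative and looks_local are simplified, hypothetical stand-ins for is_relative_path and is_local_path, posixpath.join stands in for xjoin, and the hf:// URL is a made-up example. Windows drive letters are not handled by these stand-ins.

```python
import os
import posixpath
from urllib.parse import urlparse


def looks_relative(path):
    # Hypothetical stand-in for is_relative_path: no URL scheme, not absolute.
    return urlparse(path).scheme == "" and not os.path.isabs(path)


def looks_local(path):
    # Hypothetical stand-in for is_local_path: no URL scheme at all.
    return urlparse(path).scheme == ""


def normalize(pattern, base_path):
    """Mirror the three-way branch at the top of resolve_pattern."""
    if looks_relative(pattern):
        pattern = posixpath.join(base_path, pattern)  # stand-in for xjoin
    elif looks_local(pattern):
        base_path = os.path.splitdrive(pattern)[0] + os.sep
    else:
        base_path = ""
    return pattern, base_path


relative = normalize("data/*.csv", "/home/user/repo")
absolute = normalize("/srv/data/*.csv", "/home/user/repo")
remote = normalize("hf://datasets/username/my_dataset/*.csv", "/home/user/repo")
print(relative)
print(absolute)
print(remote)
```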

Calls 11

is_relative_path · Function · 0.85
xjoin · Function · 0.85
is_local_path · Function · 0.85
xbasename · Function · 0.85
split · Method · 0.80
items · Method · 0.80
glob · Method · 0.80
info · Method · 0.45