Resolve the paths and URLs of the data files from the pattern passed by the user.

You can use patterns to resolve multiple local files. Here are a few examples:
- *.csv to match all the CSV files at the first level
- **.csv to match all the CSV files at any level
- data/* to match all the files inside "data"
resolve_pattern(
    pattern: str,
    base_path: str,
    allowed_extensions: Optional[list[str]] = None,
    download_config: Optional[DownloadConfig] = None,
) -> list[str]
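
To make the pattern rules concrete, here is a minimal sketch that approximates them for local files using only Python's standard library. It is not the `datasets` implementation: `resolve_local_pattern`, `build_demo_tree`, and `relative` are hypothetical helpers written for this illustration. It assumes relative patterns written with forward slashes and does not handle remote URLs. The standard library `glob` treats `**` specially only when it is a whole path component, so the sketch uses `**/*.csv` where the fsspec-based examples use `**.csv`.

```python
import glob
import os
import tempfile
from typing import Optional


def _is_hidden(name: str) -> bool:
    # "." and ".." are navigation components, not hidden names.
    return name.startswith(".") and name not in (".", "..")


def resolve_local_pattern(
    pattern: str,
    base_path: str,
    allowed_extensions: Optional[list[str]] = None,
) -> list[str]:
    """Hypothetical stdlib-only approximation of the documented rules."""
    pattern_parts = pattern.split("/")
    # Hidden or "__special__" names are only kept if the pattern mentions them.
    allow_hidden = any(_is_hidden(part) for part in pattern_parts)
    allow_special = any(part.startswith("__") for part in pattern_parts)

    matches = []
    # recursive=True lets a "**" component span directories; "*" never does.
    for path in glob.glob(os.path.join(base_path, pattern), recursive=True):
        if not os.path.isfile(path):
            continue
        parts = os.path.relpath(path, base_path).split(os.sep)
        if not allow_hidden and any(_is_hidden(part) for part in parts):
            continue
        if not allow_special and any(part.startswith("__") for part in parts):
            continue
        extension = os.path.splitext(path)[1]
        if allowed_extensions is not None and extension not in allowed_extensions:
            continue
        matches.append(path)
    return sorted(matches)


def build_demo_tree(root: str) -> None:
    """Create a small directory tree of empty files to try the patterns on."""
    for name in [
        "a.csv",
        "notes.txt",
        "data/b.csv",
        "data/sub/c.csv",
        ".hidden/d.csv",
        "__pycache__/e.csv",
    ]:
        path = os.path.join(root, *name.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


def relative(paths: list[str], root: str) -> list[str]:
    """Express results relative to root, with forward slashes, for readability."""
    return [os.path.relpath(path, root).replace(os.sep, "/") for path in paths]


with tempfile.TemporaryDirectory() as root:
    build_demo_tree(root)
    for demo_pattern in ["*.csv", "**/*.csv", "data/*", "data/**", ".hidden/*", "__pycache__/*"]:
        print(demo_pattern, "->", relative(resolve_local_pattern(demo_pattern, root), root))
```

In this sketch the hidden and special directories are filtered after globbing, which keeps the matching logic separate from the ignore rules. The real function may order these steps differently.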
def resolve_pattern(
    pattern: str,
    base_path: str,
    allowed_extensions: Optional[list[str]] = None,
    download_config: Optional[DownloadConfig] = None,
) -> list[str]:
    """
    Resolve the paths and URLs of the data files from the pattern passed by the user.

    You can use patterns to resolve multiple local files. Here are a few examples:
    - *.csv to match all the CSV files at the first level
    - **.csv to match all the CSV files at any level
    - data/* to match all the files inside "data"
    - data/** to match all the files inside "data" and its subdirectories

    The patterns are resolved using the fsspec glob. In fsspec>=2023.12.0 this is equivalent to
    Python's glob.glob, Path.glob, Path.match and fnmatch where ** is unsupported with a prefix/suffix
    other than a forward slash /.

    More generally:
    - '*' matches any character except a forward-slash (to match just the file or directory name)
    - '**' matches any character including a forward-slash /

    Hidden files and directories (i.e. whose names start with a dot) are ignored, unless they are explicitly requested.
    The same applies to special directories that start with a double underscore like "__pycache__".
    You can still include one if the pattern explicitly mentions it:
    - to include a hidden file: "*/.hidden.txt" or "*/.*"
    - to include a hidden directory: ".hidden/*" or ".*/*"
    - to include a special directory: "__special__/*" or "__*/*"

    Example::

        >>> from datasets.data_files import resolve_pattern
        >>> base_path = "."
        >>> resolve_pattern("docs/**/*.py", base_path)
        ['/Users/mariosasko/Desktop/projects/datasets/docs/source/_config.py']

    Args:
        pattern (str): Unix pattern or paths or URLs of the data files to resolve.
            The paths can be absolute or relative to base_path.
            Remote filesystems using fsspec are supported, e.g. with the hf:// protocol.
        base_path (str): Base path to use when resolving relative paths.
        allowed_extensions (Optional[list], optional): White-list of file extensions to use. Defaults to None (all extensions).
            For example: allowed_extensions=[".csv", ".json", ".txt", ".parquet"]
        download_config ([`DownloadConfig`], *optional*): Specific download configuration parameters.
    Returns:
        List[str]: List of paths or URLs to the local or remote files that match the patterns.
    """
    if is_relative_path(pattern):
        pattern = xjoin(base_path, pattern)
    elif is_local_path(pattern):
        base_path = os.path.splitdrive(pattern)[0] + os.sep
    else:
        base_path = ""
    pattern, storage_options = _prepare_path_and_storage_options(pattern, download_config=download_config)
    fs, fs_pattern = url_to_fs(pattern, **storage_options)
    files_to_ignore = set(FILES_TO_IGNORE) - {xbasename(pattern)}
    protocol = (