Construct a global _metadata file from a list of parquet files. Dask's read_parquet function is designed to leverage a global _metadata file whenever one is available. The to_parquet function will generate this file automatically by default, but it may not exist if the dataset was generated outside of Dask.
create_metadata_file(
paths,
root_dir=None,
out_dir=None,
engine="pyarrow",
storage_options=None,
split_every=32,
compute=True,
compute_kwargs=None,
fs=None,
)
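When `root_dir` is left at its default of None, a common root directory is inferred from `paths`, and the `file_path` fields are written relative to it. The sketch below illustrates that idea with the standard library only. The helpers `infer_common_root` and `relative_file_paths` are hypothetical names introduced for illustration; this is not Dask's implementation, and it assumes POSIX-style paths that already have any protocol prefix removed.

```python
import posixpath


def infer_common_root(paths):
    """Return the deepest directory shared by every path in ``paths``.

    Hypothetical helper for illustration; not Dask's implementation.
    """
    if not paths:
        raise ValueError("paths must contain at least one file")
    # Compare parent directories so that a single file yields its own folder.
    parents = [posixpath.dirname(p) for p in paths]
    return posixpath.commonpath(parents)


def relative_file_paths(paths, root_dir=None):
    """Express each path relative to ``root_dir`` (inferred when None)."""
    root = infer_common_root(paths) if root_dir is None else root_dir
    return root, [posixpath.relpath(p, root) for p in paths]


paths = [
    "/data/sales/year=2023/part.0.parquet",
    "/data/sales/year=2024/part.0.parquet",
]
root, rel = relative_file_paths(paths)
print(root)  # /data/sales
print(rel)   # ['year=2023/part.0.parquet', 'year=2024/part.0.parquet']
```

Passing an explicit `root_dir` skips the inference step, which may be preferable when the listed files cover only part of a larger dataset.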
def create_metadata_file(
    paths,
    root_dir=None,
    out_dir=None,
    engine="pyarrow",
    storage_options=None,
    split_every=32,
    compute=True,
    compute_kwargs=None,
    fs=None,
):
    """Construct a global _metadata file from a list of parquet files.

    Dask's read_parquet function is designed to leverage a global
    _metadata file whenever one is available. The to_parquet
    function will generate this file automatically by default, but it
    may not exist if the dataset was generated outside of Dask. This
    utility provides a mechanism to generate a _metadata file from a
    list of existing parquet files.

    Parameters
    ----------
    paths : list(string)
        List of files to collect footer metadata from.
    root_dir : string, optional
        Root directory of dataset. The `file_path` fields in the new
        _metadata file will be relative to this directory. If None, a common
        root directory will be inferred.
    out_dir : string or False, optional
        Directory location to write the final _metadata file. By default,
        this will be set to `root_dir`. If False is specified, the global
        metadata will be returned as an in-memory object (and will not be
        written to disk).
    engine : str or Engine, default 'pyarrow'
        Parquet Engine to use. Only 'pyarrow' is supported if a string
        is passed.
    storage_options : dict, optional
        Key/value pairs to be passed on to the file-system backend, if any.
    split_every : int, optional
        The final metadata object that is written to _metadata can be much
        smaller than the list of footer metadata. In order to avoid the
        aggregation of all metadata within a single task, a tree reduction
        is used. This argument specifies the maximum number of metadata
        inputs to be handled by any one task in the tree. Defaults to 32.
    compute : bool, optional
        If True (default) then the result is computed immediately. If False
        then a ``dask.delayed`` object is returned for future computation.
    compute_kwargs : dict, optional
        Options to be passed in to the compute method.
    fs : fsspec object, optional
        File-system instance to use for file handling. If prefixes have
        been removed from the elements of ``paths`` before calling this
        function, an ``fs`` argument must be provided to ensure correct
        behavior on remote file systems ("naked" paths cannot be used
        to infer file-system information).
    """
    if isinstance(engine, str):
        engine = get_engine(engine)
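The `split_every` description above refers to a tree reduction in which no single task handles more than `split_every` metadata inputs. A minimal sketch of that pattern in plain Python follows. The names `tree_reduce` and `merge_metadata` are hypothetical, the "metadata" is stood in for by lists of strings, and the rounds run sequentially here rather than as Dask tasks, so this shows the shape of the reduction rather than the library's actual code.

```python
def merge_metadata(chunks):
    """Stand-in aggregation: concatenate lists of per-file records."""
    merged = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged


def tree_reduce(items, combine, split_every=32):
    """Combine ``items`` in rounds so no call sees more than ``split_every`` inputs.

    Returns the final combined value and the number of rounds performed.
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    if not items:
        raise ValueError("items must not be empty")
    level = list(items)
    depth = 0
    while len(level) > 1:
        # Each slice corresponds to one task in the tree.
        level = [
            combine(level[i : i + split_every])
            for i in range(0, len(level), split_every)
        ]
        depth += 1
    return level[0], depth


# 100 footers with split_every=32: 100 -> 4 partial results -> 1 final result.
footers = [[f"file{i}.parquet"] for i in range(100)]
result, depth = tree_reduce(footers, merge_metadata, split_every=32)
print(len(result), depth)  # 100 2
```

A smaller `split_every` lowers the peak amount of metadata held by any one task at the cost of more rounds; a larger value does the opposite.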