Function create_metadata_file

dask/dataframe/io/parquet/core.py:226–356

Construct a global _metadata file from a list of parquet files. Dask's read_parquet function is designed to leverage a global _metadata file whenever one is available. The to_parquet function will generate this file automatically by default, but it may not exist if the dataset was generated outside of Dask. This utility provides a mechanism to generate a _metadata file from a list of existing parquet files.

(
    paths,
    root_dir=None,
    out_dir=None,
    engine="pyarrow",
    storage_options=None,
    split_every=32,
    compute=True,
    compute_kwargs=None,
    fs=None,
)
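A minimal usage sketch of the signature above. The wrapper name `build_global_metadata` and its `write` flag are hypothetical and exist only for illustration; the import path follows the file location given on this page, and the call assumes dask and pyarrow are installed.

```python
def build_global_metadata(paths, root_dir=None, write=True):
    """Hypothetical helper: build a global _metadata file for existing parquet files."""
    # Imported lazily so that dask and pyarrow are only required when called.
    from dask.dataframe.io.parquet.core import create_metadata_file

    return create_metadata_file(
        paths,
        root_dir=root_dir,
        # out_dir=None writes _metadata under root_dir (per the docstring);
        # out_dir=False returns the metadata as an in-memory object instead.
        out_dir=None if write else False,
        engine="pyarrow",
        split_every=32,
        compute=True,
    )


# Example call (file names are placeholders, not real data):
# build_global_metadata(
#     ["dataset/part.0.parquet", "dataset/part.1.parquet"],
#     root_dir="dataset",
# )
```

Passing `write=False` here is a convenient way to inspect the aggregated metadata without touching the dataset directory.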

Source

def create_metadata_file(
    paths,
    root_dir=None,
    out_dir=None,
    engine="pyarrow",
    storage_options=None,
    split_every=32,
    compute=True,
    compute_kwargs=None,
    fs=None,
):
    """Construct a global _metadata file from a list of parquet files.

    Dask's read_parquet function is designed to leverage a global
    _metadata file whenever one is available.  The to_parquet
    function will generate this file automatically by default, but it
    may not exist if the dataset was generated outside of Dask.  This
    utility provides a mechanism to generate a _metadata file from a
    list of existing parquet files.

    Parameters
    ----------
    paths : list(string)
        List of files to collect footer metadata from.
    root_dir : string, optional
        Root directory of dataset.  The `file_path` fields in the new
        _metadata file will be relative to this directory.  If None, a
        common root directory will be inferred.
    out_dir : string or False, optional
        Directory location to write the final _metadata file.  By default,
        this will be set to `root_dir`.  If False is specified, the global
        metadata will be returned as an in-memory object (and will not be
        written to disk).
    engine : str or Engine, default 'pyarrow'
        Parquet Engine to use. Only 'pyarrow' is supported if a string
        is passed.
    storage_options : dict, optional
        Key/value pairs to be passed on to the file-system backend, if any.
    split_every : int, optional
        The final metadata object that is written to _metadata can be much
        smaller than the list of footer metadata. In order to avoid the
        aggregation of all metadata within a single task, a tree reduction
        is used.  This argument specifies the maximum number of metadata
        inputs to be handled by any one task in the tree. Defaults to 32.
    compute : bool, optional
        If True (default) then the result is computed immediately. If False
        then a ``dask.delayed`` object is returned for future computation.
    compute_kwargs : dict, optional
        Options to be passed in to the compute method.
    fs : fsspec object, optional
        File-system instance to use for file handling. If prefixes have
        been removed from the elements of ``paths`` before calling this
        function, an ``fs`` argument must be provided to ensure correct
        behavior on remote file systems ("naked" paths cannot be used
        to infer file-system information).
    """
    if isinstance(engine, str):
        engine = get_engine(engine)

Callers

nothing calls this directly

Calls 7

_sort_and_analyze_paths · Function · 0.90
Delayed · Class · 0.90
get_engine · Function · 0.85
min · Function · 0.85
from_collections · Method · 0.80
tokenize · Function · 0.50
compute · Method · 0.45

Tested by

no test coverage detected
