Function create_metadata_file

dask/dataframe/io/parquet/core.py:226–356

Construct a global _metadata file from a list of parquet files. Dask's read_parquet function is designed to leverage a global _metadata file whenever one is available. The to_parquet function will generate this file automatically by default, but it may not exist if the dataset was generated outside of Dask.

(
    paths,
    root_dir=None,
    out_dir=None,
    engine="pyarrow",
    storage_options=None,
    split_every=32,
    compute=True,
    compute_kwargs=None,
    fs=None,
)
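According to the docstring, when `root_dir` is left as None a common root directory is inferred, and the `file_path` fields in the new _metadata file are written relative to it. The following is a minimal sketch of that idea, assuming absolute POSIX-style paths without protocol prefixes. `infer_root_and_relative` is a hypothetical helper written for illustration; it is not Dask's implementation.

```python
import posixpath
from typing import List, Optional, Tuple


def infer_root_and_relative(
    paths: List[str], root_dir: Optional[str] = None
) -> Tuple[str, List[str]]:
    """Return (root, relative_paths) for a list of file paths.

    Hypothetical helper: assumes absolute POSIX-style paths.
    """
    if root_dir is None:
        # Infer the deepest directory shared by every file's parent directory.
        root_dir = posixpath.commonpath([posixpath.dirname(p) for p in paths])
    # Express each file path relative to the chosen root.
    return root_dir, [posixpath.relpath(p, root_dir) for p in paths]


# Example file layout (made up for illustration).
paths = [
    "/data/sales/year=2023/part.0.parquet",
    "/data/sales/year=2023/part.1.parquet",
    "/data/sales/year=2024/part.0.parquet",
]

root, relative = infer_root_and_relative(paths)
explicit_root, explicit_relative = infer_root_and_relative(paths, root_dir="/data")
```

Passing an explicit root higher in the tree, as in the last line, simply lengthens every relative path by the extra directory levels.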

Source from the content-addressed store, hash-verified

def create_metadata_file(
    paths,
    root_dir=None,
    out_dir=None,
    engine="pyarrow",
    storage_options=None,
    split_every=32,
    compute=True,
    compute_kwargs=None,
    fs=None,
):
    """Construct a global _metadata file from a list of parquet files.

    Dask's read_parquet function is designed to leverage a global
    _metadata file whenever one is available. The to_parquet
    function will generate this file automatically by default, but it
    may not exist if the dataset was generated outside of Dask. This
    utility provides a mechanism to generate a _metadata file from a
    list of existing parquet files.

    Parameters
    ----------
    paths : list(string)
        List of files to collect footer metadata from.
    root_dir : string, optional
        Root directory of dataset. The `file_path` fields in the new
        _metadata file will be relative to this directory. If None, a
        common root directory will be inferred.
    out_dir : string or False, optional
        Directory location to write the final _metadata file. By default,
        this will be set to `root_dir`. If False is specified, the global
        metadata will be returned as an in-memory object (and will not be
        written to disk).
    engine : str or Engine, default 'pyarrow'
        Parquet Engine to use. Only 'pyarrow' is supported if a string
        is passed.
    storage_options : dict, optional
        Key/value pairs to be passed on to the file-system backend, if any.
    split_every : int, optional
        The final metadata object that is written to _metadata can be much
        smaller than the list of footer metadata. In order to avoid the
        aggregation of all metadata within a single task, a tree reduction
        is used. This argument specifies the maximum number of metadata
        inputs to be handled by any one task in the tree. Defaults to 32.
    compute : bool, optional
        If True (default) then the result is computed immediately. If False
        then a ``dask.delayed`` object is returned for future computation.
    compute_kwargs : dict, optional
        Options to be passed in to the compute method
    fs : fsspec object, optional
        File-system instance to use for file handling. If prefixes have
        been removed from the elements of ``paths`` before calling this
        function, an ``fs`` argument must be provided to ensure correct
        behavior on remote file systems ("naked" paths cannot be used
        to infer file-system information).
    """
    if isinstance(engine, str):
        engine = get_engine(engine)
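The `split_every` parameter documented above caps how many metadata inputs any one task in the tree reduction handles. The sketch below shows the general technique in plain Python, with integers standing in for footer metadata. `tree_reduce` and `merge` are hypothetical names chosen for illustration; this is not the task graph Dask actually builds.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tree_reduce(
    items: Sequence[T],
    combine: Callable[[List[T]], T],
    split_every: int = 32,
) -> T:
    """Reduce `items` in rounds so that no single `combine` call
    receives more than `split_every` inputs."""
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    level = list(items)
    if not level:
        raise ValueError("need at least one item")
    # Each round merges consecutive batches; repeat until one result remains.
    while len(level) > 1:
        level = [
            combine(level[i : i + split_every])
            for i in range(0, len(level), split_every)
        ]
    return level[0]


# Stand-in for footer metadata: a row-group count per file (made-up data).
footers = [3] * 100
batch_sizes: List[int] = []


def merge(batch: List[int]) -> int:
    # Record how many inputs this "task" handled, then aggregate them.
    batch_sizes.append(len(batch))
    return sum(batch)


total_row_groups = tree_reduce(footers, merge, split_every=32)
```

With 100 inputs and `split_every=32`, the first round runs four merges (three of 32 inputs, one of 4) and the second round merges the four partial results, so no single step ever sees all 100 inputs at once.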

Callers

nothing calls this directly

Calls 7

_sort_and_analyze_paths · Function · 0.90

Delayed · Class · 0.90

get_engine · Function · 0.85

min · Function · 0.85

from_collections · Method · 0.80

tokenize · Function · 0.50

compute · Method · 0.45

Tested by

no test coverage detected
