hub / github.com/uber/petastorm / get_schema_from_dataset_url

Function get_schema_from_dataset_url

petastorm/etl/dataset_metadata.py:388–407

Returns a :class:`petastorm.unischema.Unischema` object loaded from a dataset specified by a URL. The dataset is given as a URL to a Parquet directory, or as a list of URLs (all with the same scheme) to Parquet files.

(dataset_url_or_urls, hdfs_driver='libhdfs3', storage_options=None, filesystem=None)
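The first argument may be one URL or a list of URLs that share a scheme. A minimal sketch of how a caller might check that constraint before calling the function is given here. `common_scheme` is a hypothetical helper written for illustration; it is not part of petastorm, and this page does not show whether or how petastorm validates the schemes itself.

```python
from urllib.parse import urlparse


def common_scheme(dataset_url_or_urls):
    """Return the single URL scheme shared by the input, or raise ValueError.

    Hypothetical helper for illustration only; not a petastorm API.
    """
    # Accept either one URL string or an iterable of URL strings.
    if isinstance(dataset_url_or_urls, str):
        urls = [dataset_url_or_urls]
    else:
        urls = list(dataset_url_or_urls)
    if not urls:
        raise ValueError("expected at least one dataset URL")

    # Collect the distinct schemes ("file", "hdfs", "s3", ...).
    schemes = {urlparse(url).scheme for url in urls}
    if len(schemes) != 1:
        raise ValueError("dataset URLs mix schemes: %s" % sorted(schemes))
    return schemes.pop()
```

For example, `common_scheme(["hdfs://nn/a.parquet", "hdfs://nn/b.parquet"])` returns `"hdfs"`, while a list mixing `file://` and `s3://` URLs raises `ValueError`.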


# Excerpt from the module; `pq` refers to `pyarrow.parquet`.
def get_schema_from_dataset_url(dataset_url_or_urls, hdfs_driver='libhdfs3', storage_options=None, filesystem=None):
    """Returns a :class:`petastorm.unischema.Unischema` object loaded from a dataset specified by a url.

    :param dataset_url_or_urls: a url to a parquet directory or a url list (with the same scheme) to parquet files.
    :param hdfs_driver: A string denoting the hdfs driver to use (if using a dataset on hdfs). Current choices are
        libhdfs (java through JNI) or libhdfs3 (C++)
    :param storage_options: Dict of kwargs forwarded to ``fsspec`` to initialize the filesystem.
    :param filesystem: the ``pyarrow.FileSystem`` to use.
    :return: A :class:`petastorm.unischema.Unischema` object
    """
    fs, path_or_paths = get_filesystem_and_path_or_paths(dataset_url_or_urls, hdfs_driver,
                                                         storage_options=storage_options,
                                                         filesystem=filesystem)

    dataset = pq.ParquetDataset(path_or_paths, filesystem=fs, validate_schema=False, metadata_nthreads=10)

    # Get a unischema stored in the dataset metadata.
    stored_schema = get_schema(dataset)

    return stored_schema
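A short usage sketch follows, assuming petastorm is installed and a petastorm-written dataset exists at the URL you pass in. `summarize_fields` and `print_dataset_schema` are hypothetical names introduced here, and the sketch relies on the returned `Unischema` exposing a `fields` mapping whose values carry `numpy_dtype`, `shape`, and `nullable` attributes.

```python
from types import SimpleNamespace


def summarize_fields(fields):
    """Return one descriptive line per field, sorted by field name.

    `fields` is a mapping of name -> object with `numpy_dtype`, `shape`
    and `nullable` attributes (hypothetical helper, not a petastorm API).
    """
    lines = []
    for name in sorted(fields):
        field = fields[name]
        # Types such as `int` or numpy scalar types have a readable __name__.
        dtype = getattr(field.numpy_dtype, "__name__", str(field.numpy_dtype))
        lines.append("%s: dtype=%s shape=%s nullable=%s"
                     % (name, dtype, field.shape, field.nullable))
    return lines


def print_dataset_schema(dataset_url):
    """Load the stored schema for `dataset_url` and print its fields."""
    # Imported lazily so this module loads even when petastorm is absent.
    from petastorm.etl.dataset_metadata import get_schema_from_dataset_url

    schema = get_schema_from_dataset_url(dataset_url)
    for line in summarize_fields(schema.fields):
        print(line)


if __name__ == "__main__":
    # Stand-in field object, so the formatting can be tried without a dataset.
    demo = {"id": SimpleNamespace(numpy_dtype=int, shape=(), nullable=False)}
    print("\n".join(summarize_fields(demo)))
```

With a real dataset, a call would look like `print_dataset_schema("file:///path/to/dataset")`, where the path is a placeholder for your own dataset location.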

Callers 5

dataset_as_rdd · Function · 0.90
copy_dataset · Function · 0.90
reader_throughput · Function · 0.90

Calls 2

get_schema · Function · 0.85
