hub / github.com/uber/petastorm / get_schema_from_dataset_url

Function get_schema_from_dataset_url

petastorm/etl/dataset_metadata.py:388–407

Returns a :class:`petastorm.unischema.Unischema` object loaded from a dataset specified by a URL. ``dataset_url_or_urls`` is either a URL to a Parquet directory or a list of URLs (all with the same scheme) to Parquet files.

(dataset_url_or_urls, hdfs_driver='libhdfs3', storage_options=None, filesystem=None)
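The first parameter accepts either one URL or a list of URLs that share a scheme. The following is a minimal sketch of that precondition using only the standard library. The helper name `common_scheme` is hypothetical and not part of petastorm, and petastorm's own validation may differ.

```python
from urllib.parse import urlparse


def common_scheme(dataset_url_or_urls):
    """Return the single URL scheme shared by the input, or raise ValueError.

    Hypothetical helper for illustration only; not part of petastorm.
    """
    # Accept either one URL string or a list of URL strings.
    if isinstance(dataset_url_or_urls, str):
        urls = [dataset_url_or_urls]
    else:
        urls = list(dataset_url_or_urls)
    if not urls:
        raise ValueError("expected at least one URL")

    # Collect the distinct schemes, e.g. {'file'} or {'hdfs', 's3'}.
    schemes = {urlparse(url).scheme for url in urls}
    if len(schemes) != 1:
        raise ValueError("URLs must share one scheme, got: %s" % sorted(schemes))

    scheme = schemes.pop()
    if not scheme:
        raise ValueError("URL has no scheme (for example 'file://' or 'hdfs://')")
    return scheme
```

Running such a check before the call lets a mixed list such as `['file:///a', 's3://bucket/b']` fail early with a readable message.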


def get_schema_from_dataset_url(dataset_url_or_urls, hdfs_driver='libhdfs3', storage_options=None, filesystem=None):
    """Returns a :class:`petastorm.unischema.Unischema` object loaded from a dataset specified by a url.

    :param dataset_url_or_urls: a url to a parquet directory or a url list (with the same scheme) to parquet files.
    :param hdfs_driver: A string denoting the hdfs driver to use (if using a dataset on hdfs). Current choices are
        libhdfs (java through JNI) or libhdfs3 (C++)
    :param storage_options: Dict of kwargs forwarded to ``fsspec`` to initialize the filesystem.
    :param filesystem: the ``pyarrow.FileSystem`` to use.
    :return: A :class:`petastorm.unischema.Unischema` object
    """
    fs, path_or_paths = get_filesystem_and_path_or_paths(dataset_url_or_urls, hdfs_driver,
                                                         storage_options=storage_options,
                                                         filesystem=filesystem)

    dataset = pq.ParquetDataset(path_or_paths, filesystem=fs, validate_schema=False, metadata_nthreads=10)

    # Get a unischema stored in the dataset metadata.
    stored_schema = get_schema(dataset)

    return stored_schema
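A short usage sketch may help show what callers typically do with the returned object. It assumes petastorm is installed and that the URL points to a dataset written with petastorm metadata. The helper names `describe_fields` and `print_dataset_schema` and the example path are illustrative assumptions, not part of the library.

```python
def describe_fields(schema):
    """Return a (name, numpy_dtype, shape) tuple for each field of a schema.

    Works on any object exposing a Unischema-style ``fields`` mapping of
    field name to an entry with ``numpy_dtype`` and ``shape`` attributes.
    """
    return [(name, field.numpy_dtype, field.shape)
            for name, field in schema.fields.items()]


def print_dataset_schema(dataset_url):
    """Load the stored Unischema for ``dataset_url`` and print its fields."""
    # Imported lazily so this sketch can be loaded without petastorm installed.
    from petastorm.etl.dataset_metadata import get_schema_from_dataset_url

    schema = get_schema_from_dataset_url(dataset_url)
    for name, dtype, shape in describe_fields(schema):
        print(name, dtype, shape)


# Example call; the path is a placeholder for a real petastorm dataset:
# print_dataset_schema("file:///tmp/hello_world_dataset")
```

Keeping the field summary separate from the loading step means it can be exercised with a stand-in schema object, without reading any Parquet files.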

Callers 5

dataset_as_rdd · Function · 0.90

copy_dataset · Function · 0.90

reader_throughput · Function · 0.90

test_get_schema_from_dataset_url · Function · 0.90

test_get_schema_from_dataset_url_bogus_url · Function · 0.90

Calls 2

get_filesystem_and_path_or_paths · Function · 0.90

get_schema · Function · 0.85

Tested by 2

test_get_schema_from_dataset_url · Function · 0.72

test_get_schema_from_dataset_url_bogus_url · Function · 0.72
