hub / github.com/uber/petastorm / FilesystemResolver

Class FilesystemResolver

petastorm/fs_utils.py:41–176

Resolves a dataset URL, makes a connection via pyarrow, and provides a filesystem object.
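
As a rough illustration of the URL pieces the resolver works with (`scheme://hostname:port/path`), the sketch below splits a few dataset URLs using the standard library. It assumes `urllib.parse.urlparse` behaves like the `urlparse` the class imports; the helper name `split_dataset_url` and the example URLs are hypothetical and are not part of petastorm.

```python
from urllib.parse import urlparse


def split_dataset_url(dataset_url):
    """Hypothetical helper: return (scheme, hostname, port, path) for a dataset URL."""
    parsed = urlparse(dataset_url)
    return parsed.scheme, parsed.hostname, parsed.port, parsed.path


if __name__ == "__main__":
    # Made-up URLs, used only to show how the components come apart.
    for url in ("file:///tmp/dataset",
                "hdfs://namenode:8020/user/data",
                "/tmp/dataset"):
        print(url, "->", split_dataset_url(url))
```

A path without a scheme, such as `/tmp/dataset`, parses to an empty `scheme`, which is the condition the constructor rejects.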


class FilesystemResolver(object):
    """Resolves a dataset URL, makes a connection via pyarrow, and provides a filesystem object."""

    def __init__(self, dataset_url, hadoop_configuration=None, connector=HdfsConnector,
                 hdfs_driver='libhdfs3', user=None, storage_options=None):
        """
        Given a dataset URL and an optional hadoop configuration, parse and interpret the URL to
        instantiate a pyarrow filesystem.

        Interpretation of the URL ``scheme://hostname:port/path`` occurs in the following order:

        1. If there is no ``scheme``, raise an exception: scheme-less URLs are no longer supported.
        2. If ``scheme`` is ``file``, use the local filesystem path.
        3. If ``scheme`` is ``hdfs``:
           a. Try the ``hostname`` as a namespace and attempt to connect to a name node.
              1. If that doesn't work, try connecting directly to namenode ``hostname:port``.
           b. If no host, connect to the default name node.
        4. If ``scheme`` is ``s3``, use s3fs. The user must manually install s3fs before using S3.
        5. If ``scheme`` is ``gs`` or ``gcs``, use gcsfs. The user must manually install gcsfs before using GCS.
        6. Fail otherwise.

        :param dataset_url: The hdfs URL or absolute path to the dataset
        :param hadoop_configuration: an optional hadoop configuration
        :param connector: the HDFS connector object to use (ONLY override for testing purposes)
        :param hdfs_driver: A string denoting the hdfs driver to use (if using a dataset on hdfs). Current choices are
            libhdfs (java through JNI) or libhdfs3 (C++)
        :param user: String denoting username when connecting to HDFS. None implies login user.
        :param storage_options: Dict of kwargs forwarded to ``fsspec`` to initialize the filesystem.
        """
        # Cache both the original URL and the resolved, urlparsed dataset_url
        self._dataset_url = dataset_url
        self._parsed_dataset_url = None
        # Cache the instantiated filesystem object
        self._filesystem = None

        if isinstance(self._dataset_url, six.string_types):
            self._parsed_dataset_url = urlparse(self._dataset_url)
        else:
            self._parsed_dataset_url = self._dataset_url

        if not self._parsed_dataset_url.scheme:
            # Case 1
            # Report the offending URL: the scheme is empty in this branch, so formatting it would print nothing.
            raise ValueError('ERROR! A scheme-less dataset url ({}) is no longer supported. '
                             'Please prepend "file://" for local filesystem.'.format(self._dataset_url))

        elif self._parsed_dataset_url.scheme == 'file':
            # Case 2: definitely local
            self._filesystem = pyarrow.localfs
            self._filesystem_factory = lambda: pyarrow.localfs

        elif self._parsed_dataset_url.scheme == 'hdfs':

            if hdfs_driver == 'libhdfs3':
                # libhdfs3 does not do any namenode resolution itself so we do it manually. This is not necessary
                # if using libhdfs

                # Obtain singleton and force hadoop config evaluation
                namenode_resolver = HdfsNamenodeResolver(hadoop_configuration)
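
The resolution order listed in the constructor docstring can be sketched as a plain dispatch on the parsed scheme. This is a minimal illustration under stated assumptions, not petastorm's implementation: it opens no connections, the function name `resolution_case` and its return labels are hypothetical, and raising `ValueError` for the final "fail otherwise" case is an assumption, since that branch is not shown in the excerpt above.

```python
from urllib.parse import urlparse


def resolution_case(dataset_url):
    """Hypothetical helper: name the docstring case a dataset URL would fall into."""
    # Accept either a URL string or an already-parsed URL, as the constructor does.
    parsed = urlparse(dataset_url) if isinstance(dataset_url, str) else dataset_url

    if not parsed.scheme:
        # Case 1: scheme-less URLs are rejected.
        raise ValueError('A scheme-less dataset url ({}) is not supported; '
                         'prepend "file://" for the local filesystem.'.format(dataset_url))
    if parsed.scheme == 'file':
        # Case 2: local filesystem.
        return 'local'
    if parsed.scheme == 'hdfs':
        # Case 3: with a host, try it as a nameservice and then as a namenode;
        # without a host, fall back to the default name node.
        return 'hdfs-host' if parsed.netloc else 'hdfs-default-namenode'
    if parsed.scheme == 's3':
        return 's3fs'
    if parsed.scheme in ('gs', 'gcs'):
        return 'gcsfs'
    # Final case: any other scheme fails (exception type assumed here).
    raise ValueError('Unsupported scheme in dataset url: {}'.format(dataset_url))
```

For example, `resolution_case('hdfs:///user/data')` has no host component and therefore maps to the default name node case, while `resolution_case('/user/data')` raises because the scheme is empty.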

Callers 15

dataset_as_rdd · Function · 0.90
copy_dataset · Function · 0.90
materialize_dataset · Function · 0.90
metadata_util.py · File · 0.90
build_rowgroup_index · Function · 0.90
_index_columns · Function · 0.90
generate_petastorm_metadata · Function · 0.90
_default_delete_dir_handler · Function · 0.90
test_error_url_cases · Method · 0.90
test_file_url · Method · 0.90
test_hdfs_url_with_nameservice · Method · 0.90
test_hdfs_url_no_nameservice · Method · 0.90

Calls

no outgoing calls

Tested by 10

test_error_url_cases · Method · 0.72
test_file_url · Method · 0.72
test_hdfs_url_with_nameservice · Method · 0.72
test_hdfs_url_no_nameservice · Method · 0.72
test_hdfs_url_direct_namenode · Method · 0.72
test_hdfs_url_direct_namenode_driver_libhdfs · Method · 0.72
test_hdfs_url_direct_namenode_retries · Method · 0.72
test_s3_url · Method · 0.72
test_gcs_url · Method · 0.72
test_atexit · Function · 0.72
