hub / github.com/uber/petastorm / FilesystemResolver

Class FilesystemResolver

petastorm/fs_utils.py:41–176

Resolves a dataset URL, makes a connection via pyarrow, and provides a filesystem object.

Source from the content-addressed store, hash-verified

class FilesystemResolver(object):
    """Resolves a dataset URL, makes a connection via pyarrow, and provides a filesystem object."""

    def __init__(self, dataset_url, hadoop_configuration=None, connector=HdfsConnector,
                 hdfs_driver='libhdfs3', user=None, storage_options=None):
        """
        Given a dataset URL and an optional hadoop configuration, parse and interpret the URL to
        instantiate a pyarrow filesystem.

        Interpretation of the URL ``scheme://hostname:port/path`` occurs in the following order:

        1. If no ``scheme``, no longer supported, so raise an exception!
        2. If ``scheme`` is ``file``, use local filesystem path.
        3. If ``scheme`` is ``hdfs``:
           a. Try the ``hostname`` as a namespace and attempt to connect to a name node.
              1. If that doesn't work, try connecting directly to namenode ``hostname:port``.
           b. If no host, connect to the default name node.
        5. If ``scheme`` is ``s3``, use s3fs. The user must manually install s3fs before using s3
        6. If ``scheme`` is ``gs`` or ``gcs``, use gcsfs. The user must manually install gcsfs before using GCS
        7. Fail otherwise.

        :param dataset_url: The hdfs URL or absolute path to the dataset
        :param hadoop_configuration: an optional hadoop configuration
        :param connector: the HDFS connector object to use (ONLY override for testing purposes)
        :param hdfs_driver: A string denoting the hdfs driver to use (if using a dataset on hdfs). Current choices are
            libhdfs (java through JNI) or libhdfs3 (C++)
        :param user: String denoting username when connecting to HDFS. None implies login user.
        :param storage_options: Dict of kwargs forwarded to ``fsspec`` to initialize the filesystem.
        """
        # Cache both the original URL and the resolved, urlparsed dataset_url
        self._dataset_url = dataset_url
        self._parsed_dataset_url = None
        # Cache the instantiated filesystem object
        self._filesystem = None

        if isinstance(self._dataset_url, six.string_types):
            self._parsed_dataset_url = urlparse(self._dataset_url)
        else:
            self._parsed_dataset_url = self._dataset_url

        if not self._parsed_dataset_url.scheme:
            # Case 1
            raise ValueError('ERROR! A scheme-less dataset url ({}) is no longer supported. '
                             'Please prepend "file://" for local filesystem.'.format(self._parsed_dataset_url.scheme))

        elif self._parsed_dataset_url.scheme == 'file':
            # Case 2: definitely local
            self._filesystem = pyarrow.localfs
            self._filesystem_factory = lambda: pyarrow.localfs

        elif self._parsed_dataset_url.scheme == 'hdfs':

            if hdfs_driver == 'libhdfs3':
                # libhdfs3 does not do any namenode resolution itself so we do it manually. This is not necessary
                # if using libhdfs

                # Obtain singleton and force hadoop config evaluation
                namenode_resolver = HdfsNamenodeResolver(hadoop_configuration)
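
The resolution order described in the docstring above can be sketched with the standard library alone. This is a minimal illustration, not petastorm code: `classify_dataset_url` and the labels it returns are hypothetical names introduced here, no filesystem is actually opened, and the `hdfs_driver` and namenode fallback logic from the listing is reduced to a check on whether the URL carries a host.

```python
from urllib.parse import urlparse


def classify_dataset_url(dataset_url):
    """Hypothetical helper (not part of petastorm): name the branch a URL would take."""
    parsed = urlparse(dataset_url)

    if not parsed.scheme:
        # Case 1: scheme-less URLs are rejected
        raise ValueError('A scheme-less dataset url ({}) is not supported. '
                         'Prepend "file://" for the local filesystem.'.format(dataset_url))
    if parsed.scheme == 'file':
        # Case 2: local filesystem
        return 'local'
    if parsed.scheme == 'hdfs':
        # Case 3: a host means "resolve or connect to that name node";
        # no host means "use the default name node"
        return 'hdfs-host' if parsed.netloc else 'hdfs-default'
    if parsed.scheme == 's3':
        # Case 5: handled through s3fs in the real class
        return 's3fs'
    if parsed.scheme in ('gs', 'gcs'):
        # Case 6: handled through gcsfs in the real class
        return 'gcsfs'
    # Case 7: anything else fails
    raise ValueError('Unsupported scheme in dataset url {}'.format(dataset_url))


if __name__ == '__main__':
    for url in ('file:///tmp/data', 'hdfs://nameservice1/user/data',
                'hdfs:///user/data', 's3://bucket/data', 'gs://bucket/data'):
        print(url, '->', classify_dataset_url(url))
```

Note that the sketch formats the original URL into the error message. The listing formats `self._parsed_dataset_url.scheme` instead, which is empty whenever that branch is reached.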

Callers 15

dataset_as_rdd · Function · 0.90
copy_dataset · Function · 0.90
materialize_dataset · Function · 0.90
metadata_util.py · File · 0.90
build_rowgroup_index · Function · 0.90
_index_columns · Function · 0.90
test_error_url_cases · Method · 0.90
test_file_url · Method · 0.90

Calls

no outgoing calls
