class FilesystemResolver(object):
    """Resolves a dataset URL, makes a connection via pyarrow, and provides a filesystem object."""

    def __init__(self, dataset_url, hadoop_configuration=None, connector=HdfsConnector,
                 hdfs_driver='libhdfs3', user=None, storage_options=None):
        """
        Given a dataset URL and an optional hadoop configuration, parse and interpret the URL to
        instantiate a pyarrow filesystem.

        Interpretation of the URL ``scheme://hostname:port/path`` occurs in the following order:

        1. If there is no ``scheme``, raise an exception: scheme-less URLs are no longer supported.
        2. If ``scheme`` is ``file``, use local filesystem path.
        3. If ``scheme`` is ``hdfs``:
           a. Try the ``hostname`` as a namespace and attempt to connect to a name node.
              1. If that doesn't work, try connecting directly to namenode ``hostname:port``.
           b. If no host, connect to the default name node.
        5. If ``scheme`` is ``s3``, use s3fs. The user must manually install s3fs before using s3.
        6. If ``scheme`` is ``gs`` or ``gcs``, use gcsfs. The user must manually install gcsfs before using GCS.
        7. Fail otherwise.

        :param dataset_url: The dataset URL, either a string or an already-parsed URL. It must include a
            scheme (for example ``file://`` for a local path)
        :param hadoop_configuration: an optional hadoop configuration
        :param connector: the HDFS connector object to use (ONLY override for testing purposes)
        :param hdfs_driver: A string denoting the hdfs driver to use (if using a dataset on hdfs). Current choices are
            libhdfs (java through JNI) or libhdfs3 (C++)
        :param user: String denoting username when connecting to HDFS. None implies login user.
        :param storage_options: Dict of kwargs forwarded to ``fsspec`` to initialize the filesystem.
        """
        # Cache both the original URL and the resolved, urlparsed dataset_url
        self._dataset_url = dataset_url
        self._parsed_dataset_url = None
        # Cache the instantiated filesystem object
        self._filesystem = None

        if isinstance(self._dataset_url, six.string_types):
            self._parsed_dataset_url = urlparse(self._dataset_url)
        else:
            self._parsed_dataset_url = self._dataset_url

        if not self._parsed_dataset_url.scheme:
            # Case 1
            # Report the URL as given: the scheme is empty in this branch, so formatting it prints nothing
            raise ValueError('ERROR! A scheme-less dataset url ({}) is no longer supported. '
                             'Please prepend "file://" for local filesystem.'.format(self._dataset_url))

        elif self._parsed_dataset_url.scheme == 'file':
            # Case 2: definitely local
            self._filesystem = pyarrow.localfs
            self._filesystem_factory = lambda: pyarrow.localfs

        elif self._parsed_dataset_url.scheme == 'hdfs':

            if hdfs_driver == 'libhdfs3':
                # libhdfs3 does not do any namenode resolution itself so we do it manually. This is not necessary
                # if using libhdfs

                # Obtain singleton and force hadoop config evaluation
                namenode_resolver = HdfsNamenodeResolver(hadoop_configuration)
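The resolution order described in the docstring can be illustrated without pyarrow or any remote filesystem. The following is a minimal sketch, not the resolver's actual implementation: `classify_dataset_url` and the labels it returns are hypothetical names introduced here, and the function only reports which branch a URL would fall into instead of opening a connection.

```python
from urllib.parse import urlparse


def classify_dataset_url(dataset_url):
    """Hypothetical helper: name the branch a dataset URL would take.

    Accepts either a URL string or an already-parsed URL, mirroring the
    string check in the constructor above.
    """
    parsed = urlparse(dataset_url) if isinstance(dataset_url, str) else dataset_url

    if not parsed.scheme:
        # Case 1: scheme-less URLs are rejected; report the URL as given
        raise ValueError('A scheme-less dataset url ({}) is not supported. '
                         'Please prepend "file://" for local filesystem.'.format(dataset_url))
    if parsed.scheme == 'file':
        # Case 2: local filesystem
        return 'local'
    if parsed.scheme == 'hdfs':
        # Case 3: with a host, try it as a nameservice or namenode;
        # without one, fall back to the default name node
        return 'hdfs-namenode' if parsed.netloc else 'hdfs-default'
    if parsed.scheme == 's3':
        return 's3fs'
    if parsed.scheme in ('gs', 'gcs'):
        return 'gcsfs'
    # Anything else is unsupported
    raise ValueError('Unsupported scheme in dataset url {}'.format(dataset_url))
```

With this sketch, `classify_dataset_url('hdfs:///data')` returns `'hdfs-default'`, `classify_dataset_url('hdfs://nn:8020/data')` returns `'hdfs-namenode'`, and a bare path such as `'/tmp/ds'` raises `ValueError`.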