Creates an instance of Reader for reading Petastorm datasets. A Petastorm dataset is a dataset generated using the :func:`~petastorm.etl.dataset_metadata.materialize_dataset` context manager, as explained `here <https://petastorm.readthedocs.io/en/latest/readme_include.html#generating-a-dataset>`_.
make_reader(dataset_url,
schema_fields=None,
reader_pool_type='thread', workers_count=10, pyarrow_serialize=False, results_queue_size=50,
seed=None, shuffle_rows=False,
shuffle_row_groups=True, shuffle_row_drop_partitions=1,
predicate=None,
rowgroup_selector=None,
num_epochs=1,
cur_shard=None, shard_count=None, shard_seed=None,
cache_type=NULL_CACHE, cache_location=None, cache_size_limit=None,
cache_row_size_estimate=None, cache_extra_settings=None,
hdfs_driver='libhdfs3',
transform_spec=None,
filters=None,
storage_options=None,
zmq_copy_buffers=True,
filesystem=None,
convert_early_to_numpy=False)
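The `cur_shard` and `shard_count` arguments in the signature above split a dataset between several readers. As a rough illustration of the idea only, the sketch below assigns row-group indexes to shards round-robin. The helper name `select_shard_row_groups` and the round-robin rule are assumptions made for this example and are not taken from Petastorm's implementation.

```python
def select_shard_row_groups(row_group_count, cur_shard, shard_count):
    """Return the row-group indexes one shard would read (hypothetical helper)."""
    # Both arguments must be given together, or both left as None.
    if (cur_shard is None) != (shard_count is None):
        raise ValueError("cur_shard and shard_count must be supplied together")
    if shard_count is None:
        # No sharding requested: a single reader sees every row group.
        return list(range(row_group_count))
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in the range [0, shard_count)")
    # Assumed rule for illustration: every shard_count-th row group.
    return [i for i in range(row_group_count) if i % shard_count == cur_shard]


# Three readers, each passing a unique cur_shard, cover 10 row groups.
shards = [select_shard_row_groups(10, shard, 3) for shard in range(3)]
```

Under this assumed rule the shards are disjoint and together cover every row group, which is the property a sharded training job relies on.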
def make_reader(dataset_url,
                schema_fields=None,
                reader_pool_type='thread', workers_count=10, pyarrow_serialize=False, results_queue_size=50,
                seed=None, shuffle_rows=False,
                shuffle_row_groups=True, shuffle_row_drop_partitions=1,
                predicate=None,
                rowgroup_selector=None,
                num_epochs=1,
                cur_shard=None, shard_count=None, shard_seed=None,
                cache_type=NULL_CACHE, cache_location=None, cache_size_limit=None,
                cache_row_size_estimate=None, cache_extra_settings=None,
                hdfs_driver='libhdfs3',
                transform_spec=None,
                filters=None,
                storage_options=None,
                zmq_copy_buffers=True,
                filesystem=None,
                convert_early_to_numpy=False):
    """
    Creates an instance of Reader for reading Petastorm datasets. A Petastorm dataset is a dataset generated using
    the :func:`~petastorm.etl.dataset_metadata.materialize_dataset` context manager, as explained
    `here <https://petastorm.readthedocs.io/en/latest/readme_include.html#generating-a-dataset>`_.

    See :func:`~petastorm.make_batch_reader` to read from a Parquet store that was not generated using
    :func:`~petastorm.etl.dataset_metadata.materialize_dataset`.

    :param dataset_url: a URL to a Parquet directory, or a list of URLs (with the same scheme) to Parquet files,
        e.g. ``'hdfs://some_hdfs_cluster/user/yevgeni/parquet8'``, or ``'file:///tmp/mydataset'``,
        or ``'s3://bucket/mydataset'``, or ``'gs://bucket/mydataset'``,
        or ``['file:///tmp/mydataset/00000.parquet', 'file:///tmp/mydataset/00001.parquet']``.
    :param schema_fields: Can be: a list of unischema fields and/or regex pattern strings; ``None`` to read all fields;
        or an NGram object, in which case the reader returns an NGram of the specified fields.
    :param reader_pool_type: A string denoting the reader pool type. Should be one of ['thread', 'process', 'dummy'],
        denoting a thread pool, a process pool, or running everything in the main thread. Defaults to 'thread'.
    :param workers_count: An int for the number of workers to use in the reader pool. This is only used for the
        thread or process pool. Defaults to 10.
    :param pyarrow_serialize: THE ARGUMENT IS DEPRECATED AND WILL BE REMOVED IN FUTURE VERSIONS.
    :param results_queue_size: Size of the results queue to store prefetched row-groups. Currently only applicable to
        the thread reader pool type.
    :param seed: Random seed specified for shuffle and sharding with reproducible outputs. Defaults to None.
    :param shuffle_rows: Whether to shuffle inside a single row group. Defaults to False.
    :param shuffle_row_groups: Whether to shuffle row groups (the order in which full row groups are read).
    :param shuffle_row_drop_partitions: A positive integer which determines how many partitions to
        break up a row group into for increased shuffling in exchange for worse performance (extra reads).
        For example, if you specify 2, each row-group read drops half of the rows in that row group, and the
        remaining rows are read in a separate read. It is recommended to keep this number below the regular row
        group size in order to not waste reads which drop all rows.
    :param predicate: an instance of :class:`.PredicateBase` used to filter the rows returned by the reader. The
        predicate is passed a single row and must return a boolean value indicating whether to include it in the
        results.
    :param rowgroup_selector: an instance of a row group selector object used to select the row groups to be read.
    :param num_epochs: An epoch is a single pass over all rows in the dataset. Setting ``num_epochs`` to
        ``None`` will result in an infinite number of epochs.
    :param cur_shard: An int denoting the current shard number. Each node reading a shard should
        pass in a unique shard number in the range [0, shard_count). shard_count must be supplied as well.
        Defaults to None.
    :param shard_count: An int denoting the number of shards to break this dataset into. Defaults to None.
    :param shard_seed: (Deprecated) Random seed used for sharding row groups. Defaults to None.
    :param cache_type: A string denoting the cache type, if desired. Options are [None, 'null', 'local-disk'] to
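To make the `shuffle_row_drop_partitions` description above more concrete, here is a minimal sketch that treats a row group as an in-memory list and splits it into contiguous partitions. The helper `read_partition` and the contiguous split are assumptions for this example, not Petastorm's actual implementation.

```python
def read_partition(row_group, partition, num_partitions):
    """Keep the rows of one partition and drop the rest (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be a positive integer")
    if not 0 <= partition < num_partitions:
        raise ValueError("partition must be in the range [0, num_partitions)")
    row_count = len(row_group)
    # Contiguous, near-equal slices; integer arithmetic avoids rounding drift.
    start = partition * row_count // num_partitions
    stop = (partition + 1) * row_count // num_partitions
    return row_group[start:stop]


# With 2 partitions, each read keeps half of a 6-row group.
row_group = list(range(6))
reads = [read_partition(row_group, p, 2) for p in range(2)]

# With more partitions than rows, some reads keep nothing (wasted reads).
wasted = [read_partition([0, 1], p, 4) for p in range(4)]
```

The second case shows why the docstring recommends keeping the partition count below the row-group size: any read whose slice is empty costs I/O but yields no rows.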