
Function make_reader

petastorm/reader.py:60–206  ·  github.com/uber/petastorm

Creates an instance of Reader for reading Petastorm datasets. A Petastorm dataset is a dataset generated using the :func:`~petastorm.etl.dataset_metadata.materialize_dataset` context manager as explained `here <https://petastorm.readthedocs.io/en/latest/readme_include.html#generating-a-dataset>`_.
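A minimal usage sketch may help here. It assumes petastorm is installed and that the URL points at a dataset written with `materialize_dataset`; the helper name `read_first_rows` and its `limit` argument are hypothetical and not part of the library.

```python
from itertools import islice


def read_first_rows(dataset_url, limit=3):
    """Return up to `limit` rows from a Petastorm dataset (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded where petastorm is not installed.
    from petastorm import make_reader

    # The reader is used as a context manager so its workers are shut down on exit.
    with make_reader(dataset_url, num_epochs=1) as reader:
        # Iterating the reader yields one row at a time.
        return list(islice(reader, limit))
```

Calling `read_first_rows('file:///tmp/mydataset')` would read from the example URL given in the docstring, provided such a dataset exists locally.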

make_reader(dataset_url,
            schema_fields=None,
            reader_pool_type='thread', workers_count=10, pyarrow_serialize=False, results_queue_size=50,
            seed=None, shuffle_rows=False,
            shuffle_row_groups=True, shuffle_row_drop_partitions=1,
            predicate=None,
            rowgroup_selector=None,
            num_epochs=1,
            cur_shard=None, shard_count=None, shard_seed=None,
            cache_type=NULL_CACHE, cache_location=None, cache_size_limit=None,
            cache_row_size_estimate=None, cache_extra_settings=None,
            hdfs_driver='libhdfs3',
            transform_spec=None,
            filters=None,
            storage_options=None,
            zmq_copy_buffers=True,
            filesystem=None,
            convert_early_to_numpy=False)
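The `shuffle_row_drop_partitions` parameter in the signature splits each row group into several partitions, and each read keeps only one of them. The sketch below illustrates that idea in plain Python. The contiguous split and the function name `rows_kept_in_partition` are assumptions made for illustration and may not match how petastorm actually selects rows.

```python
def rows_kept_in_partition(num_rows, this_partition, num_partitions):
    """Return the row indices one read would keep (illustrative, not petastorm's code)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be a positive integer")
    if not 0 <= this_partition < num_partitions:
        raise ValueError("this_partition must be in [0, num_partitions)")
    # Assumption: rows are split into contiguous, near-equal chunks.
    start = this_partition * num_rows // num_partitions
    end = (this_partition + 1) * num_rows // num_partitions
    return list(range(start, end))


# With 2 partitions, each read of a 10-row group keeps half of the rows.
first_half = rows_kept_in_partition(10, 0, 2)
second_half = rows_kept_in_partition(10, 1, 2)

# With more partitions than rows, some reads keep nothing and are wasted.
wasted_read = rows_kept_in_partition(2, 0, 4)
```

The last call shows why the partition count is best kept below the row group size: a 2-row group split 4 ways produces reads that drop every row.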

Source from the content-addressed store, hash-verified

def make_reader(dataset_url,
                schema_fields=None,
                reader_pool_type='thread', workers_count=10, pyarrow_serialize=False, results_queue_size=50,
                seed=None, shuffle_rows=False,
                shuffle_row_groups=True, shuffle_row_drop_partitions=1,
                predicate=None,
                rowgroup_selector=None,
                num_epochs=1,
                cur_shard=None, shard_count=None, shard_seed=None,
                cache_type=NULL_CACHE, cache_location=None, cache_size_limit=None,
                cache_row_size_estimate=None, cache_extra_settings=None,
                hdfs_driver='libhdfs3',
                transform_spec=None,
                filters=None,
                storage_options=None,
                zmq_copy_buffers=True,
                filesystem=None,
                convert_early_to_numpy=False):
    """
    Creates an instance of Reader for reading Petastorm datasets. A Petastorm dataset is a dataset generated using
    :func:`~petastorm.etl.dataset_metadata.materialize_dataset` context manager as explained
    `here <https://petastorm.readthedocs.io/en/latest/readme_include.html#generating-a-dataset>`_.

    See :func:`~petastorm.make_batch_reader` to read from a Parquet store that was not generated using
    :func:`~petastorm.etl.dataset_metadata.materialize_dataset`.

    :param dataset_url: a url to a parquet directory or a url list (with the same scheme) to parquet files.
        e.g. ``'hdfs://some_hdfs_cluster/user/yevgeni/parquet8'``, or ``'file:///tmp/mydataset'``,
        or ``'s3://bucket/mydataset'``, or ``'gs://bucket/mydataset'``,
        or ``[file:///tmp/mydataset/00000.parquet, file:///tmp/mydataset/00001.parquet]``.
    :param schema_fields: Can be: a list of unischema fields and/or regex pattern strings; ``None`` to read all fields;
        an NGram object, then it will return an NGram of the specified fields.
    :param reader_pool_type: A string denoting the reader pool type. Should be one of ['thread', 'process', 'dummy']
        denoting a thread pool, process pool, or running everything in the master thread. Defaults to 'thread'
    :param workers_count: An int for the number of workers to use in the reader pool. This is only used for the
        thread or process pool. Defaults to 10
    :param pyarrow_serialize: THE ARGUMENT IS DEPRECATED AND WILL BE REMOVED IN FUTURE VERSIONS.
    :param results_queue_size: Size of the results queue to store prefetched row-groups. Currently only applicable to
        thread reader pool type.
    :param seed: Random seed specified for shuffle and sharding with reproducible outputs. Defaults to None
    :param shuffle_rows: Whether to shuffle inside a single row group. Defaults to False.
    :param shuffle_row_groups: Whether to shuffle row groups (the order in which full row groups are read)
    :param shuffle_row_drop_partitions: This is a positive integer which determines how many partitions to
        break up a row group into for increased shuffling in exchange for worse performance (extra reads).
        For example if you specify 2 each row group read will drop half of the rows within every row group and
        read the remaining rows in separate reads. It is recommended to keep this number below the regular row
        group size in order to not waste reads which drop all rows.
    :param predicate: instance of :class:`.PredicateBase` object to filter rows to be returned by reader. The predicate
        will be passed a single row and must return a boolean value indicating whether to include it in the results.
    :param rowgroup_selector: instance of row group selector object to select row groups to be read
    :param num_epochs: An epoch is a single pass over all rows in the dataset. Setting ``num_epochs`` to
        ``None`` will result in an infinite number of epochs.
    :param cur_shard: An int denoting the current shard number. Each node reading a shard should
        pass in a unique shard number in the range [0, shard_count). shard_count must be supplied as well.
        Defaults to None
    :param shard_count: An int denoting the number of shards to break this dataset into. Defaults to None
    :param shard_seed: (Deprecated) Random seed used for sharding row groups. Defaults to None
    :param cache_type: A string denoting the cache type, if desired. Options are [None, 'null', 'local-disk'] to
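The `cur_shard` and `shard_count` parameters described in the docstring above must be supplied together, with `cur_shard` in the range [0, shard_count). The following sketch shows one way such a contract could be checked and how row groups might be divided among shards. The round-robin assignment and the name `row_groups_for_shard` are assumptions for illustration, not a statement of petastorm's actual sharding logic.

```python
def row_groups_for_shard(num_row_groups, cur_shard=None, shard_count=None):
    """Return the row group indices one shard would read (illustrative only)."""
    # Both values must be given together, or both left as None.
    if (cur_shard is None) != (shard_count is None):
        raise ValueError("cur_shard and shard_count must be supplied together")
    if shard_count is None:
        # No sharding requested: this reader sees every row group.
        return list(range(num_row_groups))
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in the range [0, shard_count)")
    # Assumption: row groups are dealt out round-robin by index.
    return [i for i in range(num_row_groups) if i % shard_count == cur_shard]


# Three nodes, each passing a unique cur_shard, together cover all 7 row groups.
shards = [row_groups_for_shard(7, cur_shard=s, shard_count=3) for s in range(3)]
```

Under this assumed scheme the shards are disjoint and their union is the full set of row groups, which is the property each node relies on when it passes a unique shard number.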

Callers 15

train_and_test · Function · 0.90
main · Function · 0.90
test_read_mnist_dataset · Function · 0.90
tensorflow_hello_world · Function · 0.90
python_hello_world · Function · 0.90
pytorch_hello_world · Function · 0.90
test_generate · Function · 0.90
reader_throughput · Function · 0.90

Calls 9

NullCache · Class · 0.90
LocalDiskCache · Class · 0.90
ThreadPool · Class · 0.90
PickleSerializer · Class · 0.90
ProcessPool · Class · 0.90
DummyPool · Class · 0.90
Reader · Class · 0.85
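The pool classes in the list above correspond to the three documented values of `reader_pool_type`. The sketch below shows how such a dispatch could look. It returns plain tuples instead of constructing real pools, and the function name `select_pool` and the exact error message are hypothetical, so treat it as an illustration of the mapping, not as the code in petastorm/reader.py.

```python
def select_pool(reader_pool_type="thread", workers_count=10):
    """Map a pool type string to (pool class name, worker count); a stand-in only."""
    if reader_pool_type == "thread":
        return ("ThreadPool", workers_count)
    if reader_pool_type == "process":
        return ("ProcessPool", workers_count)
    if reader_pool_type == "dummy":
        # The dummy pool runs everything in the calling thread,
        # so workers_count is not used.
        return ("DummyPool", None)
    raise ValueError(f"Unknown reader_pool_type: {reader_pool_type!r}")


default_choice = select_pool()
dummy_choice = select_pool("dummy", workers_count=4)
```

Rejecting unknown strings early keeps a misspelled pool type from silently falling back to a default.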
