Function read_bytes

dask/bytes/core.py:14–187

Given a path or paths, return delayed objects that read from those paths. The path may be a filename like ``'2015-01-01.csv'`` or a globstring like ``'2015-*-*.csv'``. The path may be preceded by a protocol, like ``s3://`` or ``hdfs://``, if those libraries are installed. This cleanly breaks data by a delimiter if given, so that block boundaries start directly after a delimiter and end on the delimiter.

(
    urlpath,
    delimiter=None,
    not_zero=False,
    blocksize="128 MiB",
    sample="10 kiB",
    compression=None,
    include_path=False,
    **kwargs,
)
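
The defaults for `blocksize` and `sample` in the signature above are human-readable size strings rather than integers. The sketch below shows one way such strings could be turned into byte counts. It is an illustration only: `size_to_bytes` and `_UNITS` are hypothetical names, the unit table is a small assumed subset, and this is not the `parse_bytes` helper that appears in the Calls list.

```python
import re

# Assumed subset of units for illustration; binary units use powers of two.
_UNITS = {
    "": 1,
    "b": 1,
    "kb": 10**3,
    "mb": 10**6,
    "gb": 10**9,
    "kib": 2**10,
    "mib": 2**20,
    "gib": 2**30,
}


def size_to_bytes(value):
    """Convert an int, or a string such as "128 MiB", to a byte count."""
    if isinstance(value, int):
        # Integers are taken to already be a number of bytes.
        return value
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*", value)
    if match is None:
        raise ValueError(f"cannot interpret {value!r} as a size")
    number, unit = match.groups()
    try:
        factor = _UNITS[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown unit in {value!r}") from None
    return int(float(number) * factor)


default_blocksize = size_to_bytes("128 MiB")
default_sample = size_to_bytes("10 kiB")
```

Under these assumptions, the default block size works out to 128 × 2^20 bytes and the default sample size to 10 × 2^10 bytes.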

Source from the content-addressed store, hash-verified

def read_bytes(
    urlpath,
    delimiter=None,
    not_zero=False,
    blocksize="128 MiB",
    sample="10 kiB",
    compression=None,
    include_path=False,
    **kwargs,
):
    """Given a path or paths, return delayed objects that read from those paths.

    The path may be a filename like ``'2015-01-01.csv'`` or a globstring
    like ``'2015-*-*.csv'``.

    The path may be preceded by a protocol, like ``s3://`` or ``hdfs://`` if
    those libraries are installed.

    This cleanly breaks data by a delimiter if given, so that block boundaries
    start directly after a delimiter and end on the delimiter.

    Parameters
    ----------
    urlpath : string or list
        Absolute or relative filepath(s). Prefix with a protocol like ``s3://``
        to read from alternative filesystems. To read from multiple files you
        can pass a globstring or a list of paths, with the caveat that they
        must all have the same protocol.
    delimiter : bytes
        An optional delimiter, like ``b'\\n'`` on which to split blocks of
        bytes.
    not_zero : bool
        Force seek of start-of-file delimiter, discarding header.
    blocksize : int, str
        Chunk size in bytes, defaults to "128 MiB"
    compression : string or None
        String like 'gzip' or 'xz'.  Must support efficient random access.
    sample : int, string, or boolean
        Whether or not to return a header sample.
        Values can be ``False`` for "no sample requested"
        Or an integer or string value like ``2**20`` or ``"1 MiB"``
    include_path : bool
        Whether or not to include the path with the bytes representing a particular file.
        Default is False.
    **kwargs : dict
        Extra options that make sense to a particular storage connection, e.g.
        host, port, username, password, etc.

    Examples
    --------
    >>> sample, blocks = read_bytes('2015-*-*.csv', delimiter=b'\\n')  # doctest: +SKIP
    >>> sample, blocks = read_bytes('s3://bucket/2015-*-*.csv', delimiter=b'\\n')  # doctest: +SKIP
    >>> sample, paths, blocks = read_bytes('2015-*-*.csv', include_path=True)  # doctest: +SKIP

    Returns
    -------
    sample : bytes
        The sample header
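
The docstring says that, when a delimiter is given, block boundaries start directly after a delimiter and end on the delimiter. The following sketch illustrates that idea on an in-memory byte string. It is not the code in `dask/bytes/core.py`, which works lazily on byte ranges of files. The name `split_on_delimiter` and the strategy of extending each block forward to the next delimiter are assumptions made for the example.

```python
def split_on_delimiter(data, blocksize, delimiter):
    """Cut ``data`` into blocks of at least roughly ``blocksize`` bytes.

    Every block except possibly the last ends with ``delimiter``, so each
    block after the first starts directly after a delimiter.
    """
    if blocksize <= 0:
        raise ValueError("blocksize must be positive")
    if not delimiter:
        raise ValueError("delimiter must be non-empty")

    blocks = []
    start = 0
    while start < len(data):
        tentative_end = start + blocksize
        if tentative_end >= len(data):
            # Remaining data fits in one block.
            end = len(data)
        else:
            # Look for a delimiter that ends at or after the tentative
            # boundary, then extend the block to include it.
            search_from = max(start, tentative_end - len(delimiter))
            found = data.find(delimiter, search_from)
            end = len(data) if found == -1 else found + len(delimiter)
        blocks.append(data[start:end])
        start = end
    return blocks


data = b"aa\nbbbb\ncc\ndddddd\ne"
blocks = split_on_delimiter(data, blocksize=4, delimiter=b"\n")
```

In this sketch no record is ever split across two blocks, and concatenating the blocks reproduces the input, which is the property a line-oriented consumer such as a CSV parser relies on.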

Callers 15

test_read_bytes · Function · 0.90
test_parse_sample_bytes · Function · 0.90
test_with_urls · Function · 0.90
test_with_paths · Function · 0.90
test_read_bytes_block · Function · 0.90

Calls 6

parse_bytes · Function · 0.90
is_integer · Function · 0.90
delayed · Function · 0.90
info · Method · 0.80
split · Method · 0.80
tokenize · Function · 0.50

Tested by 15

test_read_bytes · Function · 0.72
test_parse_sample_bytes · Function · 0.72
test_with_urls · Function · 0.72
test_with_paths · Function · 0.72
test_read_bytes_block · Function · 0.72
