
Function read_text

dask/bag/text.py:17–161

Read lines from text files.

(
    urlpath,
    blocksize=None,
    compression="infer",
    encoding=system_encoding,
    errors="strict",
    linedelimiter=None,
    collection=True,
    storage_options=None,
    files_per_partition=None,
    include_path=False,
)

Source

def read_text(
    urlpath,
    blocksize=None,
    compression="infer",
    encoding=system_encoding,
    errors="strict",
    linedelimiter=None,
    collection=True,
    storage_options=None,
    files_per_partition=None,
    include_path=False,
):
    """Read lines from text files

    Parameters
    ----------
    urlpath : string or list
        Absolute or relative filepath(s). Prefix with a protocol like ``s3://``
        to read from alternative filesystems. To read from multiple files you
        can pass a globstring or a list of paths, with the caveat that they
        must all have the same protocol.
    blocksize: None, int, or str
        Size (in bytes) to cut up larger files. Streams by default.
        Can be ``None`` for streaming, an integer number of bytes, or a string
        like "128MiB"
    compression: string
        Compression format like 'gzip' or 'xz'. Defaults to 'infer'
    encoding: string
    errors: string
    linedelimiter: string or None
    collection: bool, optional
        Return dask.bag if True, or list of delayed values if false
    storage_options: dict
        Extra options that make sense to a particular storage connection, e.g.
        host, port, username, password, etc.
    files_per_partition: None or int
        If set, group input files into partitions of the requested size,
        instead of one partition per file. Mutually exclusive with blocksize.
    include_path: bool
        Whether or not to include the path in the bag.
        If true, elements are tuples of (line, path).
        Default is False.

    Examples
    --------
    >>> b = read_text('myfiles.1.txt')  # doctest: +SKIP
    >>> b = read_text('myfiles.*.txt')  # doctest: +SKIP
    >>> b = read_text('myfiles.*.txt.gz')  # doctest: +SKIP
    >>> b = read_text('s3://bucket/myfiles.*.txt')  # doctest: +SKIP
    >>> b = read_text('s3://key:secret@bucket/myfiles.*.txt')  # doctest: +SKIP
    >>> b = read_text('hdfs://namenode.example.com/myfiles.*.txt')  # doctest: +SKIP

    Parallelize a large file by providing the number of uncompressed bytes to
    load into each partition.

    >>> b = read_text('largefile.txt', blocksize='10MB')  # doctest: +SKIP

    Get file paths of the bag by setting include_path=True
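
As a rough illustration of the parameters documented above, the sketch below writes four small files to a temporary directory and reads them back three ways: one partition per file, grouped with files_per_partition, and with include_path=True. The file names (part.0.txt and so on) and their contents are made up for the example, and the synchronous scheduler is chosen only to keep the run single-process; none of this is taken from the dask source or its test suite.

```python
import os
import tempfile

import dask.bag as db

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical input: four small text files, two lines each.
    for i in range(4):
        with open(os.path.join(tmpdir, f"part.{i}.txt"), "w") as f:
            f.write(f"line a from {i}\nline b from {i}\n")
    pattern = os.path.join(tmpdir, "part.*.txt")

    # With blocksize=None (the default), each file becomes one partition.
    per_file = db.read_text(pattern)
    # Group two input files into each partition instead.
    grouped = db.read_text(pattern, files_per_partition=2)
    # Elements become (line, path) tuples instead of bare lines.
    with_paths = db.read_text(pattern, include_path=True)

    npartitions_per_file = per_file.npartitions
    npartitions_grouped = grouped.npartitions

    # Bags are lazy, so compute while the temporary files still exist.
    lines = per_file.compute(scheduler="synchronous")
    pairs = with_paths.compute(scheduler="synchronous")

print(npartitions_per_file, npartitions_grouped)
print(lines[0])
print(pairs[0])
```

Each element keeps its trailing line delimiter, so a follow-up such as per_file.map(str.strip) is a common next step when only the text is wanted.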

Callers 5

test_read_text · Function · 0.90
test_files_per_partition · Function · 0.90
test_errors · Function · 0.90
test_complex_delimiter · Function · 0.90

Calls 5

parse_bytes · Function · 0.90
delayed · Function · 0.90
read_bytes · Function · 0.90
from_delayed · Function · 0.90
concat · Function · 0.70

Tested by 5

test_read_text · Function · 0.72
test_files_per_partition · Function · 0.72
test_errors · Function · 0.72
test_complex_delimiter · Function · 0.72
