Function read_text

dask/bag/text.py:17–161

Read lines from text files.

(
    urlpath,
    blocksize=None,
    compression="infer",
    encoding=system_encoding,
    errors="strict",
    linedelimiter=None,
    collection=True,
    storage_options=None,
    files_per_partition=None,
    include_path=False,
)

Source from the content-addressed store, hash-verified

def read_text(
    urlpath,
    blocksize=None,
    compression="infer",
    encoding=system_encoding,
    errors="strict",
    linedelimiter=None,
    collection=True,
    storage_options=None,
    files_per_partition=None,
    include_path=False,
):
    """Read lines from text files

    Parameters
    ----------
    urlpath : string or list
        Absolute or relative filepath(s). Prefix with a protocol like ``s3://``
        to read from alternative filesystems. To read from multiple files you
        can pass a globstring or a list of paths, with the caveat that they
        must all have the same protocol.
    blocksize: None, int, or str
        Size (in bytes) to cut up larger files. Streams by default.
        Can be ``None`` for streaming, an integer number of bytes, or a string
        like "128MiB"
    compression: string
        Compression format like 'gzip' or 'xz'. Defaults to 'infer'
    encoding: string
    errors: string
    linedelimiter: string or None
    collection: bool, optional
        Return dask.bag if True, or list of delayed values if false
    storage_options: dict
        Extra options that make sense to a particular storage connection, e.g.
        host, port, username, password, etc.
    files_per_partition: None or int
        If set, group input files into partitions of the requested size,
        instead of one partition per file. Mutually exclusive with blocksize.
    include_path: bool
        Whether or not to include the path in the bag.
        If true, elements are tuples of (line, path).
        Default is False.

    Examples
    --------
    >>> b = read_text('myfiles.1.txt') # doctest: +SKIP
    >>> b = read_text('myfiles.*.txt') # doctest: +SKIP
    >>> b = read_text('myfiles.*.txt.gz') # doctest: +SKIP
    >>> b = read_text('s3://bucket/myfiles.*.txt') # doctest: +SKIP
    >>> b = read_text('s3://key:secret@bucket/myfiles.*.txt') # doctest: +SKIP
    >>> b = read_text('hdfs://namenode.example.com/myfiles.*.txt') # doctest: +SKIP

    Parallelize a large file by providing the number of uncompressed bytes to
    load into each partition.

    >>> b = read_text('largefile.txt', blocksize='10MB') # doctest: +SKIP

    Get file paths of the bag by setting include_path=True
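
The docstring's description of `files_per_partition`, `blocksize`, and `include_path` can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not dask's implementation: `plan_partitions` and `iter_elements` are hypothetical helpers, the error message is invented for illustration, and no dask calls are involved. Lines keep their trailing newline in this sketch.

```python
import os
import tempfile


def plan_partitions(paths, blocksize=None, files_per_partition=None):
    """Hypothetical helper: group file paths into partitions."""
    if blocksize is not None and files_per_partition is not None:
        # The docstring describes the two options as mutually exclusive.
        raise ValueError("set only one of blocksize or files_per_partition")
    if files_per_partition is None:
        # Default described in the docstring: one partition per file.
        return [[path] for path in paths]
    # Otherwise group consecutive files into chunks of the requested size.
    return [
        paths[start:start + files_per_partition]
        for start in range(0, len(paths), files_per_partition)
    ]


def iter_elements(partition, include_path=False, encoding="utf-8", errors="strict"):
    """Hypothetical helper: yield the elements of one partition."""
    for path in partition:
        with open(path, encoding=encoding, errors=errors) as handle:
            for line in handle:
                # With include_path=True, elements are (line, path) tuples.
                yield (line, path) if include_path else line


# Build three small files to exercise the helpers.
tmpdir = tempfile.mkdtemp()
paths = []
for index in range(3):
    path = os.path.join(tmpdir, f"myfiles.{index}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"first line of file {index}\nsecond line of file {index}\n")
    paths.append(path)

partitions = plan_partitions(paths, files_per_partition=2)
elements = list(iter_elements(partitions[0], include_path=True))
```

Here three files with `files_per_partition=2` give two partitions (two files, then one), and each element of the first partition pairs a line with the path it came from.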

Callers 5

test_read_text · Function · 0.90
test_read_text_unicode_no_collection · Function · 0.90
test_files_per_partition · Function · 0.90
test_errors · Function · 0.90
test_complex_delimiter · Function · 0.90

Calls 5

parse_bytes · Function · 0.90
delayed · Function · 0.90
read_bytes · Function · 0.90
from_delayed · Function · 0.90
concat · Function · 0.70
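
The `parse_bytes` entry in the list above suggests how a string `blocksize` such as "128MiB" becomes an integer byte count. The following is a minimal sketch of that kind of conversion, assuming decimal units (kB, MB, GB) are powers of 1000 and binary units (KiB, MiB, GiB) are powers of 1024. `to_bytes` is a hypothetical stand-in, not dask's `parse_bytes`, and it accepts far fewer spellings.

```python
import re

# Assumed unit table: decimal units use powers of 1000, binary units powers of 1024.
_UNIT_FACTORS = {
    "": 1,
    "b": 1,
    "kb": 10**3,
    "mb": 10**6,
    "gb": 10**9,
    "kib": 2**10,
    "mib": 2**20,
    "gib": 2**30,
}

_SIZE_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*")


def to_bytes(size):
    """Hypothetical helper: convert an int or a string like '128MiB' to bytes."""
    if isinstance(size, int):
        # Integers are already a number of bytes.
        return size
    match = _SIZE_PATTERN.fullmatch(size)
    if match is None:
        raise ValueError(f"could not interpret {size!r} as a byte size")
    number, unit = match.groups()
    try:
        factor = _UNIT_FACTORS[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown unit in {size!r}") from None
    return int(float(number) * factor)


binary_block = to_bytes("128MiB")
decimal_block = to_bytes("10MB")
```

Under these assumptions "128MiB" is 128 × 2^20 bytes and "10MB" is 10 × 10^6 bytes, so the two spellings are not interchangeable when choosing a block size.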

Tested by 5

test_read_text · Function · 0.72
test_read_text_unicode_no_collection · Function · 0.72
test_files_per_partition · Function · 0.72
test_errors · Function · 0.72
test_complex_delimiter · Function · 0.72
