def read_text(
    urlpath,
    blocksize=None,
    compression="infer",
    encoding=system_encoding,
    errors="strict",
    linedelimiter=None,
    collection=True,
    storage_options=None,
    files_per_partition=None,
    include_path=False,
):
    """Read lines from text files

    Parameters
    ----------
    urlpath : string or list
        Absolute or relative filepath(s). Prefix with a protocol like ``s3://``
        to read from alternative filesystems. To read from multiple files you
        can pass a globstring or a list of paths, with the caveat that they
        must all have the same protocol.
    blocksize: None, int, or str
        Size (in bytes) to cut up larger files. Streams by default.
        Can be ``None`` for streaming, an integer number of bytes, or a string
        like "128MiB"
    compression: string
        Compression format like 'gzip' or 'xz'. Defaults to 'infer'
    encoding: string
    errors: string
    linedelimiter: string or None
    collection: bool, optional
        Return dask.bag if True, or list of delayed values if false
    storage_options: dict
        Extra options that make sense to a particular storage connection, e.g.
        host, port, username, password, etc.
    files_per_partition: None or int
        If set, group input files into partitions of the requested size,
        instead of one partition per file. Mutually exclusive with blocksize.
    include_path: bool
        Whether or not to include the path in the bag.
        If true, elements are tuples of (line, path).
        Default is False.

    Examples
    --------
    >>> b = read_text('myfiles.1.txt')  # doctest: +SKIP
    >>> b = read_text('myfiles.*.txt')  # doctest: +SKIP
    >>> b = read_text('myfiles.*.txt.gz')  # doctest: +SKIP
    >>> b = read_text('s3://bucket/myfiles.*.txt')  # doctest: +SKIP
    >>> b = read_text('s3://key:secret@bucket/myfiles.*.txt')  # doctest: +SKIP
    >>> b = read_text('hdfs://namenode.example.com/myfiles.*.txt')  # doctest: +SKIP

    Parallelize a large file by providing the number of uncompressed bytes to
    load into each partition.

    >>> b = read_text('largefile.txt', blocksize='10MB')  # doctest: +SKIP

    Get file paths of the bag by setting include_path=True
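
The `files_per_partition` and `include_path` parameters described in the docstring above can be illustrated with a minimal plain-Python sketch. This is not the library's implementation. The helper names `group_files`, `read_partition`, and `demo` are hypothetical, and splitting the file list into consecutive chunks is an assumption about what "partitions of the requested size" means. The sketch only shows the element shape: plain lines by default, and `(line, path)` tuples when `include_path` is true.

```python
import os
import tempfile


def group_files(paths, files_per_partition):
    """Split paths into consecutive groups of at most files_per_partition files.

    Hypothetical helper: assumes simple consecutive chunking.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be a positive integer")
    return [
        paths[i : i + files_per_partition]
        for i in range(0, len(paths), files_per_partition)
    ]


def read_partition(paths, include_path=False, encoding="utf-8"):
    """Return every line of every file in one group, in file order.

    With include_path=True each element is a (line, path) tuple,
    the element shape the docstring describes.
    """
    elements = []
    for path in paths:
        with open(path, encoding=encoding) as f:
            for line in f:
                elements.append((line, path) if include_path else line)
    return elements


def demo():
    """Write three small files and read them back as two partitions."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, f"myfiles.{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(f"line a of file {i}\nline b of file {i}\n")
            paths.append(path)

        # Three files, two per partition: groups of sizes 2 and 1.
        groups = group_files(paths, files_per_partition=2)
        return [read_partition(g, include_path=True) for g in groups]


if __name__ == "__main__":
    for number, partition in enumerate(demo()):
        for line, path in partition:
            print(number, os.path.basename(path), line.rstrip("\n"))
```

Each file in the sketch is read whole, which corresponds to leaving `blocksize` unset. This matches the docstring's note that `files_per_partition` and `blocksize` are mutually exclusive: one groups whole files, the other cuts a file into byte ranges.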