github.com/piskvorky/smart_open @v8.0.0 sqlite

997 symbols 2,870 edges 63 files 882 documented · 88%

README

smart_open — utils for streaming large files in Python

What?

smart_open is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.

Why?

Working with large remote files, for example using Amazon's boto3 Python library, is a pain. boto3's Object.upload_fileobj() and Object.download_fileobj() methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers. smart_open shields you from that. It builds on boto3 and other remote storage libraries, but offers a clean unified Pythonic API. The result is less code for you to write and fewer bugs to make.

How?

smart_open is well-tested, well-documented, and has a simple Pythonic API:

>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...    print(repr(line))
...    break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('tests/test_data/1984.txt.gz', encoding='utf-8'):
...    print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('tests/test_data/1984.txt.gz') as fin:
...    with open('tests/test_data/1984.txt.bz2', 'w') as fout:
...        for line in fin:
...           fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com'):
...     print(repr(line[:15]))
...     break
'<!doctype html>'

For more examples of URIs that smart_open accepts, see help.txt or help('smart_open'). Some examples:

s3://bucket/key
s3://access_key_id:secret_access_key@bucket/key
gcs://bucket/blob
azure://bucket/blob
hdfs://host:port/path/file
./local/path/file.gz
file:///home/user/file.bz2
[ssh|scp|sftp]://username:password@host/path/file
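To make the anatomy of these URIs concrete, here is a minimal sketch that splits one into scheme, credentials, host and path using only the standard library. The describe_uri helper is hypothetical and written for illustration; it is not part of smart_open and is not how smart_open parses URIs internally.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into the pieces a transport would need (illustrative only)."""
    parts = urlsplit(uri)
    return {
        # A bare path such as ./local/path/file.gz has no scheme;
        # this sketch labels it "file".
        "scheme": parts.scheme or "file",
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,  # for s3:// URIs this is the bucket name
        "path": parts.path,
    }


info = describe_uri("s3://access_key_id:secret_access_key@bucket/key")
print(info["scheme"], info["host"], info["path"])
```

Note that urlsplit splits the URI at the first "/" after the scheme, so this simple approach assumes the embedded secret contains no "/" character.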

Documentation

The API reference can be viewed at help.txt or using help('smart_open').

Installation

smart_open supports a wide range of storage solutions. For all options, see the API reference. Each individual solution has its own dependencies. By default, smart_open does not install any dependencies in order to keep the installation size small. You can install one or more of these dependencies explicitly using optional dependencies defined in pyproject.toml:

pip install 'smart_open[s3,gcs,azure,http,webhdfs,ssh,zst,lz4]'

Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:

pip install 'smart_open[all]'
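Because the optional dependencies are not installed by default, it can be handy to check which ones are importable in the current environment. The sketch below does that with the standard library. The extra-to-module mapping is an assumption made for illustration and covers only some extras; pyproject.toml remains the authoritative list.

```python
import importlib.util

# Extra name -> module that extra is assumed to provide.
# This mapping is an illustrative assumption, not taken from pyproject.toml.
EXTRA_MODULES = {
    "s3": "boto3",
    "gcs": "google.cloud.storage",
    "azure": "azure.storage.blob",
    "http": "requests",
    "webhdfs": "requests",
    "ssh": "paramiko",
}


def is_importable(module_name):
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "google") is missing,
        # or when an already-imported module has no spec.
        return False


def available_extras():
    """Map each extra name to whether its module is importable."""
    return {extra: is_importable(mod) for extra, mod in EXTRA_MODULES.items()}


for extra, ok in sorted(available_extras().items()):
    print(f"{extra}: {'installed' if ok else 'missing'}")
```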

Built-in help

To view the API reference, use Python's built-in help function:

help("smart_open")

or view help.txt in your browser.

More examples

For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:

pip install 'smart_open[all]'

import os, boto3, botocore
from smart_open import open

# stream content *into* S3 (write mode) using a custom client
# this client is thread-safe ref https://github.com/boto/boto3/blob/1.38.41/docs/source/guide/clients.rst?plain=1#L111
config = botocore.client.Config(
    max_pool_connections=64,
    tcp_keepalive=True,
    retries={"max_attempts": 6, "mode": "adaptive"},
)
client = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
).client("s3", config=config)
with open(
    "s3://smart-open-py37-benchmark-results/test.txt", "wb", transport_params={"client": client}
) as fout:
    bytes_written = fout.write(b"hello world!")
    print(bytes_written)

# perform a single-part upload to S3 (saves billable API requests, and allows seek() before upload)
with open(
    "s3://smart-open-py37-benchmark-results/test.txt", "wb", transport_params={"multipart_upload": False}
) as fout:
    bytes_written = fout.write(b"hello world!")
    print(bytes_written)

# now with tempfile.TemporaryFile instead of the default io.BytesIO (to reduce memory footprint)
import tempfile

with (
    tempfile.TemporaryFile() as tmp,
    open(
        "s3://smart-open-py37-benchmark-results/test.txt",
        "wb",
        transport_params={"multipart_upload": False, "writebuffer": tmp},
    ) as fout,
):
    bytes_written = fout.write(b"hello world!")
    print(bytes_written)

# stream from HDFS
for line in open("hdfs://host:port/user/hadoop/my_file.txt", encoding="utf8"):
    print(line)

# stream from WebHDFS
for line in open("webhdfs://host:port/user/hadoop/my_file.txt"):
    print(line)

# stream content *into* HDFS (write mode):
with open("hdfs://host:port/user/hadoop/my_file.txt", "wb") as fout:
    fout.write(b"hello world")

# stream content *into* WebHDFS (write mode):
with open("webhdfs://host:port/user/hadoop/my_file.txt", "wb") as fout:
    fout.write(b"hello world")

# stream from a completely custom s3 server, like s3proxy:
client = boto3.client(
    "s3", endpoint_url="http://host:port", aws_access_key_id="user", aws_secret_access_key="secret"
)
for line in open("s3://mybucket/mykey.txt", transport_params={"client": client}):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name="digitalocean")
client = session.client("s3", endpoint_url="https://ams3.digitaloceanspaces.com")
transport_params = {"client": client}
with open("s3://bucket/key.txt", "wb", transport_params=transport_params) as fout:
    fout.write(b"here we stand")

# stream from GCS
for line in open("gcs://my_bucket/my_file.txt"):
    print(line)

# stream content *into* GCS (write mode):
with open("gcs://my_bucket/my_file.txt", "wb") as fout:
    fout.write(b"hello world")

# stream from Azure Blob Storage
import azure.storage.blob

connect_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
transport_params = {
    "client": azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open("azure://mycontainer/myfile.txt", transport_params=transport_params):
    print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
transport_params = {
    "client": azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open("azure://mycontainer/my_file.txt", "wb", transport_params=transport_params) as fout:
    fout.write(b"hello world")

Compression Handling

The top-level compression parameter controls compression/decompression behavior when reading and writing. The supported values for this parameter are:

infer_from_extension (default behavior)
disable
.bz2
.gz
.lz4
.xz
.zst

By default, smart_open automatically (de)compresses the file if the filename ends with one of these extensions. See also smart_open.compression.get_supported_compression_types and smart_open.compression.register_compressor.

>>> from smart_open import open
>>> with open('tests/test_data/1984.txt.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri
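Inference from the extension can be pictured as a suffix lookup that picks a wrapper for the underlying binary stream. The following is a minimal sketch of that idea using only standard-library codecs. The wrap_by_extension helper and its _OPENERS table are hypothetical names; this is not smart_open's actual implementation.

```python
import bz2
import gzip
import io
import lzma
import os

# Suffix -> opener, limited to codecs available in the standard library.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def wrap_by_extension(file_obj, filename, mode="rb"):
    """Wrap a binary file object according to the filename's suffix."""
    _, ext = os.path.splitext(filename)
    opener = _OPENERS.get(ext.lower())
    if opener is None:
        # Unknown suffix: hand back the stream untouched.
        return file_obj
    return opener(file_obj, mode)


# Simulate a gzip-compressed file held in memory.
raw = io.BytesIO(gzip.compress(b"It was a bright cold day in April"))
with wrap_by_extension(raw, "1984.txt.gz") as fin:
    print(fin.read(16))
```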

You can override this behavior to either disable compression, or explicitly specify the algorithm to use. To disable compression:

>>> from smart_open import open
>>> with open('tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
...     print(fin.read(32))
b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'

To specify the algorithm explicitly (e.g. for non-standard file extensions):

>>> from smart_open import open
>>> with open('tests/test_data/1984.txt.gzip', compression='.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

To forward per-call options to the compression library (e.g. lower gzip's default compresslevel of 9 for faster writes), pass compression_kwargs:

>>> import tempfile
>>> from smart_open import open
>>> with tempfile.NamedTemporaryFile(suffix='.gz') as tmp:
...     with open(tmp.name, 'wb', compression_kwargs={'compresslevel': 6}) as fout:
...         _ = fout.write(b'hello world')

The dict is forwarded as-is; spell each option using the underlying library's own kwarg name (compresslevel for gzip/bz2, preset for xz, level for zstd, compression_level for lz4).
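Since each library spells its level option differently, a small lookup table can help when building compression_kwargs programmatically. The sketch below encodes the names listed above and checks them against the standard-library openers. LEVEL_KWARG and level_kwargs are hypothetical helpers, not part of smart_open, and the zstd and lz4 entries are not exercised here because those libraries are not in the standard library.

```python
import gzip
import io
import lzma

# Extension -> name of the "compression level" keyword, as listed above.
LEVEL_KWARG = {
    ".gz": "compresslevel",
    ".bz2": "compresslevel",
    ".xz": "preset",
    ".zst": "level",
    ".lz4": "compression_level",
}


def level_kwargs(extension, level):
    """Build a kwargs dict that uses the right option name for the extension."""
    return {LEVEL_KWARG[extension]: level}


# The standard-library openers accept the same keyword names.
gz_buf = io.BytesIO()
with gzip.open(gz_buf, "wb", **level_kwargs(".gz", 6)) as fout:
    fout.write(b"hello world")

xz_buf = io.BytesIO()
with lzma.open(xz_buf, "wb", **level_kwargs(".xz", 1)) as fout:
    fout.write(b"hello world")

print(level_kwargs(".gz", 6), level_kwargs(".xz", 1))
```

A dict built this way could then be passed as compression_kwargs in the open call shown earlier.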

You can also easily add support for other file extensions and compression formats. For example, to open xz-compressed files:

>>> import lzma, os
>>> from smart_open import open, register_compressor

>>> def _handle_xz(file_obj, mode, **kwargs):
...      return lzma.open(filename=file_obj, mode=mode, **kwargs)

>>> register_compressor('.xz', _handle_xz)

>>> with open('tests/test_data/1984.txt.xz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

This is just an example: lzma is in the standard library and is registered by default.

Transport-specific Options

smart_open supports a wide range of transport options out of the box. For the full list of supported URI schemes, see help.txt or help('smart_open'). Some examples:

AWS S3 (and any S3-Compatible)
HTTP, HTTPS (read-only)
SSH, SCP and SFTP
HDFS / WebHDFS
Google Cloud Storage
Azure Blob Storage

Each option involves setting up its own set of parameters. For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. smart_open's open function takes a transport_params keyword argument: a dict of additional parameters for the transport layer. Here are some examples of using this parameter:

>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see help.txt or help('smart_open').

help("smart_open.open")

S3 Credentials

smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining the credentials to use. By default, smart_open will defer to boto3 and let the latter take care of the credentials. There are several ways to override this behavior.

The first is to pass a boto3 S3 client object as the client transport parameter to the open function. You can customize the credentials when constructing the session that creates the client. smart_open will then use that client when talking to S3.

session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
client = session.client("s3", endpoint_url=..., config=...)
fin = open("s3://bucket/key", transport_params={"client": client})

Your second option is to specify the credentials within the S3 URL itself:

fin = open("s3://aws_access_key_id:aws_secret_access_key@bucket/key", ...)

Important: The two methods above are mutually exclusive. If you pass an AWS client and the URL contains credentials, smart_open will ignore the latter.

Important: smart_open ignores configuration files from the older boto library. Port your old boto settings to boto3 in order to use them with smart_open.
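Because credentials embedded in the URL are ignored once a client is passed, it can be useful to detect and strip them, for example before logging the URL. The following is a minimal standard-library sketch. The strip_credentials helper is hypothetical and not part of smart_open.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(uri):
    """Return (uri_without_credentials, had_credentials)."""
    parts = urlsplit(uri)
    if parts.username is None and parts.password is None:
        return uri, False
    # Keep only what follows the last "@" in the network location.
    host = parts.netloc.rpartition("@")[2]
    cleaned = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return cleaned, True


url, had_credentials = strip_credentials(
    "s3://aws_access_key_id:aws_secret_access_key@bucket/key"
)
print(url, had_credentials)
```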

S3 Advanced Usage

Additional keyword arguments can be propagated to the boto3 methods that are used by smart_open under the hood using the client_kwargs transport parameter.

For instance, to upload a blob with Metadata, ACL, StorageClass,

Core symbols (most depended-on inside this repo)

smart_open/concurrency.py

smart_open/bytebuffer.py

register_transport

called by 13

smart_open/transport.py

tell

called by 11

smart_open/http.py

Shape

Method 663

Function 227

Class 86

Route 21

Languages

Python 100%

Modules by API surface

tests/test_smart_open.py: 193 symbols

tests/test_s3.py: 126 symbols

tests/test_azure.py: 103 symbols

smart_open/s3.py: 87 symbols

smart_open/azure.py: 66 symbols

tests/test_gcs.py: 57 symbols

integration-tests/test_s3_ported.py: 32 symbols

smart_open/webhdfs.py: 29 symbols

tests/test_http.py: 23 symbols

smart_open/hdfs.py: 23 symbols

smart_open/http.py: 21 symbols

tests/test_ssh.py: 19 symbols

Used by 1 indexed graph (manifest dependencies, hub-wide)

github.com/allenai/olmocr

Dependencies (from manifests, versioned)

wrapt 1×

For agents

$ claude mcp add smart_open \
  -- python -m otcore.mcp_server <graph>