
github.com/peak/s5cmd @v2.3.0 sqlite

901 symbols 5,075 edges 81 files 453 documented · 50%
README


Overview

s5cmd is a very fast S3 and local filesystem execution tool. It comes with support for a multitude of operations, including tab completion and wildcard support for files, which can be very handy for your object storage workflow when working with a large number of files.

There are already other utilities to work with S3 and similar object storage services, thus it is natural to wonder what s5cmd has to offer that others don't.

In short, s5cmd is very fast. Thanks to Joshua Robinson for his study and experimentation on s5cmd; to quote his Medium post:

For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s), whereas s3cmd and aws-cli can only reach 85 MB/s and 375 MB/s respectively.

If you would like to know more about the performance of s5cmd and the reasons for its speed, refer to the benchmarks section.

Features

s5cmd supports a wide range of object management tasks for both cloud storage services and local filesystems.

  • List buckets and objects
  • Upload, download or delete objects
  • Move, copy or rename objects
  • Set Server Side Encryption using AWS Key Management Service (KMS)
  • Set Access Control List (ACL) for objects/files on upload, copy, and move
  • Print object contents to stdout
  • Select JSON records from objects using SQL expressions
  • Create or remove buckets
  • Summarize object sizes, grouping by storage class
  • Wildcard support for all operations
  • Multiple arguments support for delete operation
  • Command file support to run commands in batches at very high execution speeds
  • Dry run support
  • S3 Transfer Acceleration support
  • Google Cloud Storage (and any other S3 API compatible service) support
  • Structured logging for querying command outputs
  • Shell auto-completion
  • S3 ListObjects API backward compatibility

Installation

Official Releases

Binaries

The Releases page provides pre-built binaries for Linux, macOS and Windows.

Homebrew

For macOS, a Homebrew tap is provided:

brew install peak/tap/s5cmd

Unofficial Releases (by Community)


Warning These releases are maintained by the community. They might be out of date compared to the official releases.

MacPorts

You can also install s5cmd from MacPorts on macOS:

sudo port selfupdate
sudo port install s5cmd

Conda

s5cmd is included in the conda-forge channel and can be installed through Conda.

Installing s5cmd from the conda-forge channel can be achieved by adding conda-forge to your channels with:

conda config --add channels conda-forge
conda config --set channel_priority strict

Once the conda-forge channel has been enabled, s5cmd can be installed with conda:

conda install s5cmd

P.S. Quoted from the s5cmd feedstock. You can also find further instructions in its README.

FreeBSD

On FreeBSD you can install s5cmd as a package:

pkg install s5cmd

or via ports:

cd /usr/ports/net/s5cmd
make install clean

Build from source

You can build s5cmd from source if you have Go 1.19+ installed.

go install github.com/peak/s5cmd/v2@master

⚠️ Please note that building from master is not guaranteed to be stable since development happens on the master branch.

Docker

Hub

$ docker pull peakcom/s5cmd
$ docker run --rm -v ~/.aws:/root/.aws peakcom/s5cmd <S3 operation>

ℹ️ /aws directory is the working directory of the image. Mounting your current working directory to it allows you to run s5cmd as if it were installed on your system:

docker run --rm -v $(pwd):/aws -v ~/.aws:/root/.aws peakcom/s5cmd <S3 operation>

Build

$ git clone https://github.com/peak/s5cmd && cd s5cmd
$ docker build -t s5cmd .
$ docker run --rm -v ~/.aws:/root/.aws s5cmd <S3 operation>

Usage

s5cmd supports multiple-level wildcards for all S3 operations. This is achieved by listing all S3 objects with the prefix up to the first wildcard, then filtering the results in-memory. For example, for the following command:

s5cmd cp 's3://bucket/logs/2020/03/*' .

first a ListObjects request is sent, then the copy operation is executed against each matching object, in parallel.
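The two steps described above (list by the literal prefix, then filter in memory) can be sketched in plain shell. This is only an illustration of the idea, not s5cmd's implementation: `list_prefix` and `filter_keys` are hypothetical helper names, and shell `case` patterns stand in for s5cmd's own wildcard matching.

```sh
# Hypothetical helper: the literal part of a pattern before its first wildcard.
# This is the prefix that would be handed to the listing request.
list_prefix() {
  # ${1%%[*?]*} removes the longest suffix that starts at the first '*' or '?'.
  printf '%s\n' "${1%%[*?]*}"
}

# Hypothetical helper: read keys on stdin and print only those matching the
# pattern. In a case pattern, '*' also matches '/', so nested keys match too.
filter_keys() {
  pattern=$1
  while IFS= read -r key; do
    case $key in
      $pattern) printf '%s\n' "$key" ;;
    esac
  done
}

list_prefix 'logs/2020/03/*'   # prints: logs/2020/03/

printf '%s\n' 'logs/2020/03/18/file1.gz' 'logs/2020/04/01/file9.gz' |
  filter_keys 'logs/2020/03/*' # prints: logs/2020/03/18/file1.gz
```

Listing only the literal prefix keeps the number of keys returned by the listing small, and the wildcard itself is evaluated locally.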

Specifying credentials

s5cmd uses the official AWS SDK to access S3. The SDK requires credentials to sign requests to AWS. Credentials can be provided in a variety of ways:

  • Command line options --profile to use a named profile, --credentials-file flag to use the specified credentials file

    ```sh
    # Use your company profile in AWS default credential file
    s5cmd --profile my-work-profile ls s3://my-company-bucket/

    # Use your company profile in your own credential file
    s5cmd --credentials-file ~/.your-credentials-file --profile my-work-profile ls s3://my-company-bucket/
    ```

  • Environment variables

    ```sh
    # Export your AWS access key and secret pair
    export AWS_ACCESS_KEY_ID=''
    export AWS_SECRET_ACCESS_KEY=''
    export AWS_PROFILE=''
    export AWS_REGION=''

    s5cmd ls s3://your-bucket/
    ```

  • If s5cmd runs on an Amazon EC2 instance, EC2 IAM role

  • If s5cmd runs on EKS, Kube IAM role
  • Or, you can send requests anonymously with --no-sign-request option

    ```sh
    # List objects in a public bucket
    s5cmd --no-sign-request ls s3://public-bucket/
    ```

Region detection

While executing the commands, s5cmd detects the region according to the following order of priority:

  1. --source-region or --destination-region flags of cp command.
  2. AWS_REGION environment variable.
  3. Region section of AWS profile.
  4. Auto detection from bucket region (via HeadBucket API call).
  5. us-east-1 as default region.
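
The priority order above amounts to "first non-empty candidate wins". A minimal sketch of that fallback chain follows. `resolve_region` and its three positional arguments are hypothetical stand-ins for the flag value, the profile's region, and the auto-detected bucket region; only `AWS_REGION` is a real name from the list above.

```sh
# Hypothetical helper: pick a region using the documented order of priority.
# $1 = value of --source-region/--destination-region (may be empty)
# $2 = region from the AWS profile (may be empty)
# $3 = region auto-detected from the bucket (may be empty)
resolve_region() {
  flag_region=$1
  profile_region=$2
  bucket_region=$3
  # Candidates are listed from highest to lowest priority; us-east-1 is the
  # final default, so the loop always prints something.
  for candidate in "$flag_region" "${AWS_REGION:-}" "$profile_region" "$bucket_region" "us-east-1"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
}

resolve_region 'us-west-2' 'eu-central-1' ''   # prints: us-west-2
```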

Examples

Check if a bucket exists

s5cmd head s3://bucket/

Print a remote object's metadata

s5cmd head s3://bucket/object.gz

Download a single S3 object

s5cmd cp s3://bucket/object.gz .

Download multiple S3 objects

Suppose we have the following objects:

s3://bucket/logs/2020/03/18/file1.gz
s3://bucket/logs/2020/03/19/file2.gz
s3://bucket/logs/2020/03/19/originals/file3.gz

s5cmd cp 's3://bucket/logs/2020/03/*' logs/

s5cmd will match the given wildcards and arguments by doing an efficient search against the given prefixes. All matching objects will be downloaded in parallel. s5cmd will create the destination directory if it is missing.

logs/ directory content will look like:

$ tree
.
└── logs
    ├── 18
    │   └── file1.gz
    └── 19
        ├── file2.gz
        └── originals
            └── file3.gz

4 directories, 3 files

ℹ️ s5cmd preserves the source directory structure by default. If you want to flatten the source directory structure, use the --flatten flag.

s5cmd cp --flatten 's3://bucket/logs/2020/03/*' logs/

logs/ directory content will look like:

$ tree
.
└── logs
    ├── file1.gz
    ├── file2.gz
    └── file3.gz

1 directory, 3 files
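
The two layouts shown above differ only in how a source key is mapped to a destination path. The sketch below reproduces that mapping for the example keys. It is an illustration under stated assumptions, not s5cmd's code: `dest_path` is a hypothetical helper, and the `yes`/`no` argument merely stands in for the presence of the --flatten flag.

```sh
# Hypothetical helper: compute the local path for a downloaded key.
# $1 = object key, $2 = literal prefix before the wildcard,
# $3 = destination directory, $4 = "yes" to flatten, anything else to preserve
dest_path() {
  key=$1
  prefix=$2
  dest=$3
  flatten=$4
  if [ "$flatten" = "yes" ]; then
    # Flattened: keep only the file name (everything after the last '/').
    printf '%s%s\n' "$dest" "${key##*/}"
  else
    # Preserved: keep the part of the key that follows the listing prefix.
    printf '%s%s\n' "$dest" "${key#"$prefix"}"
  fi
}

dest_path 'logs/2020/03/19/originals/file3.gz' 'logs/2020/03/' 'logs/' no
# prints: logs/19/originals/file3.gz
dest_path 'logs/2020/03/19/originals/file3.gz' 'logs/2020/03/' 'logs/' yes
# prints: logs/file3.gz
```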

Upload a file to S3

s5cmd cp object.gz s3://bucket/

To set server-side encryption (AWS KMS) for the file:

s5cmd cp -sse aws:kms -sse-kms-key-id <your-kms-key-id> object.gz s3://bucket/

To set the Access Control List (ACL) policy of the object:

s5cmd cp -acl bucket-owner-full-control object.gz s3://bucket/

Upload multiple files to S3

s5cmd cp directory/ s3://bucket/

This uploads all files in the given directory to S3 while keeping the folder hierarchy of the source.

Stream stdin to S3

You can upload remote objects by piping stdin to s5cmd:

curl https://github.com/peak/s5cmd/ | s5cmd pipe s3://bucket/s5cmd.html

Or you can compress the data before uploading:

gzip -c file | s5cmd pipe s3://bucket/file.gz

Delete an S3 object

s5cmd rm s3://bucket/logs/2020/03/18/file1.gz

Delete multiple S3 objects

s5cmd rm 's3://bucket/logs/2020/03/19/*'

This removes all matching objects:

s3://bucket/logs/2020/03/19/file2.gz
s3://bucket/logs/2020/03/19/originals/file3.gz

s5cmd utilizes the S3 batch delete API. If there are up to 1000 matching objects, they are deleted in a single request. However, it should be noted that commands such as

s5cmd rm s3://bucket-foo/object s3://bucket-bar/object

are not supported by s5cmd and result in an error (since there are 2 different buckets), as this is at odds with the benefit of performing batch delete requests. If needed, one can use s5cmd run mode for this case, i.e.,

$ s5cmd run
rm s3://bucket-foo/object
rm s3://bucket-bar/object

More details and examples on s5cmd run are presented in a later section.
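
As a rough illustration of the batch delete behaviour described above, the number of delete requests for a given number of matching objects can be estimated with integer arithmetic. `delete_requests` is a hypothetical helper. The sketch assumes that more than 1000 matching objects are split into consecutive batches of at most 1000 keys, which is the per-request limit of the S3 DeleteObjects API; how s5cmd actually schedules the batches is not covered here.

```sh
# Hypothetical helper: estimate how many batch delete requests are needed
# for $1 matching objects, at no more than 1000 keys per request.
delete_requests() {
  n=$1
  # Integer ceiling of n / 1000.
  echo $(( (n + 999) / 1000 ))
}

delete_requests 1000   # prints: 1
delete_requests 2500   # prints: 3
```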

Copy objects from S3 to S3

s5cmd supports copying objects on the server side as well.

s5cmd cp 's3://bucket/logs/2020/*' s3://bucket/logs/backup/

This copies all the matching objects to the given S3 prefix, respecting the source folder hierarchy.

⚠️ Copying objects (from S3 to S3) larger than 5GB is not supported yet. We have an open ticket to track the issue.

Using Exclude and Include Filters

s5cmd supports the --exclude and --include flags, which can be used to specify patterns for objects to be excluded or included in commands.

  • The --exclude flag specifies objects that should be excluded from the operation. Any object that matches the pattern will be skipped.
  • The --include flag specifies objects that should be included in the operation. Only objects that match the pattern will be handled.
  • If both flags are used, --exclude has precedence over --include. This means that if an object URL matches any of the --exclude patterns, the object will be skipped, even if it also matches one of the --include patterns.
  • The order of the flags does not affect the results (unlike aws-cli).
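
The rules above can be condensed into a small decision function. The sketch below is an approximation, not s5cmd's implementation: `should_process` is a hypothetical helper, patterns are matched with shell `case` globbing against a bare object name, and the patterns are passed as space-separated lists rather than repeated flags.

```sh
# Hypothetical helper: succeed (exit 0) if the key should be processed.
# $1 = key, $2 = space-separated include patterns, $3 = space-separated
# exclude patterns. The body runs in a subshell so 'set -f' stays local.
should_process() (
  set -f                       # keep patterns literal: no filename expansion
  key=$1
  for pat in $3; do            # exclude has precedence: any match skips the key
    case $key in $pat) exit 1 ;; esac
  done
  [ -z "$2" ] && exit 0        # no include patterns: everything else passes
  for pat in $2; do            # otherwise at least one include must match
    case $key in $pat) exit 0 ;; esac
  done
  exit 1
)

for name in request.log license.txt access_log.txt readme.md; do
  if should_process "$name" '*.log *.txt' 'access_*'; then
    echo "$name: included"
  else
    echo "$name: skipped"
  fi
done
```

Because the exclude patterns are checked first and unconditionally, the order in which the patterns are supplied cannot change the result.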

The command below will delete only objects that end with .log.

s5cmd rm --include "*.log" 's3://bucket/logs/2020/*'

The command below will delete all objects except those that end with .log or .txt.

s5cmd rm --exclude "*.log" --exclude "*.txt" 's3://bucket/logs/2020/*'

If you wish, you can use multiple flags, like below. It will download objects that start with request or end with .log.

s5cmd cp --include "*.log" --include "request*" 's3://bucket/logs/2020/*' .

Using a combination of --include and --exclude is also possible. The command below will only sync objects that end with .log or .txt but exclude those that start with access_. For example, request.log and license.txt will be included, while access_log.txt and readme.md are excluded.

s5cmd sync --include "*.log" --exclude "access_*" --include "*.txt" 's3://bucket/logs/*' .

Select JSON object content using SQL

s5cmd supports the SelectObjectContent S3 operation, and will run your SQL query against objects matching normal wildcard syntax and emit matching JSON records via stdout. Records from multiple objects will be interleaved, and order of the records is not guaranteed (though it's likely that the records from a single object will arrive in-order, even if interleaved with other records).

$ s5cmd select --compression GZIP \
  --query "SELECT s.timestamp, s.hostname FROM S3Object s WHERE s.ip_address LIKE '10.%' OR s.application='unprivileged'" \
  s3://bucket-foo/object/2021/*
{"timestamp":"2021-07-08T18:24:06.665Z","hostname":"application.internal"}
{"timestamp":"2021-07-08T18:24:16.095Z","hostname":"api.github.com"}

At the moment this operation only supports JSON records selected with SQL. S3 calls this lines-type JSON, but it seems to work even if the records aren't line-delimited. YMMV.

Count objects and determine total size

$ s5cmd du --humanize 's3://bucket/2020/*'

30.8M bytes in 3 objects: s3://bucket/2020/*

Run multiple commands in parallel

The most powerful feature of s5cmd is the commands file. Thousands of S3 and filesystem commands are declared in a file (or simply piped in from another process) and they are executed using multiple parallel workers. Since only one program is launched, thousands of unnecessary fork-exec calls are avoided. This way S3 execution times can reach a few thousand operations per second.

s5cmd run commands.txt

or

cat commands.txt | s5cmd run

commands.txt content could look like:

cp s3://bucket/2020/03/* logs/2020/03/

# line comments are supported
rm s3://bucket/2020/03/19/file2.gz

# empty lines are OK too like above

# rename an S3 object
mv s3://bucket/2020/03/18/file1.gz s3://bucket/2020/03/18/original/file.gz
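
Since commands can be piped in from another process, a commands file does not have to be written by hand. The sketch below generates one cp command per day. `gen_commands` is a hypothetical helper, and the bucket name and paths are the placeholder values used in the examples above.

```sh
# Hypothetical generator: print one cp command per day given on the
# command line. $1 = bucket, $2 = year/month, remaining arguments = days.
gen_commands() {
  bucket=$1
  month=$2
  shift 2
  for day in "$@"; do
    # The '*' is written literally into the output; it is not expanded
    # by the shell because it is part of the printf format string.
    printf 'cp s3://%s/%s/%s/* logs/%s/%s/\n' "$bucket" "$month" "$day" "$month" "$day"
  done
}

gen_commands bucket 2020/03 18 19
# prints:
# cp s3://bucket/2020/03/18/* logs/2020/03/18/
# cp s3://bucket/2020/03/19/* logs/2020/03/19/
```

The output can be redirected to commands.txt or piped directly, for example `gen_commands bucket 2020/03 18 19 | s5cmd run`.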

Sync

sync command synchronizes S3 buckets, prefixes, direct

Extension points exported contracts — how you extend this code

Message (Interface)
Message is an interface to print structured logs. [11 implementers]
log/message.go
Storage (Interface)
Storage is an interface for storage operations that is common to local filesystem and remote object storage. [4 implementers]
storage/storage.go
SyncStrategy (Interface)
SyncStrategy is the interface to make decision whether given source object should be synced to destination object [2 implementers]
command/sync_strategy.go
EventStreamDecoder (Interface)
EventStreamDecoder decodes a s3.Event with the given decoder. [2 implementers]
storage/s3.go
ProgressBar (Interface)
(no doc) [2 implementers]
progressbar/progressbar.go
Task (FuncType)
Task is a function type for parallel manager.
parallel/parallel.go
Option (FuncType)
(no doc)
storage/url/url.go

Core symbols most depended-on inside this repo

printError
called by 99
command/error.go
String
called by 88
storage/url/url.go
Run
called by 68
command/cp.go
String
called by 61
command/flag.go
String
called by 61
log/log.go
New
called by 60
storage/url/url.go
Join
called by 52
storage/url/url.go
commandFromContext
called by 39
command/context.go

Shape

Function 570
Method 235
Struct 74
FuncType 7
TypeAlias 7
Interface 5
Class 3

Languages

Go 97%
Python 3%

Modules by API surface

e2e/cp_test.go 96 symbols
e2e/util_test.go 74 symbols
storage/s3.go 55 symbols
e2e/sync_test.go 49 symbols
storage/url/url.go 32 symbols
storage/storage.go 32 symbols
e2e/rm_test.go 30 symbols
benchmark/bench.py 28 symbols
storage/s3_test.go 26 symbols
command/cp.go 26 symbols
e2e/ls_test.go 25 symbols
progressbar/progressbar.go 22 symbols

Dependencies from manifests, versioned

github.com/VividCortex/ewma v1.2.0 · 1×
github.com/aws/aws-sdk-go v1.44.298 · 1×
github.com/cpuguy83/go-md2man/v2 v2.0.2 · 1×
github.com/hashicorp/errwrap v1.0.0 · 1×
github.com/iancoleman/strcase v0.0.0-2019111223294 · 1×
github.com/igungor/gofakes3 v0.0.18 · 1×

For agents

$ claude mcp add s5cmd \
  -- python -m otcore.mcp_server <graph>
