
github.com/NVIDIA/aistore @3.11 sqlite


8,411 symbols · 45,781 edges · 643 files · 2,538 documented (30%)

README

AIStore is a lightweight object storage system with the capability to linearly scale out with each added storage node and a special focus on petascale deep learning.

AIStore (AIS for short) is a built-from-scratch, lightweight storage stack tailored for AI apps. AIS consistently shows balanced I/O distribution and linear scalability across arbitrary numbers of clustered servers, producing performance charts that look as follows:

I/O distribution

The picture above comprises 120 HDDs.

The ability to scale linearly with each added disk was, and remains, one of the main incentives behind AIStore. Much of the development is also driven by the idea of offloading dataset transformations to AIS clusters.
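This page does not say how AIS decides which node stores which object. As a hedged illustration of how balanced distribution and incremental scale-out can coexist, the sketch below assumes rendezvous (highest-random-weight) hashing. The node names, object names, and the choice of SHA-256 are all hypothetical and chosen only for the example.

```python
import hashlib


def hrw_owner(object_name, nodes):
    """Return the node with the highest hash score for this object.

    Illustrative rendezvous hashing; not taken from the AIS code base.
    """
    def score(node):
        digest = hashlib.sha256(f"{node}/{object_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


def placement(objects, nodes):
    """Map every object name to its owning node."""
    return {name: hrw_owner(name, nodes) for name in objects}


# Hypothetical cluster: 4 storage nodes, then a 5th one joins.
objects = [f"shard-{i:05d}.tar" for i in range(10_000)]
nodes = [f"target-{i}" for i in range(4)]

before = placement(objects, nodes)
after = placement(objects, nodes + ["target-4"])

# Objects whose owner changed after the new node joined.
moved = [name for name in objects if before[name] != after[name]]

# Per-node object counts before the join.
counts = {node: 0 for node in nodes}
for owner in before.values():
    counts[owner] += 1
```

With this scheme, an object changes owner only if the new node scores highest for it, so every moved object lands on the new node and roughly one fifth of the data migrates. That is the property that makes adding a node cheap in this kind of design.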

Features

scale out with no downtime and no limitation;
arbitrary number of extremely lightweight access points;
highly-available control and data planes, end-to-end data protection, self-healing, n-way mirroring, k/m erasure coding;
comprehensive native HTTP-based (S3-like) API, as well as
compliant Amazon S3 API to run unmodified S3 clients and apps;
automated cluster rebalancing upon any changes in cluster membership, drive failures and attachments, bucket renames;
ETL offload via offline (dataset to dataset) or inline (on-the-fly) transformations.
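The feature list names both n-way mirroring and k/m erasure coding. As a rough way to compare the two, the sketch below computes raw-capacity overhead and the number of lost copies or slices each scheme survives. It assumes the usual convention that k/m means k data slices plus m parity slices; this page does not define the notation, and the function names are hypothetical.

```python
from fractions import Fraction


def mirroring(n):
    """n-way mirroring keeps n full copies of every object."""
    return {"overhead": Fraction(n), "tolerated_losses": n - 1}


def erasure_coding(k, m):
    """k data slices plus m parity slices (assumed meaning of k/m)."""
    return {"overhead": Fraction(k + m, k), "tolerated_losses": m}


# Compare two ways of surviving the loss of any two copies or slices.
three_way = mirroring(3)
ec_4_2 = erasure_coding(4, 2)

for label, scheme in (("3-way mirror", three_way), ("EC 4/2", ec_4_2)):
    print(f"{label}: {float(scheme['overhead']):.2f}x raw capacity, "
          f"survives {scheme['tolerated_losses']} losses")
```

Under these assumptions both configurations tolerate two losses, but mirroring stores three times the data while 4/2 erasure coding stores one and a half times, at the cost of encoding and reconstruction work.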

Also, AIStore:

can be deployed on any commodity hardware - effectively, on any Linux machine(s);
can be immediately populated - i.e., hydrated - from any file-based data source (local or remote, ad-hoc/on-demand or via asynchronous batch);
provides for easy Kubernetes deployment via a separate GitHub repo with
step-by-step deployment playbooks, and
AIS/K8s Operator;
contains integrated CLI for easy management and monitoring;
can ad-hoc attach remote AIS clusters, thus gaining immediate access to the respective hosted datasets
(referred to as global namespace capability);
natively reads, writes, and lists popular archives including tar, tar.gz, zip, and MessagePack;
distributed shuffle of those archival formats is also supported;
fully supports Amazon S3, Google Cloud, and Microsoft Azure backends
providing unified global namespace simultaneously across multiple backends:

Supported Backends

can be deployed as LRU-based fast cache for remote buckets; can be populated on-demand and/or via prefetch and download APIs;
can be used as a standalone highly-available protected storage;
includes MapReduce extension for massively parallel resharding of very large datasets;
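The list above mentions deploying AIS as an LRU-based cache for remote buckets, populated on demand. The following is a minimal, generic sketch of that idea and not AIS code: a byte-capacity cache that fetches from a remote source on a miss and evicts the least recently used objects when full. The class name, the `fetch_remote` callback, and the tiny capacity are all hypothetical.

```python
from collections import OrderedDict


class LRUObjectCache:
    """Byte-bounded cache in front of a slower remote store (illustrative)."""

    def __init__(self, capacity_bytes, fetch_remote):
        self.capacity_bytes = capacity_bytes
        self.fetch_remote = fetch_remote  # callable: name -> bytes
        self._objects = OrderedDict()     # oldest entry first
        self.used_bytes = 0

    def get(self, name):
        if name in self._objects:
            # Cache hit: mark the object as most recently used.
            self._objects.move_to_end(name)
            return self._objects[name]
        # Cache miss ("cold GET"): populate on demand from the remote store.
        data = self.fetch_remote(name)
        self._objects[name] = data
        self.used_bytes += len(data)
        # Evict least recently used objects, but always keep the newest one.
        while self.used_bytes > self.capacity_bytes and len(self._objects) > 1:
            _, evicted = self._objects.popitem(last=False)
            self.used_bytes -= len(evicted)
        return data


# Hypothetical remote bucket with three 4-byte objects and an 8-byte cache.
remote = {"a": b"x" * 4, "b": b"y" * 4, "c": b"z" * 4}
fetches = []


def fetch(name):
    fetches.append(name)
    return remote[name]


cache = LRUObjectCache(capacity_bytes=8, fetch_remote=fetch)
for name in ("a", "b", "a", "c", "b"):
    cache.get(name)
```

In this run the second read of `a` is served from the cache, reading `c` evicts `b` (the least recently used), and the final read of `b` has to go back to the remote store. A prefetch step would simply call `get` ahead of time for the objects a job is about to read.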

AIS runs natively on Kubernetes and features open format - thus, the freedom to copy or move your data from AIS at any time using the familiar Linux tar(1), scp(1), rsync(1) and similar.

For developers and data scientists, there's also:

native Go (language) API that we utilize in a variety of tools including CLI and Load Generator;
native Python API, and Python SDK that also contains PyTorch integration and usage examples.

For security and fine-grained (OAuth 2.0 compliant) access control to cluster resources and stored datasets, AIStore includes:

Authentication Server (AuthN). A single AuthN instance, currently at v1.0, can provide security/authentication for multiple AIStore clusters.

For the original AIStore white paper and design philosophy, an introduction to large-scale deep learning, and the most recently added features, please see AIStore Overview (where you can also find six alternative ways to work with existing datasets). Videos and animated presentations can be found at videos.

Finally, getting started with AIS takes only a few minutes.

Deployment options

AIS deployment options, as well as intended (development vs. production vs. first-time) usages, are all summarized here.

Since the prerequisites essentially boil down to having Linux with a disk, the deployment options range from an all-in-one container to a petascale bare-metal cluster of any size, and from a single VM to multiple racks of high-end servers. Practical use cases, of course, require further consideration and may include:

Option	Objective
Local playground	AIS developers and development, Linux or macOS
Minimal production-ready deployment	This option uses a preinstalled Docker image and targets first-time users or researchers (who could immediately start training their models on smaller datasets)
Easy automated GCP/GKE deployment	Developers, first-time users, AI researchers
Large-scale production deployment	Requires Kubernetes and is provided via a separate repository: ais-k8s

Further, there's the capability referred to as global namespace: given HTTP(S) connectivity, AIS clusters can be easily interconnected to "see" each other's datasets. Hence the idea: start "small" and gradually, incrementally build high-performance shared capacity.

For detailed discussion on supported deployments, please refer to Getting Started.

For performance tuning and preparing AIS nodes for bare-metal deployment, see performance.

Related Software

When it comes to PyTorch, WebDataset is the preferred AIStore client.

WebDataset is a PyTorch Dataset (IterableDataset) implementation providing efficient access to datasets stored in POSIX tar archives.
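To make the tar-based layout concrete, here is a small sketch that uses only the Python standard library, not the WebDataset package itself. It assumes the WebDataset naming convention in which files sharing a basename (for example `000001.jpg` and `000001.cls`) form one training sample. The helper names and sample contents are hypothetical.

```python
import io
import tarfile


def write_shard(samples):
    """Serialize {key: {extension: payload}} into tar bytes (one shard)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(shard_bytes):
    """Yield (key, fields), grouping consecutive members by basename."""
    current_key, fields = None, {}
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if key != current_key and fields:
                # Basename changed: the previous sample is complete.
                yield current_key, fields
                fields = {}
            current_key = key
            fields[ext] = tar.extractfile(member).read()
    if fields:
        yield current_key, fields


# Two hypothetical samples, each with an image payload and a class label.
samples = {
    "000001": {"jpg": b"\xff\xd8fake-image-1", "cls": b"3"},
    "000002": {"jpg": b"\xff\xd8fake-image-2", "cls": b"7"},
}
shard = write_shard(samples)
decoded = list(read_shard(shard))
```

Because members are read strictly in order, a shard like this can be consumed as a stream, which is what makes plain tar archives a convenient unit for sequential I/O during training.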

Further references include the technical blog post titled AIStore & ETL: Using WebDataset to train on a sharded dataset, which also provides easy step-by-step instructions.

Guides and References

Getting Started
Technical Blog
API and SDK
Go (language) API
Python SDK via Python Package Index (PyPI)
REST API
S3 compatibility
CLI
Create, destroy, list, copy, rename, transform, configure, evict buckets
GET, PUT, APPEND, PROMOTE, and other operations on objects
Cluster and node management
Mountpath (disk) management
Attach, detach, and monitor remote clusters
Start, stop, and monitor downloads
Distributed shuffle
User account and access management
Job (aka xaction) management
Tutorials
Tutorials
Videos
Power tools and extensions
Reading, writing, and listing archives
Distributed Shuffle
Downloader
Extract, Transform, Load
Tools and utilities
Benchmarking and tuning Performance
AIS Load Generator: integrated benchmark tool
How to benchmark
Performance tuning and testing
Buckets and Backend Providers
Backend providers
Buckets
Storage Services
Storage Services
Checksumming: brief theory of operations
S3 compatibility
Cluster Management
Joining AIS cluster
Leaving AIS cluster
Global Rebalance
Troubleshooting
Configuration
Configuration
CLI to view and update cluster and node config
Observability
Observability
Prometheus
For developers
Getting started
Docker
Useful scripts
Profiling, race-detecting, and more
Batch operations
Batch operations
eXtended Actions (xactions)
CLI: job management
Topics
System files
aisnode command line
Traffic patterns
Highly available control plane
File access (experimental)
Downloader
On-disk layout
AIS Buckets: definition, operations, properties

License

MIT

Author

Alex Aizman (NVIDIA)

Extension points: exported contracts (how you extend this code)

Unpacker (Interface)

Every object that is going to use binary representation instead of JSON must implement the following two methods [7 implementers]

cmn/cos/bytepack.go
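The actual Go method signatures are not shown on this page, so the following is only a language-neutral illustration, written in Python, of the general pattern the doc comment describes: a type that supplies its own pack and unpack methods in place of JSON. The `NodeStat` class, its fields, and its wire layout are hypothetical and do not mirror cmn/cos/bytepack.go.

```python
import json
import struct


class NodeStat:
    """Hypothetical record with a hand-written binary encoding."""

    # Little-endian header: uint64 counter, then uint16 length of the name.
    _HEADER = struct.Struct("<QH")

    def __init__(self, name, bytes_sent):
        self.name = name
        self.bytes_sent = bytes_sent

    def pack(self):
        """Encode as fixed header followed by the UTF-8 name."""
        raw_name = self.name.encode("utf-8")
        return self._HEADER.pack(self.bytes_sent, len(raw_name)) + raw_name

    @classmethod
    def unpack(cls, buf):
        """Decode a buffer produced by pack()."""
        bytes_sent, name_len = cls._HEADER.unpack_from(buf, 0)
        start = cls._HEADER.size
        name = buf[start:start + name_len].decode("utf-8")
        return cls(name, bytes_sent)


stat = NodeStat("target-1", 123_456_789)
packed = stat.pack()
restored = NodeStat.unpack(packed)

# The same record as JSON, for a size comparison.
as_json = json.dumps({"name": stat.name, "bytes_sent": stat.bytes_sent}).encode()
```

The binary form omits field names and punctuation, so it is smaller than the JSON form here and can be decoded without parsing text. The trade-off is that both sides must agree on the exact layout.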

BackendProvider (Interface)

ais target's types and interfaces [7 implementers]

cluster/target.go

Bowner (Interface)

interface to Get current BMD instance (for implementation, see ais/bucketmeta.go) [45 implementers]

cluster/bmd.go

Creator (Interface)

Creator is an interface that describes the set of functions each shard creator should implement. [5 implementers]

dsort/extract/managers.go

Reader (Interface)

Reader is the interface a client works with to read in data and send it to an HTTP server [5 implementers]

devtools/readers/readers.go

Slistener (Interface)

Smap on-change listeners [5 implementers]

cluster/map.go

Validator (Interface)

(no doc) [40 implementers]

cmn/config.go

Opts (Interface)

(no doc) [12 implementers]

cmn/jsp/opts.go

Core symbols: most depended-on inside this repo

CheckFatal

called by 1054

devtools/tassert/tassert.go

cmn/debug/debug_on.go

3rdparty/glog/glog.go

devtools/tassert/tassert.go

Shape

Method 4,263

Function 2,945

Struct 967

TypeAlias 94

Interface 80

Class 40

FuncType 22

Languages

Go 97%

Python 3%

Modules by API surface

cmn/config.go · 150 symbols

cmn/err.go · 134 symbols

ais/htcommon.go · 112 symbols

ais/htrun.go · 93 symbols

ais/proxy.go · 83 symbols

nl/listener.go · 82 symbols

3rdparty/glog/glog.go · 81 symbols

cmn/cos/io.go · 78 symbols

fs/fs.go · 75 symbols

cluster/map.go · 74 symbols

cmd/cli/commands/utils.go · 71 symbols

3rdparty/atomic/atomic.go · 71 symbols

Dependencies: from manifests, versioned

cloud.google.com/go v0.101.1 · 1×

cloud.google.com/go/compute v1.6.1 · 1×

cloud.google.com/go/iam v0.3.0 · 1×

cloud.google.com/go/storage v1.22.0 · 1×

github.com/Azure/azure-pipeline-go v0.2.3 · 1×

github.com/Azure/azure-storage-blob-go v0.15.0 · 1×

github.com/NVIDIA/go-tfdata v0.3.1 · 1×

github.com/OneOfOne/xxhash v1.2.8 · 1×

github.com/VividCortex/ewma v1.2.0 · 1×

github.com/acarl005/stripansi v0.0.0-2018011610285 · 1×

github.com/andybalholm/brotli v1.0.4 · 1×

github.com/aws/aws-sdk-go v1.44.13 · 1×

For agents

$ claude mcp add aistore \
  -- python -m otcore.mcp_server <graph>
