github.com/deepseek-ai/smallpond @v0.15.0 sqlite

1,148 symbols 4,339 edges 60 files 229 documented · 20%
README

smallpond

A lightweight data processing framework built on DuckDB and 3FS.

Features

  • 🚀 High-performance data processing powered by DuckDB
  • 🌍 Scalable to handle PB-scale datasets
  • 🛠️ Easy operations with no long-running services

Installation

Python 3.8 to 3.12 is supported.

pip install smallpond
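
Before installing, it can help to confirm that the interpreter falls inside the supported range. The following is a minimal sketch; the helper name `is_supported` and the two constants are hypothetical and not part of smallpond, and the bounds are copied from the sentence above.

```python
import sys

# Supported range as stated in this README: Python 3.8 through 3.12 inclusive.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 12)


def is_supported(version=None):
    """Return True if the (major, minor) version lies in the documented range.

    `version` may be any tuple starting with (major, minor); it defaults to
    the running interpreter. This helper is illustrative, not a smallpond API.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_SUPPORTED <= (major, minor) <= MAX_SUPPORTED


if __name__ == "__main__":
    print("supported" if is_supported() else "unsupported")
```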

Quick Start

# Download example data
wget https://duckdb.org/data/prices.parquet

import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
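
The Quick Start repartitions by `ticker` before running a per-partition `GROUP BY`. The pure-Python sketch below illustrates why that ordering gives correct results: once rows are hash-partitioned on the grouping key, every ticker lives in exactly one partition, so the per-partition aggregates can simply be combined. The sample rows, the helper names, and the use of `zlib.crc32` as the hash are assumptions made for illustration; smallpond's own partitioning function may differ.

```python
import zlib


def hash_partition(rows, npartitions, key):
    """Assign each row to one of `npartitions` buckets by hashing row[key]."""
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        # crc32 is a deterministic stand-in hash, not smallpond's actual one.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        partitions[index].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Plain-Python analogue of:
    SELECT ticker, min(price), max(price) FROM ... GROUP BY ticker
    """
    result = {}
    for row in rows:
        lo, hi = result.get(row["ticker"], (row["price"], row["price"]))
        result[row["ticker"]] = (min(lo, row["price"]), max(hi, row["price"]))
    return result


# Made-up sample data standing in for prices.parquet.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.5},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 9.0},
]

partitions = hash_partition(rows, 3, "ticker")

# Aggregate each partition independently, then combine the results.
# No ticker spans two partitions, so update() never overwrites an entry.
merged = {}
for part in partitions:
    merged.update(min_max_by_ticker(part))

print(merged)
```

Without the repartition step, the same ticker could appear in several partitions and the per-partition minima and maxima would need a second merge pass.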

Documentation

For detailed guides and API reference:

  • Getting Started
  • API Reference

Performance

We executed the Gray Sort benchmark using smallpond on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving a throughput of 3.66TiB/min.
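
As a quick check of the arithmetic, the throughput can be recomputed from the figures quoted above. Because those inputs are already rounded, the result comes out at roughly 3.65 TiB/min, which agrees with the reported 3.66 TiB/min to within rounding. The per-node figure is a derived average that assumes the load is spread evenly over the 50 compute nodes; it is not a number reported by the benchmark.

```python
# Figures quoted in the README (already rounded).
data_tib = 110.5
elapsed_min = 30 + 14 / 60  # 30 minutes 14 seconds
compute_nodes = 50

# Cluster-wide throughput in TiB per minute.
throughput = data_tib / elapsed_min

# Derived average per compute node in GiB/s (assumes an even split).
per_node_gib_s = throughput * 1024 / 60 / compute_nodes

print(f"{throughput:.2f} TiB/min cluster-wide")
print(f"{per_node_gib_s:.2f} GiB/s per compute node (average)")
```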

Development

pip install .[dev]

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install .[docs]
cd docs
make html
python -m http.server --directory build/html

License

This project is licensed under the MIT License.

Core symbols most depended-on inside this repo

add_argument · called by 126 · smallpond/execution/driver.py
join · called by 100 · smallpond/execution/executor.py
join · called by 56 · smallpond/utility.py
DataSetPartitionNode · called by 35 · smallpond/logical/node.py
add_elapsed_time · called by 34 · smallpond/execution/task.py
to_arrow_table · called by 31 · smallpond/logical/dataset.py
map · called by 29 · smallpond/dataframe.py
get_output · called by 24 · smallpond/execution/task.py

Shape

Method 859
Class 154
Function 133
Route 2

Languages

Python 100%

Modules by API surface

smallpond/execution/task.py · 290 symbols
smallpond/logical/node.py · 137 symbols
smallpond/execution/scheduler.py · 99 symbols
smallpond/logical/dataset.py · 77 symbols
smallpond/execution/workqueue.py · 51 symbols
tests/test_execution.py · 50 symbols
smallpond/dataframe.py · 38 symbols
smallpond/logical/udf.py · 31 symbols
smallpond/execution/executor.py · 28 symbols
tests/test_dataframe.py · 23 symbols
smallpond/utility.py · 21 symbols
tests/test_fabric.py · 18 symbols

Dependencies from manifests, versioned

GPUtil 1.4.0 · 1×
cloudpickle 2.0.0 · 1×
duckdb 1.2.0 · 1×
fsspec 2023.12.2 · 1×
loguru 0.7.2 · 1×
lxml 4.9.3 · 1×
pandas 1.3.4 · 1×
plotly 5.22.0 · 1×
polars 0.20.9 · 1×
psutil 5.9.8 · 1×
py-libnuma 1.2 · 1×
pyarrow 16.1.0 · 1×

For agents

$ claude mcp add smallpond \
  -- python -m otcore.mcp_server <graph>