Tile Language (tile-lang) is a concise domain-specific language designed to streamline the development of high-performance GPU/CPU kernels (e.g., GEMM, Dequant GEMM, FlashAttention, LinearAttention). By employing a Pythonic syntax with an underlying compiler infrastructure on top of TVM, tile-lang allows developers to focus on productivity without sacrificing the low-level optimizations necessary for state-of-the-art performance.

- T.gemm_sp for 2:4 sparse tensor core support; check out Pull Request #526 for details.
- T.print for printing variables/buffers (docs) and a memory layout plotter (examples/plot_layout).

Although tile-lang aims to be portable across a range of devices, it has been specifically tested and validated on the following: for NVIDIA GPUs, this includes the H100 (with Auto TMA/WGMMA support), A100, V100, RTX 4090, RTX 3090, and RTX A6000; for AMD GPUs, it includes the MI250 (with Auto MatrixCore support) and the MI300X (with Async Copy support).
tile-lang provides the building blocks to implement a wide variety of operators. Some examples include:
Within the examples directory, you will also find more complex kernels, such as convolutions and forward/backward passes for FlashAttention. More operators will be added over time.
TileLang achieves exceptional performance across a variety of computational patterns. Comprehensive benchmark scripts and settings are available at tilelang-benchmark. Below are selected results showcasing its capabilities:
MLA Decoding Performance on H100


Flash Attention Performance on H100

Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)

Dequantize Matmul Performance on A100

The quickest way to get started is to install the latest release from PyPI:
pip install tilelang
Alternatively, you can install directly from the GitHub repository:
pip install git+https://github.com/tile-ai/tilelang
Or install locally:
# install required system dependencies
sudo apt-get update
sudo apt-get install -y python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev
pip install -e . -v # remove -e option if you don't want to install in editable mode, -v for verbose output
We currently provide three ways to install tile-lang from source:
- Install from Source (using your own TVM installation)
- Install from Source (using the bundled TVM submodule)
- Install Using the Provided Script
For users who want access to the latest features and improvements before official releases, we provide nightly builds of tile-lang.
pip install tilelang -f https://tile-ai.github.io/whl/nightly
# or pip install tilelang --find-links https://tile-ai.github.io/whl/nightly
Note: Nightly builds contain the most recent code changes but may be less stable than official releases. They're ideal for testing new features or if you need a specific bugfix that hasn't been released yet.
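After installing by any of the routes above, it can be useful to confirm which build pip actually resolved. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_tilelang_version` is hypothetical and not part of tile-lang.

```python
from importlib import metadata


def installed_tilelang_version():
    """Return the installed tilelang version string, or None if it is not installed."""
    try:
        # Reads the version recorded in the installed package metadata.
        return metadata.version("tilelang")
    except metadata.PackageNotFoundError:
        return None


version = installed_tilelang_version()
print(version if version is not None else "tilelang is not installed")
```

This reads package metadata without importing tilelang itself, so it still works on a machine where the GPU runtime is not yet set up.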
In this section, you'll learn how to write and execute a straightforward GEMM (matrix multiplication) kernel using tile-lang, followed by techniques for layout optimizations, pipelining, and L2-cache–friendly swizzling.
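Before the tile-lang kernel, it may help to see the tiling scheme in plain NumPy. The sketch below is a CPU reference only, not tile-lang code: the function name `tiled_matmul_relu` and its parameters are illustrative assumptions. The two outer loops stand in for the block grid, the inner loop walks the K dimension one tile at a time, and accumulation happens in float32 before the result is cast back to the input dtype.

```python
import numpy as np


def tiled_matmul_relu(a, b, block_m=64, block_n=64, block_k=64):
    """NumPy reference for relu(a @ b), computed tile by tile."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.empty((m, n), dtype=a.dtype)
    # The two outer loops play the role of the block grid.
    for m0 in range(0, m, block_m):
        for n0 in range(0, n, block_n):
            # Accumulate in float32; edge tiles may be smaller than a full block.
            acc = np.zeros((min(block_m, m - m0), min(block_n, n - n0)), dtype=np.float32)
            # Walk the K dimension one tile at a time.
            for k0 in range(0, k, block_k):
                a_tile = a[m0:m0 + block_m, k0:k0 + block_k].astype(np.float32)
                b_tile = b[k0:k0 + block_k, n0:n0 + block_n].astype(np.float32)
                acc += a_tile @ b_tile
            # Apply ReLU on the accumulator, then cast back when writing out.
            c[m0:m0 + block_m, n0:n0 + block_n] = np.maximum(acc, 0).astype(a.dtype)
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((96, 80)).astype(np.float16)
b = rng.standard_normal((80, 72)).astype(np.float16)
out = tiled_matmul_relu(a, b, block_m=32, block_n=32, block_k=32)
ref = np.maximum(a.astype(np.float32) @ b.astype(np.float32), 0)
print(np.allclose(out.astype(np.float32), ref, rtol=1e-2, atol=1e-2))
```

The shapes in the usage lines are deliberately not multiples of the block size, so the edge tiles are exercised as well.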
Below is an example that demonstrates more advanced features: layout annotation, parallelized copy, and swizzle for improved L2 cache locality. This snippet shows how to adapt your kernel to maximize performance on complex hardware.
import torch
import tilelang
import tilelang.language as T

# @tilelang.jit(target="cuda")
# target currently can be "cuda" or "hip" or "cpu".
# if not specified, it will be inferred from the input tensors during compile time
@tilelang.jit
def matmul_relu(
    A, B,
    block_M: int = 64,
    block_N: int = 64,
    block_K: int = 64,
    dtype: T.dtype = T.float16,
    accum_dtype: T.dtype = T.float32,
):
    # declare compilation shape constants
    M, N, K = T.const('M, N, K')
    # annotate input tensor shapes
    A: T.Tensor[[M, K], dtype]
    B: T.Tensor[[K, N], dtype]
    # allocate output tensor
    C = T.empty([M, N], dtype)
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_K), dtype)
        B_shared = T.alloc_shared((block_K, block_N), dtype)
        C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
        # Enable rasterization for better L2 cache locality (Optional)
        # T.use_swizzle(panel_size=10, enable=True)
        # Clear local accumulation
        T.clear(C_local)
        for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
            # Copy tile of A
            # This is syntactic sugar for a parallelized copy
            T.copy(A[by * block_M, ko * block_K], A_shared)
            # Copy tile of B
            T.copy(B[ko * block_K, bx * block_N], B_shared)
            # Perform a tile-level GEMM on the shared buffers
            # Currently we dispatch to the cute/hip on Nvidia/AMD GPUs
            T.gemm(A_shared, B_shared, C_local)
        # relu
        for i, j in T.Parallel(block_M, block_N):
            C_local[i, j] = T.max(C_local[i, j], 0)
        # Copy result back to global memory
        T.copy(C_local, C[by * block_M, bx * block_N])
    # You can write multiple CUDA kernels in one function; they execute sequentially
    # with T.Kernel(...) as ...
    # Return the tensor; you can also return multiple tensors
    return C
M, N, K = 1024, 1024, 1024
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c_ref = torch.relu(a @ b)
# Call the kernel
c = matmul_relu(a, b)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)
# Call the kernel with overwritten compilation constants
c = matmul_relu(a, b, block_M=128, block_N=128, block_K=64)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)
# Retrieve the compiled kernel
kernel = matmul_relu.compile(a, b) # use torch.Tensor
kernel = matmul_relu.compile(  # use T.Tensor as placeholder
    T.Tensor((M, K), T.float16),
    T.Tensor((K, N), T.float16)
)
kernel = matmul_relu.compile(  # directly specify the shape constants
    M=M, N=N, K=K,
    block_M=128, block_N=128, block_K=64
)
print(kernel.get_kernel_source())
c = kernel(a, b)
# Profile latency with the kernel
profiler = kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)
latency = profiler.do_bench()
print(f"Latency: {latency} ms")
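A raw latency is easier to compare across problem sizes once it is converted to throughput. The helper below is a small sketch, assuming the latency is in milliseconds (as the print statement above suggests) and using the usual 2·M·N·K operation count for a dense GEMM; the name `gemm_tflops` is hypothetical and not a tile-lang API.

```python
def gemm_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Convert a GEMM latency in milliseconds into achieved TFLOPS."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / (latency_ms * 1e-3) / 1e12


# Example with a made-up latency; substitute the value returned by do_bench().
print(f"{gemm_tflops(1024, 1024, 1024, latency_ms=0.05):.2f} TFLOPS")
```

The count ignores the ReLU epilogue, which adds only M·N operations and is negligible next to the matrix product.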
In addition to GEMM, we provide a variety of examples to showcase the versatility and power of TileLang, including:
Check our tilelang v0.2.0 release plan for upcoming features.
TileLang is now used in the BitBLAS and AttentionEngine projects.
Join our Discord community for discussions, support, and collaboration!
We would like to express our gratitude to the TVM community for their invaluable contributions. The initial version of this project was mainly developed by LeiWang1999, chengyupku and nox-410 with supervision from Prof. Zhi Yang at Peking University. Part of this work was carried out during an internship at Microsoft Research, where Dr. Lingxiao Ma, Dr. Yuqing Xia, Dr. Jilong Xue, and Dr. Fan Yang offered valuable advice and support. We deeply appreciate their mentorship and contributions.