github.com/togethercomputer/RedPajama-Data @main sqlite

296 symbols 1,013 edges 58 files 81 documented · 27%

README

RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models

This repository contains the code for the RedPajama-V2 dataset. For more information on the dataset, check out our blog post. The dataset is also available on HuggingFace. For the code used for the RedPajama-1T dataset, please refer to the rp_v1 branch in this repo.

Dataset

RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated.

Document and Token Counts for the Annotated and deduplicated `head_middle` part of the dataset

The number of documents and tokens for the annotated and deduplicated head_middle part of the dataset is shown in the table below.

	# Documents	Estimated Token count (deduped)
en	14.5B	20.5T
de	1.9B	3.0T
fr	1.6B	2.7T
es	1.8B	2.8T
it	0.9B	1.5T
Total	20.8B	30.4T

Languages

English, German, French, Italian, Spanish

Setup

Configuration

Copy the file configs/rp_v2.0.conf to e.g. configs/default.conf and configure the environment variables. These will be used throughout the pipeline.

Build Docker image

To run with docker, build the docker image using

. configs/default.conf
cd app
docker build -t "${DOCKER_REPO}:" .

Also, make sure you have s5cmd installed and your S3 profile configured so that you can pull data from an S3 bucket.

You can run the steps of the pipeline without any containerized environment. However, the provided run scripts assume that both Docker and Apptainer are installed.

Running the Pipeline

The pipeline is composed of three steps, namely 1) preparing artifacts, 2) computing quality signals, and 3) deduplication.

Important: In case you are not running steps (1) and (2) with the provided scripts (i.e., docker containers built with the provided Dockerfile), make sure to set the PYTHONHASHSEED environment variable to a consistent value (e.g., 42) using

export PYTHONHASHSEED=42

This is to ensure consistency of hash functions used in the computation of DSIR weights.
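To illustrate why the seed matters, here is a minimal sketch (not taken from the repository) that starts fresh Python interpreters and compares the built-in `hash()` of a string. With a fixed seed the value repeats across processes; without one, string hashing is randomized per process. The helper name `str_hash_in_subprocess` is hypothetical.

```python
import os
import subprocess
import sys


def str_hash_in_subprocess(text, seed=None):
    """Return hash(text) as computed by a freshly started Python interpreter."""
    env = dict(os.environ)
    # Start from a clean slate so an inherited seed does not leak in.
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    first = str_hash_in_subprocess("redpajama", seed=42)
    second = str_hash_in_subprocess("redpajama", seed=42)
    # Two separate processes with the same seed agree on the hash value.
    print(first == second)
```

Any component that buckets n-grams with the built-in hash would therefore produce different features from run to run unless the seed is pinned.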

1. Create Artifacts

This part of the pipeline creates the artifacts that are used in the subsequent steps. This includes building quality classifiers, training bag-of-ngram generative models for importance weight computation, fetching the list of bad words from the LDNOOBW repo, and fetching the most recent list of blacklisted urls from the UT1 blacklist.

As a first step, download the English Wikipedia reference classifier from here and place it in ${DATA_ROOT}/wikiref-model/en/en-model.bin. This is the same fastText classifier that was used in RedPajama-V1.

To create the remaining artifacts, make sure that the environment variables are set in the config file. Then, from the root directory of the repository, run

bash scripts/run_prep_artifacts.sh \
  --config configs/rp_v2.0.conf \
  --listings /path/to/listings/file.txt \
  --max_workers 32

where /path/to/listings/file.txt is a file that contains the keys to the ccnet data that you want to process (e.g., 2023-06/0000/en_head.json.gz).
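A listings file of this kind could be generated with a few lines of Python. The sketch below assumes the key layout `<snapshot>/<shard>/<lang>_<bucket>.json.gz`, which is inferred only from the single example key above; the function name `listing_keys` and the shard range are hypothetical.

```python
from itertools import product


def listing_keys(snapshots, shards, languages, buckets=("head", "middle")):
    """Build ccnet keys, assuming <snapshot>/<shard>/<lang>_<bucket>.json.gz."""
    return [
        f"{snapshot}/{shard:04d}/{lang}_{bucket}.json.gz"
        for snapshot, shard, lang, bucket in product(
            snapshots, shards, languages, buckets
        )
    ]


if __name__ == "__main__":
    keys = listing_keys(["2023-06"], range(2), ["en", "de"])
    # One key per line, as expected for a listings file.
    print("\n".join(keys))
```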

You can set the max_workers flag to the number of parallel processes you want to use.

This step will generate an id which you can store in the environment variable ARTIFACTS_ID for the next step.

2. Compute Quality Signals

The second step of the pipeline computes the quality signals, including the minhash signatures used for fuzzy deduplication in the subsequent step. To run this step, make sure the environment variables are set in the config file. Then, from the root directory of the repository, run

bash scripts/apptainer_run_quality_signals.sh \
  --config configs/rp_v2.0.conf \
  --dump_id "2022-49" \
  --input_base_uri "file:///path/to/data/root" \
  --output_base_uri "file:///path/to/output/data/root" \
  --max_docs -1

3. Deduplication

The third component of the pipeline consists of deduplication steps. Here we provide code to run exact and fuzzy deduplication.

Exact Deduplication using a Bloomfilter

Content-based deduplication is implemented in app/src/bloomfilter.py. It can be run independently of the previous step, but the data needs to be stored in an S3 bucket. For this step, from the app directory, run:

python3 app/src/bloomfilter.py \
  --listings /path/to/listings/file.txt \
  --input_base_uri "s3://path/to/ccnet/data" \
  --output_dir "/path/to/output" \
  --s3_profile "..." \
  --endpoint_url "..." \
  --parallel_readers 32 \
  --batch_size 10 \
  --capacity "..." \
  --error_rate "..."

It is important to choose a capacity larger than the number of documents; otherwise the error_rate is no longer guaranteed and more false positives will appear. The implementation is based on the pybloomfiltermmap3 library.
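The effect of an undersized capacity can be estimated with the standard textbook Bloom filter formulas. The sketch below uses those formulas only; it is an illustration and not necessarily how pybloomfiltermmap3 sizes its filter internally. The function names are hypothetical.

```python
import math


def bloom_parameters(capacity, error_rate):
    """Textbook-optimal number of bits m and hash functions k."""
    m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    k = max(1, round(m / capacity * math.log(2)))
    return m, k


def false_positive_rate(m, k, n_inserted):
    """Approximate false-positive probability after n_inserted insertions."""
    return (1.0 - math.exp(-k * n_inserted / m)) ** k


if __name__ == "__main__":
    capacity, error_rate = 1_000_000, 0.01
    m, k = bloom_parameters(capacity, error_rate)
    # Compare the filter at its design load with overfilled filters.
    for n in (capacity, 2 * capacity, 4 * capacity):
        print(n, false_positive_rate(m, k, n))
```

At the design capacity the estimated rate stays close to the requested error_rate; once more documents are inserted than the filter was sized for, the rate rises steeply, which in exact deduplication means distinct documents are wrongly dropped as duplicates.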

Fuzzy Deduplication with Locality Sensitive Hashing

In the third step of the pipeline, we run locality sensitive hashing on the minhash signatures generated in the quality signals step (step 2). To run this step, make sure that you use the same configuration as in the quality signals step. Then, from the root directory of the repository, run

bash scripts/apptainer_run_lsh.sh \
  --config configs/rp_v2.0.conf \
  --dump_id "2022-49" \
  --input_base_uri "file:///path/to/data/root" \
  --output_dir "/path/to/output" \
  --similarity "<similarity_threshold>" \
  --listings "/minhash/listings/file.txt" \
  --max_docs -1

The implementation is based on polars and was tested with 200M documents on a 64 core machine with 500G of RAM.
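For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of MinHash signatures combined with LSH banding. It is not the polars-based implementation from app/src/run_lsh.py; the shingle size, number of permutations, number of bands, and all function names are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Set of word n-grams of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(shingle_set, num_perm=64):
    """MinHash signature: for each seed, the minimum 64-bit hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """Documents that agree on every row of at least one band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((x, y) for i, x in enumerate(ordered) for y in ordered[i + 1:])
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog near the river bank",
        "b": "the quick brown fox jumps over the lazy dog near the river shore",
        "c": "completely unrelated text about training large language models",
    }
    sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    print(candidate_pairs(sigs))
```

With b bands of r rows each, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, so the choice of bands and rows determines the effective similarity threshold.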

Summary of Quality Signals

The second step of this pipeline computes the following set of quality signals. We hope to grow this list further over time as more signals are developed.

Quality Annotations

Annotation Tag	Description	Category	Reference
ccnet_bucket	head, middle or tail bucket of the perplexity score	CCNet	CCNet
ccnet_language_score	score of the language identification model	CCNet	CCNet
ccnet_length	number of characters	CCNet	CCNet
ccnet_nlines	number of lines	CCNet	CCNet
ccnet_original_length	number of characters before in-document line deduplication	CCNet	CCNet
ccnet_original_nlines	number of lines before in-document line deduplication	CCNet	CCNet
ccnet_perplexity	perplexity of an LM trained on Wikipedia	CCNet	CCNet
rps_doc_books_importance	Given a bag of {1,2}-wordgram model trained on Books p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc).	ML Heuristics	Importance Resampling (Xie et al.)
rps_doc_openwebtext_importance	Given a bag of {1,2}-wordgram model trained on OpenWebText p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc).	ML Heuristics	Importance Resampling (Xie et al.)
rps_doc_wikipedia_importance	Given a bag of {1,2}-wordgram model trained on Wikipedia articles p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc).	ML Heuristics	Importance Resampling (Xie et al.)
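The importance signals in the rows above all have the form log p(doc)/q(doc). A minimal sketch of that computation with hashed bag-of-{1,2}-wordgram models is shown below. The bucket count, the add-one smoothing, the use of `zlib.crc32`, and all names are assumptions made for illustration and may differ from app/src/core/quality_signals/importance_weights.py.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 1 << 20  # assumed size of the hashed feature space


def bucket_ids(text):
    """Hash the word 1-grams and 2-grams of a text into bucket indices."""
    words = text.lower().split()
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    # zlib.crc32 is stable across processes, unlike the built-in hash() of str.
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams]


def fit_log_probs(corpus):
    """Fit a bag-of-n-grams model; return a function bucket -> log probability."""
    counts = Counter(b for doc in corpus for b in bucket_ids(doc))
    total = sum(counts.values())

    def log_prob(bucket):
        # Add-one smoothing keeps unseen buckets at a non-zero probability.
        return math.log((counts[bucket] + 1) / (total + NUM_BUCKETS))

    return log_prob


def log_importance(doc, log_p, log_q):
    """log(p(doc) / q(doc)) under the two bag-of-n-grams models."""
    return sum(log_p(b) - log_q(b) for b in bucket_ids(doc))


if __name__ == "__main__":
    target = ["the treaty was signed in 1648", "the river flows through the valley"]
    source = ["click here to buy cheap shoes", "buy now and click to win"]
    log_p, log_q = fit_log_probs(target), fit_log_probs(source)
    for doc in ("the treaty was signed", "click to buy"):
        print(doc, log_importance(doc, log_p, log_q))
```

A positive value means the document looks more like the target domain (for example Wikipedia) than like the source domain; a negative value means the opposite.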

Core symbols most depended-on inside this repo

Symbol	Called by	File
	13	app/src/utilities/io/writer.py
write	11	app/src/utilities/io/writer.py
read	8	app/src/utilities/io/reader.py
get_callables_from_module	7	app/src/utilities/register/registry_utils.py
_compute_ngrams	6	app/src/core/document.py
signal_schema	6	app/src/utilities/register/registry_utils.py
form_ngrams	6	app/src/utilities/text/ngrams.py
update	5	app/src/utilities/logging/trackers.py

Shape

Method 173

Class 71

Function 52

Languages

Python 100%

Modules by API surface

Module	Symbols
app/src/core/document.py	22
app/src/core/quality_signals/natural_language.py	20
app/src/core/quality_signals/lines.py	20
app/src/core/quality_signals/importance_weights.py	18
app/src/token_count.py	16
app/src/core/quality_signals/repetitions.py	15
app/src/core/quality_signals/content.py	15
app/src/run_lsh.py	14
app/src/bloomfilter.py	14
app/src/pipeline.py	12
app/src/utilities/io/writer.py	11
app/src/core/quality_signals/classifiers.py	11

For agents

$ claude mcp add RedPajama-Data \
  -- python -m otcore.mcp_server <graph>
