MCPcopy
hub / github.com/amazon-science/mm-cot

github.com/amazon-science/mm-cot @main sqlite

repository ↗ · DeepWiki ↗
2,091 symbols 5,637 edges 125 files 717 documented · 34%
README

Multimodal Chain-of-Thought Reasoning in Language Models

"Imagine learning a textbook without figures or tables."

Multimodal-CoT incorporates vision features in a decoupled training framework. The framework consists of two training stages: (i) rationale generation and (ii) answer inference. Both stages share the same model architecture but differ in the input and output.

Requirements

Install all required python dependencies:

pip install -r requirements.txt

Datasets

Download the dataset from the following repository:

https://github.com/lupantech/ScienceQA/tree/main/data

The vision features (detr, resnet, clip, vit) are available at https://huggingface.co/cooelf/vision_features/tree/main

Alternatively, you may download the extracted vision features (detr, resnet, clip) from vision_features and unzip the files under vision_features

Extract Features (optional)

The processed vision features for ScienceQA are available at https://huggingface.co/cooelf/vision_features/tree/main.

The following instructions show how we obtain those features.

Download the image files from Google Drive and unzip all the images (train, dev, test) in the same folder (). The structure should be:

images
├── 1
│   └── image.png
├── 2
│   └── image.png
├── 3
│   └── image.png
├── 5
│   └── image.png
├── 7
│   └── image.png

Run extract_features.py --data_root images --output_dir vision_features --img_type vit

If you hope to use your own images, please structure those images in the way above, or modify the script extract_features.py.

Extract Captions (optional)

The processed captions for ScienceQA are available at data/instruct_captions.json.

The following instructions show how we obtain those features.

Intall lavis and prepare Vicuna weights to use InstructBLIP for caption extraction.

https://github.com/salesforce/LAVIS/tree/f982acc73288408bceda2d35471a8fcf55aa04ca/projects/instructblip

Assume that the images are stored in the images folder.

python extract_caption.py

Instructions

Training

# rationale generation
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg rationale --img_type vit \
    --bs 2 --eval_bs 4 --epoch 50 --lr 5e-5 --output_len 512 \
    --use_caption --use_generate --prompt_format QCM-E \
    --output_dir experiments

# answer inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python main_central.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg answer --img_type vit \
    --bs 4 --eval_bs 8 --epoch 50 --lr 5e-5 --output_len 64 \
    --use_caption --use_generate --prompt_format QCMG-A \
    --output_dir experiments \
    --eval_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_eval.json \
    --test_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_test.json

Inference

Our trained models are available at https://huggingface.co/cooelf/mm-cot/tree/main. To use our trained models, please put the them under the models folder.

# rationale generation
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg rationale --img_type vit \
    --bs 2 --eval_bs 4  --epoch 50 --lr 5e-5 --output_len 512 \
    --use_caption --use_generate --prompt_format QCM-E \
    --output_dir experiments
    --evaluate_dir models/mm-cot-large-rationale

# answer inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python main_central.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg answer --img_type vit \
    --bs 4 --eval_bs 8 --epoch 50 --lr 5e-5 --output_len 64  \
    --use_caption --use_generate --prompt_format QCMG-A \
    --output_dir experiments \
    --eval_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_eval.json \
    --test_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_test.json \
    --evaluate_dir models/mm-cot-large-answer

Citing MM-CoT

@article{zhang2023multicot,
  title={Multimodal Chain-of-Thought Reasoning in Language Models},
  author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Zhao, Hai and Karypis, George and Smola, Alex},
  journal={arXiv preprint arXiv:2302.00923},
  year={2023}
}

License

This project is licensed under the Apache-2.0 License.

Acknowledgement

Part of our codes are adapted from ScienceQA, Transformers, pytorch-image-models.

We thank Pan Lu for providing parameter size for ScienceQA baselines.

Core symbols most depended-on inside this repo

_cfg
called by 108
timm/models/efficientnet.py
_cfg
called by 79
timm/models/resnet.py
_create_resnet
called by 79
timm/models/resnet.py
get
called by 52
timm/models/features.py
format
called by 47
timm/utils/log.py
build_model_with_cfg
called by 44
timm/models/helpers.py
_dcfg
called by 43
timm/models/nfnet.py
_create_normfreenet
called by 43
timm/models/nfnet.py

Shape

Function 970
Method 853
Class 268

Languages

Python100%

Modules by API surface

timm/models/efficientnet.py140 symbols
timm/models/resnet.py102 symbols
timm/models/byobnet.py71 symbols
timm/models/nfnet.py68 symbols
timm/data/auto_augment.py59 symbols
timm/models/resnetv2.py56 symbols
timm/models/vision_transformer.py55 symbols
timm/models/mlp_mixer.py51 symbols
timm/models/regnet.py47 symbols
timm/models/inception_v3.py44 symbols
timm/models/levit.py42 symbols
timm/models/swin_transformer.py37 symbols

Dependencies from manifests, versioned

evaluate0.4.0 · 1×
huggingface-hub0.4.0 · 1×
nltk3.6.6 · 1×
numpy1.23.2 · 1×
openai0.23.0 · 1×
pandas1.4.3 · 1×
rich13.3.2 · 1×
rouge1.0.1 · 1×
rouge_score0.1.2 · 1×
transformers4.30.0 · 1×

For agents

$ claude mcp add mm-cot \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact