
github.com/meizhong986/WhisperJAV @v1.8.14 sqlite


5,270 symbols 19,173 edges 364 files 4,322 documented · 82%

README

WhisperJAV

A subtitle generator for Japanese Adult Videos.

The Idea

Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the spontaneous and noisy domain of JAV. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.

1. The Acoustic Profile

JAV audio is defined by "acoustic hell" and a low Signal-to-Noise Ratio (SNR), characterized by:

Non-Verbal Vocalisations (NVVs): A high density of physiological sounds (heavy breathing, gasps, sighs) and "obscene sounds" that lack clear harmonic structure.
Spectral Mimicry: These vocalizations often possess "curve-like spectrum features" that mimic the formants of fricative consonants or Japanese syllables (e.g., fu), acting as accidental adversarial examples that trick the model into recognizing words where none exist.
Extreme Dynamics: Volatile shifts in audio intensity, ranging from faint whispers (sasayaki) to high-decibel screams, which confuse standard gain control and attention mechanisms.
Linguistic Variance: The prevalence of theatrical onomatopoeia and Role Language (Yakuwarigo) containing exaggerated intonations and slang absent from standard corpora.

2. Temporal Drift and Hallucination

While standard ASR models are typically trained on short, curated clips, JAV content comprises long-form media often exceeding 120 minutes. Research indicates that processing such extended inputs causes contextual drift and error accumulation. Specifically, extended periods of "ambiguous audio" (silence or rhythmic breathing) cause the Transformer's attention mechanism to collapse, triggering repetitive hallucination loops where the model generates unrelated text to fill the acoustic void.

3. The Pre-processing Paradox & Fine-Tuning Risks

Standard audio engineering intuition—such as aggressive denoising or vocal separation—often fails in this domain. Because Whisper relies on specific log-Mel spectrogram features, generic normalization tools can inadvertently strip high-frequency transients essential for distinguishing consonants, resulting in "domain shift" and erroneous transcriptions. Consequently, audio processing requires a "surgical," multi-stage approach (like VAD clamping) rather than blanket filtering.

Furthermore, while fine-tuning models on domain-specific data can be effective, it presents a high risk of overfitting. Due to the scarcity of high-quality, ethically sourced JAV datasets, fine-tuned models often become brittle, losing their generalization capabilities and leading to inconsistent "hit or miss" quality outputs.

WhisperJAV is an attempt to address the failure points above. Its inference pipelines do the following:

Acoustic Filtering: Deploys scene-based segmentation and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3].
Linguistic Adaptation: Normalizes domain-specific terminology and preserves onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in Kansai-ben) that standard BPE tokenizers fail to parse [4, 5].
Defensive Decoding: Tunes log-probability thresholding and no_speech_threshold to systematically discard low-confidence outputs (hallucinations), while utilizing regex filters to clean non-lexical markers (e.g., (moans)) from the final subtitle track [6, 7].
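
As a rough illustration of the defensive decoding step, the sketch below drops low-confidence segments and strips parenthesized markers. It is not WhisperJAV's implementation: the function name `filter_segments` is hypothetical, the threshold values are simply openai-whisper's stock defaults rather than the project's tuned values, and the two checks are applied independently for simplicity. The segment keys (`text`, `avg_logprob`, `no_speech_prob`) follow the segment dictionaries returned by openai-whisper.

```python
import re

# Illustrative thresholds (openai-whisper's stock defaults); the values
# WhisperJAV actually tunes per sensitivity level are not shown in this README.
LOGPROB_THRESHOLD = -1.0
NO_SPEECH_THRESHOLD = 0.6

# Matches bracketed non-lexical markers such as "(moans)", ASCII or full-width.
MARKER_RE = re.compile(r"[\(（][^\)）]*[\)）]")


def filter_segments(segments,
                    logprob_threshold=LOGPROB_THRESHOLD,
                    no_speech_threshold=NO_SPEECH_THRESHOLD):
    """Drop low-confidence segments and strip non-lexical markers."""
    kept = []
    for seg in segments:
        if seg["avg_logprob"] < logprob_threshold:
            continue  # decoder was not confident in this text
        if seg["no_speech_prob"] > no_speech_threshold:
            continue  # segment is probably silence or noise
        text = MARKER_RE.sub("", seg["text"]).strip()
        if text:  # skip segments that contained only markers
            kept.append({**seg, "text": text})
    return kept
```

Raising `logprob_threshold` or lowering `no_speech_threshold` makes the filter stricter, which trades missed dialogue for fewer hallucinated lines.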

Quick Start

GUI (Recommended for most users)

whisperjav-gui

A window opens. Add your files, pick a mode, click Start.

Command Line

# Basic usage
whisperjav video.mp4

# Specify mode and sensitivity
whisperjav audio.mp3 --mode balanced --sensitivity aggressive

# Process a folder
whisperjav /path/to/media_folder --output-dir ./subtitles
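
For scripted use, the same commands can be driven from Python. The sketch below only assembles the flags shown above (`--mode`, `--sensitivity`, `--output-dir`) and shells out to the `whisperjav` executable. The helper names `build_command` and `run` are hypothetical and not part of WhisperJAV.

```python
import shutil
import subprocess


def build_command(media_path, mode=None, sensitivity=None, output_dir=None):
    """Assemble a whisperjav argv list from the flags shown in this README."""
    cmd = ["whisperjav", str(media_path)]
    if mode:
        cmd += ["--mode", mode]
    if sensitivity:
        cmd += ["--sensitivity", sensitivity]
    if output_dir:
        cmd += ["--output-dir", str(output_dir)]
    return cmd


def run(media_path, **options):
    """Run whisperjav on one file or folder; raises if the CLI is missing or fails."""
    if shutil.which("whisperjav") is None:
        raise FileNotFoundError("whisperjav executable not found on PATH")
    return subprocess.run(build_command(media_path, **options), check=True)
```

For example, `run("audio.mp3", mode="balanced", sensitivity="aggressive")` would execute the second command listed above.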

Features

Processing Modes

Seven pipelines, each with different tradeoffs. Scene detection, speech enhancement, and speech segmenter are configurable for all pipelines that support them — the table shows defaults.

Pipeline	Backend	Scene Detection	Speech Enhancer	Speech Segmenter	Best For
faster	Faster-Whisper (turbo)	—	—	—	Speed, clean audio
fast	OpenAI Whisper	Auditok	—	—	General use, mixed quality
balanced	Faster-Whisper	Auditok	Configurable	Silero	Default. Noisy, dialogue-heavy
fidelity	OpenAI Whisper	Auditok	Configurable	Silero	Maximum accuracy, slower
transformers	HuggingFace Kotoba	Optional	Configurable	Optional	Kotoba Japanese models
qwen	Qwen3-ASR	Semantic	Configurable	Silero	Qwen ASR with forced alignment
anime	anime-whisper	Semantic	Configurable	TEN	Anime/JAV-tuned dialogue

Sensitivity Settings

Conservative: Higher thresholds, fewer hallucinations. Good for noisy content.
Balanced: Default. Works for most content.
Aggressive: Lower thresholds, catches more dialogue. Good for whisper/ASMR content.

ChronosJAV

ChronosJAV is a dedicated pipeline for transcribing with models that do not produce their own timestamps — LLMs, Qwen ASR, anime-whisper, Kotoba, and similar. It handles text generation and timestamp alignment as separate stages, so any model that can produce text from audio can be plugged in.

Qwen3-ASR

Uses Qwen3-ASR models (1.7B, 0.6B) for text generation with a local forced aligner for word-level timestamps. Three processing modes:

Mode	How It Works	Best For
Assembly	Text first, then align timestamps. Batches scenes up to 120s.	Most content
Context-Aware	ASR and alignment together on full scenes (30-90s).	More context per utterance
VAD Slicing (default)	Coupled ASR+alignment with step-wise fallback.	More detail, less context
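
The Assembly mode's batching of scenes up to 120s can be pictured as a greedy grouping. The sketch below is an assumption about how such batching might work, not the project's code: the function name `batch_scenes` is hypothetical, and it assumes the limit applies to the summed duration of the scenes in a batch.

```python
def batch_scenes(scenes, max_batch_seconds=120.0):
    """Greedily group consecutive (start, end) scenes.

    A batch is closed as soon as adding the next scene would push its
    total duration past max_batch_seconds. A single scene longer than
    the limit ends up in a batch of its own.
    """
    batches = []
    current = []
    current_len = 0.0
    for start, end in scenes:
        duration = end - start
        if current and current_len + duration > max_batch_seconds:
            batches.append(current)
            current, current_len = [], 0.0
        current.append((start, end))
        current_len += duration
    if current:
        batches.append(current)
    return batches
```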

Anime-Whisper

Uses litagin/anime-whisper, a Whisper large-v3 model fine-tuned on anime and JAV dialogue. Greedy decoding with TEN VAD segmentation for tight subtitle timing. Also supports Kotoba v2.0 and v2.1 (lighter models; v2.1 adds punctuation).

Future: LLM-based transcription

The decoupled architecture means any model that generates text from audio can be wired in — including future LLM-based ASR models. New models can be deployed via YAML configuration without pipeline code changes.

Two-Pass Ensemble Mode

Runs your video through two different pipelines and merges results. Different models catch different things.

# Pass 1 with transformers, Pass 2 with balanced
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced

# Serial mode: finish each file before starting the next
whisperjav video.mp4 --ensemble --ensemble-serial --pass1-pipeline balanced --pass2-pipeline fidelity

Merge strategies:

- pass1_primary (default) / pass2_primary: Prioritize one pass, fill gaps from the other
- smart_merge: Intelligent overlap detection
- full_merge: Combine everything from both passes
- pass1_overlap / pass2_overlap: Overlap-aware priority merge
- longest: Keep whichever pass produced the longer subtitle for each segment
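
To make the "prioritize one pass, fill gaps from the other" idea concrete, here is a minimal sketch of a primary-style merge. It is an illustration under assumptions, not WhisperJAV's merge code: the names `overlaps` and `merge_primary` are hypothetical, and segments are assumed to be dictionaries with `start`, `end` (seconds), and `text` keys.

```python
def overlaps(a, b):
    """True if two segments share any span of time."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_primary(primary, secondary):
    """Keep every primary segment; add secondary segments that overlap none of them."""
    merged = list(primary)
    for seg in secondary:
        if not any(overlaps(seg, p) for p in primary):
            merged.append(seg)  # fills a gap the primary pass left empty
    return sorted(merged, key=lambda s: s["start"])
```

Calling `merge_primary(pass1, pass2)` would correspond to a pass1-priority merge, and swapping the arguments to a pass2-priority merge. The overlap-aware and smart strategies listed above would need more logic than this.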

Ensemble presets: Save, load, and delete named ensemble configurations from the GUI. Reuse your tuned settings across sessions and across different pipeline combinations.

Serial mode (--ensemble-serial): Completes each file fully (Pass 1 → Pass 2 → Merge) before starting the next. See results as they finish instead of waiting for the entire batch.

BYOP: Faster Whisper XXL (v1.8.9+): Use Purfview's Faster Whisper XXL as Pass 2 in ensemble mode. Select "XXL Faster Whisper" as the Pass 2 pipeline, point to your faster-whisper-xxl.exe, and add any extra args. CLI: --pass2-pipeline xxl --xxl-exe /path/to/faster-whisper-xxl.exe

Speech Enhancement

Pre-process audio per-scene after scene detection. Use surgically — audio processing that alters the mel-spectrogram can introduce artefacts.

# ClearVoice denoising (48kHz, best quality)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice

# FFmpeg DSP filters (lightweight, always available)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer ffmpeg-dsp:loudnorm,denoise

# BS-RoFormer vocal isolation
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer bs-roformer

# Ensemble with different enhancers per pass
whisperjav video.mp4 --ensemble \
    --pass1-pipeline balanced --pass1-speech-enhancer clearvoice \
    --pass2-pipeline transformers --pass2-speech-enhancer none

Available backends:

Backend	Description	Models/Options
`none`	No enhancement (default)	-
`ffmpeg-dsp`	FFmpeg audio filters	`loudnorm`, `denoise`, `compress`, `highpass`, `lowpass`, `deess`
`clearvoice`	ClearerVoice denoising	`MossFormer2_SE_48K` (default), `FRCRN_SE_16K`
`zipenhancer`	ZipEnhancer 16kHz	`torch` (GPU), `onnx` (CPU)
`bs-roformer`	Vocal isolation	`vocals`, `other`
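
The enhancer values above follow a `backend` or `backend:option,option` shape (for example `ffmpeg-dsp:loudnorm,denoise`). The sketch below validates such a string against the table. It is a hypothetical helper, not WhisperJAV's parser, and it assumes that the colon syntax shown for `ffmpeg-dsp` applies to the other backends as well, which the README does not confirm.

```python
# Backends and options copied from the table above.
KNOWN_OPTIONS = {
    "none": set(),
    "ffmpeg-dsp": {"loudnorm", "denoise", "compress", "highpass", "lowpass", "deess"},
    "clearvoice": {"MossFormer2_SE_48K", "FRCRN_SE_16K"},
    "zipenhancer": {"torch", "onnx"},
    "bs-roformer": {"vocals", "other"},
}


def parse_enhancer_spec(spec):
    """Split 'backend:opt1,opt2' into (backend, [options]) and validate both parts."""
    backend, _, raw_options = spec.partition(":")
    if backend not in KNOWN_OPTIONS:
        raise ValueError(f"unknown enhancer backend: {backend!r}")
    options = [opt for opt in raw_options.split(",") if opt]
    unknown = [opt for opt in options if opt not in KNOWN_OPTIONS[backend]]
    if unknown:
        raise ValueError(f"unknown options for {backend}: {unknown}")
    return backend, options
```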

Output Formats

SRT (default) and WebVTT for HTML5 video players:

whisperjav video.mp4 --output-format vtt
whisperjav video.mp4 --output-format both    # generates .srt and .vtt

Also available as a dropdown in the GUI Advanced Options tab.
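
The two formats are close relatives: WebVTT starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds. The sketch below shows that conversion for simple files. It is an illustration of the format difference, not the converter WhisperJAV uses, and `srt_to_vtt` is a hypothetical name.

```python
import re

# SRT timestamps look like 00:00:01,000; WebVTT uses 00:00:01.000.
TIMESTAMP_RE = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert simple SRT text to WebVTT text."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:  # only rewrite timing lines, never subtitle text
            line = TIMESTAMP_RE.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

The numeric SRT indices are left in place, since WebVTT accepts them as optional cue identifiers.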

GUI

The GUI has four tabs:

Transcription Mode: Pipeline, sensitivity, model, language
Advanced Options: Output format, scene detection method, debug settings
Ensemble Mode: Two-pass configuration with presets, serial mode, and per-pass parameter customization
AI SRT Translate: Translate existing subtitle files

Settings persist across application restarts.

AI Translation

Generate subtitles and translate them in one step:

# Generate and translate
whisperjav video.mp4 --translate

# Or translate existing subtitles
whisperjav-translate -i subtitles.srt --provider deepseek

Supports Ollama (local, recommended), DeepSeek (cheap), Gemini (free tier), Claude, GPT-4, OpenRouter, GLM, Groq, and local LLMs.

Ollama Translation (Recommended for Local)

Run translation locally using Ollama — no cloud API, no API key required:

whisperjav-translate -i subtitles.srt --provider ollama

OllamaManager auto-starts the server, detects your GPU, and picks the best model for your VRAM:

VRAM	Recommended Model
CPU only	qwen2.5:3b
8 GB	qwen2.5:7b
12 GB	gemma3:12b
16 GB+	qwen2.5:14b
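
The table can be read as a simple threshold lookup. The sketch below encodes it that way. It is not the OllamaManager code: `recommend_ollama_model` is a hypothetical name, and treating GPUs with less than 8 GB like the CPU-only row is an assumption, since the table does not cover that case.

```python
def recommend_ollama_model(vram_gb):
    """Map available VRAM in GB (None for CPU only) to the table's recommended model."""
    if vram_gb is None or vram_gb < 8:
        return "qwen2.5:3b"   # CPU-only row; assumed for small GPUs too
    if vram_gb < 12:
        return "qwen2.5:7b"
    if vram_gb < 16:
        return "gemma3:12b"
    return "qwen2.5:14b"
```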

Local LLM Translation (Legacy)

Run translation entirely on your GPU — no cloud API, no API key required:

whisperjav-translate -i subtitles.srt --provider local

Zero-Config Setup: On first use, WhisperJAV automatically downloads and installs llama-cpp-python (~700MB). No manual installation needed. Batch size auto-adjusts to your model's context window.

Available models:

Model	VRAM	Notes
llama-8b	6GB+	Default (Llama 3.1 8B)
gemma-9b	8GB+	Gemma 2 9B (alternative)
llama-3b	3GB+	Llama 3.2 3B (low VRAM only)
auto	varies	Auto-selects based on available VRAM

Resume Support: If translation is interrupted, just run the same command again. It automatically resumes from where it left off using the .subtrans project file.

Supported Input Formats

Any format FFmpeg can read: MP4, MKV, AVI, MOV, WMV, FLV, WAV, MP3, FLAC, M4A, M4B (audiobooks), and many more.

What Makes It Work for JAV

Scene Detection

Splits audio at natural breaks instead of forcing fixed-length chunks. This prevents cutting off sentences mid-word.

Four methods are available:

- Semantic (default): Texture-based clustering using MFCC features, groups acoustically similar segments together
- Auditok: Energy-based detection, fast and reliable
- Silero: Neural VAD-based detection, better for noisy audio
- TEN: Used by ChronosJAV pipeline for tight subtitle timing
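
The energy-based idea behind splitting at natural breaks can be shown in a few lines. The sketch below is a standalone illustration, not Auditok's API or WhisperJAV's scene detector: `split_on_silence` and its parameters are hypothetical. It assumes per-frame energy values have already been computed and splits only where the signal stays quiet for long enough.

```python
def split_on_silence(energies, frame_seconds, threshold, min_silence_frames):
    """Return (start, end) times in seconds for regions of sustained energy.

    A region ends only after min_silence_frames consecutive frames fall
    below threshold, so short pauses inside a sentence do not split it.
    """
    regions = []
    start = None      # index of the first frame of the current region
    silent_run = 0    # consecutive quiet frames seen inside the region
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                end = i - silent_run + 1  # first quiet frame after the region
                regions.append((start * frame_seconds, end * frame_seconds))
                start, silent_run = None, 0
    if start is not None:  # audio ended while a region was still open
        end = len(energies) - silent_run
        regions.append((start * frame_seconds, end * frame_seconds))
    return regions
```

A larger `min_silence_frames` yields fewer, longer scenes, while a smaller one cuts more eagerly.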

Voice Activity Detection (VAD)

Identifies when someone is actually speaking vs. background noise or music. Reduces false transcriptions during quiet moments.

Japanese Post-Processing

Handles sentence-ending particles (ね, よ, わ, の)
Preserves aizuchi (うん, はい, ええ)
Recognizes dialect patterns (Kansai-ben, feminine/masculine speech)
Filters out common Whisper hallucinations

Hallucination Removal

Whisper sometimes generates repeated text or phrases that weren't spoken. WhisperJAV detects and removes these patterns.
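
Two common symptoms of these loops are a short unit repeated many times inside one line and the same line repeated across consecutive subtitles. The sketch below handles both in a simple way. It is an assumed approach for illustration only: `collapse_repeats` and `drop_duplicate_lines` are hypothetical names, and WhisperJAV's actual detection rules are not described in this README.

```python
import re


def collapse_repeats(text, max_repeats=2):
    """Shorten any unit of 1-10 characters repeated more than max_repeats times in a row."""
    # (.{1,10}?) captures the shortest repeating unit; \1{N,} requires at
    # least N further copies, i.e. more than max_repeats copies in total.
    pattern = re.compile(r"(.{1,10}?)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)


def drop_duplicate_lines(lines, max_consecutive=2):
    """Keep at most max_consecutive identical subtitle lines in a row."""
    kept = []
    run = 0
    for line in lines:
        if kept and line == kept[-1]:
            run += 1
        else:
            run = 1
        if run <= max_consecutive:
            kept.append(line)
    return kept
```

Legitimate speech can also repeat, so a rule like this would normally be combined with confidence scores rather than applied blindly.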

Content-Specific Recommendations

Content Type	Pipeline	Sensitivity	Notes
Drama / Dialogue Heavy	balanced

Core symbols most depended-on inside this repo

Symbol	Called by	File
get	1965	whisperjav/config/v4/registries/base_registry.py
debug	758	whisperjav/utils/config_editor_gui.py
info	526	whisperjav/utils/config_editor_gui.py
exists	378	whisperjav/config/v4/registries/base_registry.py
warning	315	whisperjav/utils/config_editor_gui.py
error	228	whisperjav/utils/config_editor_gui.py
open	159	whisperjav/webview_gui/assets/app.js
log	141	install.py

Shape

Method 3,184

Function 1,263

Class 810

Route 13

Languages

Python 96%

TypeScript 4%

Modules by API surface

whisperjav/webview_gui/assets/app.js · 187 symbols

tests/test_acceptance_v1_8_7b0.py · 140 symbols

tests/test_speech_segmentation.py · 110 symbols

tests/test_vad_threshold_padding_e2e.py · 102 symbols

tests/test_installer_comprehensive.py · 102 symbols

whisperjav/webview_gui/api.py · 80 symbols

tests/test_update_check_frontend.py · 79 symbols

tests/test_hardening.py · 75 symbols

tests/test_gui_settings.py · 74 symbols

tests/test_qwen_asr.py · 69 symbols

tests/manual/test_installation_scenarios.py · 65 symbols

tests/test_compute_type_auto.py · 57 symbols

Dependencies from manifests, versioned

PyYAML 6.0 · 1×

accelerate 0.26.0 · 1×

aiofiles · 1×

av 13.0.0 · 1×

colorama · 1×

datasets 3.0.0 · 1×

faster-whisper 1.1.0 · 1×

google-genai 1.39.0 · 1×

httpx 0.27.0 · 1×

huggingface-hub 0.25.0 · 1×

imageio 2.31.0 · 1×

imageio-ffmpeg 0.4.9 · 1×

For agents

$ claude mcp add WhisperJAV \
  -- python -m otcore.mcp_server <graph>
