A fully local multi-lingual Text-to-Speech system supporting 11 Indian languages with 21 voice variants and real-time voice cloning — no cloud APIs, no API keys, no internet connection required after first setup.
All inference runs on your machine using model weights stored in `models/`.
Voice cloning uses Coqui XTTS v2 (downloaded once, cached locally).
| Feature | Detail |
|---|---|
| 🌏 11 Languages | Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati |
| 🎤 21 Voice Variants | Male & female SYSPIN VITS voices for 10 languages, plus a Gujarati MMS voice |
| 🧬 Voice Cloning | Upload any 5–30 s WAV → synthesise in that voice via XTTS v2 |
| 🎭 Prosody Control | Speed · Pitch · Energy sliders + 9 style presets |
| ⚡ Fast Inference | 0.3–0.9 s per utterance on CPU |
| 🖥️ Web UI | Next.js frontend — language picker, clone mode, audio playback + WAV download |
| 🔌 REST API | FastAPI with auto-generated /docs (Swagger UI) |
| 📴 Fully Offline | After first model download everything runs without internet |
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI
# Create a virtual environment (recommended)
python3 -m venv tts
source tts/bin/activate # Windows: tts\Scripts\activate
pip install -r requirements.txt

GPU users: swap the `torch` line in `requirements.txt` for the CUDA wheel from pytorch.org before installing.
chmod +x start.sh
./start.sh

This script will:
- Check Python + Node dependencies (install missing ones automatically)
- Start the FastAPI backend on http://localhost:8000
- Wait for the API to be healthy
- Start the Next.js web UI on http://localhost:3000
- Print a summary and keep both processes alive (Ctrl+C stops both)
| Service | URL |
|---|---|
| Web UI | http://localhost:3000 |
| API | http://localhost:8000 |
| API Docs (Swagger) | http://localhost:8000/docs |
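The launcher's "wait for the API to be healthy" step can be sketched as a small poll loop. This is a hypothetical helper, not code from the repo; it only assumes the `/health` endpoint documented below:

```python
import time
import urllib.request

def wait_for_healthy(url="http://localhost:8000/health",
                     timeout=60.0, interval=1.0, probe=None):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse.

    `probe` is injectable for testing; by default it issues a real GET.
    Returns True once healthy, False on timeout.
    """
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5) as resp:
                    return resp.status == 200
            except OSError:
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

`start.sh` implements the same idea in shell; the Python version is handy when driving the API from test scripts.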
API server only:
python start_api.py # default: 0.0.0.0:8000
python start_api.py --port 8001 # custom port
python start_api.py --reload # hot-reload (dev mode)
python start_api.py --preload hi_female # preload a voice at startup

Web UI only (assumes API is already running):
cd web
npm install # first time only
npm run dev

Voice cloning works for: English, Hindi, Bengali, Gujarati, Marathi, Telugu, Kannada.
- Open http://localhost:3000
- Select "Custom Voice Clone" mode
- Choose language and style
- Upload a `.wav` file (5–30 s of clean speech)
- Click Generate Audio → play or download the `.wav`
import requests

with open("my_voice.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/clone",
        params={
            "text": "नमस्ते, मैं आपकी कैसे मदद कर सकता हूँ?",
            "lang": "hindi",
            "style": "calm",
            "speed": 1.0,
        },
        files={"speaker_wav": f},
    )

with open("output.wav", "wb") as out:
    out.write(response.content)
print("Saved output.wav")

curl -X POST "http://localhost:8000/clone?text=Hello+world&lang=english&style=default" \
  -F "speaker_wav=@my_voice.wav" \
  -o cloned_output.wav

First clone request: XTTS v2 weights (~1.8 GB) are downloaded from HuggingFace and cached in `models/` automatically. All subsequent requests are fully offline.
import requests

response = requests.post(
    "http://localhost:8000/synthesize",
    json={
        "text": "ನಮಸ್ಕಾರ, ನೀವು ಹೇಗಿದ್ದೀರಿ?",
        "voice": "kn_female",
        "style": "happy",
        "speed": 1.0,
        "pitch": 1.0,
        "energy": 1.0,
    },
)

with open("output.wav", "wb") as f:
    f.write(response.content)

Or use the GET convenience endpoint in a browser:
http://localhost:8000/synthesize/get?text=Hello&voice=en_female&style=calm
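Because every `/synthesize/get` parameter travels in the query string, non-ASCII text must be percent-encoded. A small illustrative helper (not part of the repo) for building such URLs:

```python
from urllib.parse import urlencode

def synthesize_get_url(text, voice, style="default", speed=1.0,
                       base="http://localhost:8000"):
    """Build a /synthesize/get URL, percent-encoding the text."""
    query = urlencode({"text": text, "voice": voice,
                       "style": style, "speed": speed})
    return f"{base}/synthesize/get?{query}"

url = synthesize_get_url("Hello", voice="en_female", style="calm")
# Devanagari (or any non-ASCII) text is percent-encoded automatically:
hindi_url = synthesize_get_url("नमस्ते", voice="hi_female")
```

`urlencode` handles the UTF-8 percent-encoding that a hand-built URL would get wrong.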
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Welcome message |
| `GET` | `/health` | Server + engine status |
| `GET` | `/voices` | All voices with download/load status |
| `GET` | `/styles` | Style presets and parameter descriptions |
| `GET` | `/languages` | Supported language codes |
| `POST` | `/synthesize` | Synthesise text → WAV (JSON body) |
| `GET` | `/synthesize/get` | Synthesise text → WAV (query params) |
| `POST` | `/synthesize/stream` | Streaming WAV response |
| `POST` | `/clone` | Voice cloning via XTTS v2 (multipart) |
| `GET`/`POST` | `/Get_Inference` | Hackathon-spec endpoint (clones when possible) |
| `POST` | `/preload` | Load a voice into memory |
| `POST` | `/unload` | Unload a voice from memory |
| `POST` | `/batch` | Batch synthesise multiple texts |
Full interactive docs: http://localhost:8000/docs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | string (query) | ✅ | Text to synthesise |
| `lang` | string (query) | ✅ | english, hindi, bengali, gujarati, marathi, telugu, kannada |
| `speaker_wav` | file (form) | ✅ | Reference audio — WAV or MP3, 5–30 s recommended |
| `style` | string (query) | ❌ | default, calm, happy, sad, slow, fast, soft, loud, excited |
| `speed` | float (query) | ❌ | 0.5 – 2.0 (default 1.0) |
| `pitch` | float (query) | ❌ | 0.5 – 2.0 (default 1.0) |
| `energy` | float (query) | ❌ | 0.5 – 2.0 (default 1.0) |
Response: audio/wav · Headers include X-Duration, X-Sample-Rate, X-Inference-Time
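All three prosody parameters share the documented 0.5–2.0 range, so a client can clamp user input before sending. A hedged sketch (the server may also validate; this helper is illustrative, not repo code):

```python
def clamp_prosody(value, lo=0.5, hi=2.0):
    """Clamp a speed/pitch/energy value into the documented 0.5–2.0 range."""
    return max(lo, min(hi, float(value)))

params = {k: clamp_prosody(v)
          for k, v in {"speed": 3.0, "pitch": 0.1, "energy": 1.2}.items()}
# → {"speed": 2.0, "pitch": 0.5, "energy": 1.2}
```

Clamping client-side gives friendlier behaviour than surfacing a validation error from the API.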
{
"text": "নমস্কার, আপনি কেমন আছেন?",
"voice": "bn_female",
"speed": 1.0,
"pitch": 1.0,
"energy": 1.0,
"style": "calm",
"normalize": true
}

# With voice cloning (XTTS-supported language).
# Note: curl's -G forces GET and conflicts with -F (multipart POST);
# --url-query (curl ≥ 7.87) percent-encodes the text while leaving -F intact.
curl -X POST "http://localhost:8000/Get_Inference" \
  --url-query "text=नमस्ते" \
  --url-query "lang=hindi" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav

# Non-XTTS language (uses pre-trained VITS, speaker_wav accepted but not used)
curl -X POST "http://localhost:8000/Get_Inference" \
  --url-query "text=का बा?" \
  --url-query "lang=bhojpuri" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav

| Language | Voice Keys | Clone Support | Notes |
|---|---|---|---|
| Hindi | `hi_male`, `hi_female` | ✅ XTTS | SYSPIN VITS JIT |
| Bengali | `bn_male`, `bn_female` | ✅ XTTS | SYSPIN VITS JIT |
| Marathi | `mr_male`, `mr_female` | ✅ XTTS | SYSPIN VITS JIT |
| Telugu | `te_male`, `te_female` | ✅ XTTS | SYSPIN VITS JIT |
| Kannada | `kn_male`, `kn_female` | ✅ XTTS | SYSPIN VITS JIT |
| English | `en_male`, `en_female` | ✅ XTTS | text must be lowercase |
| Gujarati | `gu_mms` | ✅ XTTS | Facebook MMS (auto-downloads) |
| Bhojpuri | `bho_male`, `bho_female` | ❌ (VITS only) | Coqui `.pth` checkpoint |
| Chhattisgarhi | `hne_male`, `hne_female` | ❌ (VITS only) | SYSPIN VITS JIT |
| Maithili | `mai_male`, `mai_female` | ❌ (VITS only) | SYSPIN VITS JIT |
| Magahi | `mag_male`, `mag_female` | ❌ (VITS only) | SYSPIN VITS JIT |
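A client that mixes cloning and standard synthesis needs to know which path a language supports. One way to encode the table above (illustrative, derived from this README rather than from the codebase):

```python
# Languages with XTTS v2 clone support, per the language table above.
CLONE_LANGS = {"english", "hindi", "bengali", "gujarati",
               "marathi", "telugu", "kannada"}

def endpoint_for(lang: str) -> str:
    """Pick /clone when the language supports XTTS cloning, else /synthesize."""
    return "/clone" if lang.lower() in CLONE_LANGS else "/synthesize"
```

This mirrors what `/Get_Inference` does server-side: clone when possible, fall back to pre-trained VITS otherwise.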
# List all voices and download status
python -m src.cli list
# Download a specific voice
python -m src.cli download --voice hi_male
# Download all voices for a language
python -m src.cli download --lang bn
# Download everything
python -m src.cli download --all
# Synthesise from the command line
python -m src.cli synthesize \
--text "नमस्ते दोस्तों" \
--voice hi_female \
--output hello.wav
# Start API server (equivalent to start_api.py)
python -m src.cli serve --port 8000 --reload

VoiceAPI/
├── src/
│ ├── api.py # FastAPI REST server (local, no cloud deps)
│ ├── engine.py # Unified TTS inference engine
│ ├── tokenizer.py # Indic script tokenisation (VITS-compatible)
│ ├── config.py # Language / voice / style configurations
│ ├── downloader.py # HuggingFace model downloader
│ └── cli.py # Command-line interface
│
├── models/ # All model weights (local)
│ ├── hi_female/ # hi_female_vits_30hrs.pt + chars.txt
│ ├── bn_female/ # bn_female_vits_30hrs.pt + chars.txt
│ ├── bho_female/ # checkpoint_340000.pth + config.json
│ ├── gu_mms/ # Facebook MMS tokeniser (weights auto-downloaded)
│ ├── ... # (all 20 SYSPIN voices)
│ └── tts_models--multilingual--multi-dataset--xtts_v2/
│ # XTTS v2 weights (auto-downloaded on first clone)
│
├── web/ # Next.js frontend
│ ├── app/
│ │ ├── page.js # Main UI (clone + standard synthesis modes)
│ │ ├── layout.js
│ │ └── globals.css
│ ├── .env.local # NEXT_PUBLIC_API_BASE=http://localhost:8000
│ └── package.json
│
├── local_tests/ # Integration test suite
├── training/ # Training scripts (VITS fine-tuning)
│
├── start.sh # One-command launcher (API + Web UI)
├── start_api.py # API-only launcher with CLI flags
└── requirements.txt # Python dependencies
┌─────────────────────────────────────────────────────────────┐
│ Next.js Web UI (:3000) │
│ Clone Mode ──► POST /clone (XTTS v2 voice clone) │
│ Standard ──► POST /synthesize (VITS / Coqui / MMS) │
└──────────────────────────┬──────────────────────────────────┘
│ HTTP (localhost)
┌──────────────────────────▼──────────────────────────────────┐
│ FastAPI Server (:8000) │
│ src/api.py │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────┐
│ TTSEngine (src/engine.py) │
│ │
│ JIT .pt models Coqui .pth models Facebook MMS │
│ (19 SYSPIN (Bhojpuri via (Gujarati via │
│ voices) TTS.Synthesizer) transformers) │
│ │
│ XTTS v2 ──────────────────────────────────────────────► │
│ (voice cloning — weights cached in models/ after 1st use) │
└──────────────────────────────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────┐
│ models/ (local disk) │
│ SYSPIN VITS .pt · Bhojpuri .pth · MMS config │
│ XTTS v2 weights (~1.8 GB, downloaded once) │
└──────────────────────────────────────────────────────────────┘
| Type | Format | Loader | Languages |
|---|---|---|---|
| SYSPIN VITS JIT | `.pt` + `chars.txt` | `torch.jit.load` | Hindi, Bengali, Marathi, Telugu, Kannada, English, Chhattisgarhi, Maithili, Magahi |
| Coqui Checkpoint | `.pth` + `config.json` | `TTS.Synthesizer` | Bhojpuri |
| Facebook MMS | HF `VitsModel` | `transformers` | Gujarati |
| XTTS v2 | HF cached weights | `TTS.api.TTS` | Voice cloning (EN, HI, BN, GU, MR, TE, KN) |
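The engine's format dispatch can be pictured as a mapping from voice key to loader family. This is a sketch inferred from the table above, not the actual `src/engine.py` code:

```python
# Voice-key prefix → loader family, mirroring the model-types table above.
JIT_PREFIXES = ("hi_", "bn_", "mr_", "te_", "kn_", "en_",
                "hne_", "mai_", "mag_")

def loader_for(voice_key: str) -> str:
    """Return the loader family a given voice key would use."""
    if voice_key.startswith(JIT_PREFIXES):
        return "torch.jit.load"          # SYSPIN VITS JIT (.pt + chars.txt)
    if voice_key.startswith("bho_"):
        return "TTS.Synthesizer"         # Coqui checkpoint (.pth + config.json)
    if voice_key == "gu_mms":
        return "transformers VitsModel"  # Facebook MMS
    raise ValueError(f"unknown voice key: {voice_key}")
```

XTTS v2 sits alongside these: it is loaded on demand for `/clone` rather than selected by voice key.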
# Via environment variable
API_PORT=8001 WEB_PORT=3001 ./start.sh
# Or directly
python start_api.py --port 8001

Then update `web/.env.local`:
NEXT_PUBLIC_API_BASE=http://localhost:8001
python start_api.py --preload hi_female en_female bn_female

Install the CUDA PyTorch wheel and the engine will auto-detect the GPU:
pip install torch --index-url https://download.pytorch.org/whl/cu121
python start_api.py # device: cuda

# Make sure the API is running first
python start_api.py &
# Run the local integration test suite
cd local_tests
python test_local_api.py
# Test voice cloning across all supported languages
python test_live_clone_all_languages.py

| Metric | Value |
|---|---|
| Languages | 11 Indian languages |
| Voice variants | 21 (male + female) |
| Inference time | 0.3–0.9 s per utterance (CPU) |
| Sample rate | 22 050 Hz (VITS), 16 000 Hz (MMS), 24 000 Hz (XTTS) |
| XTTS first load | ~15–30 s (subsequent: ~5 s) |
| XTTS model size | ~1.8 GB (downloaded once, cached in models/) |
| SYSPIN model size | ~320 MB per voice |
| Package | Purpose |
|---|---|
| `torch` | Neural network inference |
| `TTS` | Coqui TTS — Bhojpuri checkpoints + XTTS v2 voice cloning |
| `transformers` | Facebook MMS Gujarati model |
| `huggingface-hub` | Model snapshot downloads |
| `soundfile` | WAV I/O |
| `librosa` | Pitch shift + time stretch |
| `fastapi` + `uvicorn` | REST API server |
| `next` (Node) | Web UI |
- SYSPIN — VITS model weights for 10 Indian languages
- Meta AI — MMS multilingual speech model (Gujarati)
- Coqui TTS — XTTS v2 multilingual voice cloning
- OpenSLR / Common Voice / IndicTTS — Training datasets
- Code: MIT License
- SYSPIN Models: CC BY 4.0
- MMS Models: CC BY-NC 4.0
- XTTS v2: Coqui Public Model License
Built by Team VoiceAPI — CHARUSAT University
- Harshil Patel
- Harnish Patel
- Aman Paya