A fully local multi-lingual Text-to-Speech system supporting 11 Indian languages with 21 voice variants and real-time voice cloning — no cloud APIs, no API keys, no internet connection required after first setup.
All inference runs on your machine using model weights stored in `models/`.
Voice cloning uses Coqui XTTS v2 (downloaded once, cached locally).
| Feature | Detail |
|---|---|
| 🌏 11 Languages | Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati |
| 🎤 21 Voice Variants | Male & female SYSPIN VITS voices for 10 languages, plus a Gujarati MMS voice |
| 🧬 Voice Cloning | Upload any 5–30 s WAV → synthesise in that voice via XTTS v2 |
| 🎭 Prosody Control | Speed · Pitch · Energy sliders + 9 style presets |
| ⚡ Fast Inference | 0.3–0.9 s per utterance on CPU |
| 🖥️ Web UI | Next.js frontend — language picker, clone mode, audio playback + WAV download |
| 🔌 REST API | FastAPI with auto-generated /docs (Swagger UI) |
| 📴 Fully Offline | After first model download everything runs without internet |
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI
# Create a virtual environment (recommended)
python3 -m venv tts
source tts/bin/activate # Windows: tts\Scripts\activate
pip install -r requirements.txt

GPU users: swap the `torch` line in `requirements.txt` for the CUDA wheel from pytorch.org before installing.
chmod +x start.sh
./start.sh

This script will:
- Check Python + Node dependencies (install missing ones automatically)
- Start the FastAPI backend on http://localhost:8000
- Wait for the API to be healthy
- Start the Next.js web UI on http://localhost:3000
- Print a summary and keep both processes alive (Ctrl+C stops both)
| Service | URL |
|---|---|
| Web UI | http://localhost:3000 |
| API | http://localhost:8000 |
| API Docs (Swagger) | http://localhost:8000/docs |
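The launcher's "wait for the API to be healthy" step can be sketched as a small poll loop. This is a hypothetical helper, not code from the repo; it only assumes the `/health` endpoint documented below:

```python
import time
import urllib.request

def wait_for_healthy(url="http://localhost:8000/health",
                     timeout=60.0, interval=1.0, probe=None):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse.

    `probe` is injectable for testing; by default it issues a real GET.
    Returns True once healthy, False on timeout.
    """
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5) as resp:
                    return resp.status == 200
            except OSError:
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

`start.sh` implements the same idea in shell; the Python version is handy when driving the API from test scripts.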
API server only:
python start_api.py # default: 0.0.0.0:8000
python start_api.py --port 8001 # custom port
python start_api.py --reload # hot-reload (dev mode)
python start_api.py --preload hi_female # preload a voice at startup

Web UI only (assumes API is already running):
cd web
npm install # first time only
npm run dev

Voice cloning works for: English, Hindi, Bengali, Gujarati, Marathi, Telugu, Kannada.
- Open http://localhost:3000
- Select "Custom Voice Clone" mode
- Choose language and style
- Upload a `.wav` file (5–30 s of clean speech)
- Click Generate Audio → play or download the `.wav`
import requests

with open("my_voice.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/clone",
        params={
            "text": "नमस्ते, मैं आपकी कैसे मदद कर सकता हूँ?",
            "lang": "hindi",
            "style": "calm",
            "speed": 1.0,
        },
        files={"speaker_wav": f},
    )

with open("output.wav", "wb") as out:
    out.write(response.content)
print("Saved output.wav")

curl -X POST "http://localhost:8000/clone?text=Hello+world&lang=english&style=default" \
  -F "speaker_wav=@my_voice.wav" \
  -o cloned_output.wav

First clone request: XTTS v2 weights (~1.8 GB) are downloaded from HuggingFace and cached in `models/` automatically. All subsequent requests are fully offline.
import requests

response = requests.post(
    "http://localhost:8000/synthesize",
    json={
        "text": "ನಮಸ್ಕಾರ, ನೀವು ಹೇಗಿದ್ದೀರಿ?",
        "voice": "kn_female",
        "style": "happy",
        "speed": 1.0,
        "pitch": 1.0,
        "energy": 1.0,
    },
)

with open("output.wav", "wb") as f:
    f.write(response.content)

Or use the GET convenience endpoint in a browser:
http://localhost:8000/synthesize/get?text=Hello&voice=en_female&style=calm
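Because every `/synthesize/get` parameter travels in the query string, non-ASCII text must be percent-encoded. A small illustrative helper (not part of the repo) for building such URLs:

```python
from urllib.parse import urlencode

def synthesize_get_url(text, voice, style="default", speed=1.0,
                       base="http://localhost:8000"):
    """Build a /synthesize/get URL, percent-encoding the text."""
    query = urlencode({"text": text, "voice": voice,
                       "style": style, "speed": speed})
    return f"{base}/synthesize/get?{query}"

url = synthesize_get_url("Hello", voice="en_female", style="calm")
# Devanagari (or any non-ASCII) text is percent-encoded automatically:
hindi_url = synthesize_get_url("नमस्ते", voice="hi_female")
```

`urlencode` handles the UTF-8 percent-encoding that a hand-built URL would get wrong.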
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Welcome message |
| `GET` | `/health` | Server + engine status |
| `GET` | `/voices` | All voices with download/load status |
| `GET` | `/styles` | Style presets and parameter descriptions |
| `GET` | `/languages` | Supported language codes |
| `POST` | `/synthesize` | Synthesise text → WAV (JSON body) |
| `GET` | `/synthesize/get` | Synthesise text → WAV (query params) |
| `POST` | `/synthesize/stream` | Streaming WAV response |
| `POST` | `/clone` | Voice cloning via XTTS v2 (multipart) |
| `GET`/`POST` | `/Get_Inference` | Hackathon-spec endpoint (clones when possible) |
| `POST` | `/preload` | Load a voice into memory |
| `POST` | `/unload` | Unload a voice from memory |
| `POST` | `/batch` | Batch synthesise multiple texts |
Full interactive docs: http://localhost:8000/docs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | string (query) | ✅ | Text to synthesise |
| `lang` | string (query) | ✅ | english, hindi, bengali, gujarati, marathi, telugu, kannada |
| `speaker_wav` | file (form) | ✅ | Reference audio — WAV or MP3, 5–30 s recommended |
| `style` | string (query) | ❌ | default, calm, happy, sad, slow, fast, soft, loud, excited |
| `speed` | float (query) | ❌ | 0.5 – 2.0 (default 1.0) |
| `pitch` | float (query) | ❌ | 0.5 – 2.0 (default 1.0) |
| `energy` | float (query) | ❌ | 0.5 – 2.0 (default 1.0) |
Response: audio/wav · Headers include X-Duration, X-Sample-Rate, X-Inference-Time
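All three prosody parameters share the documented 0.5–2.0 range, so a client can clamp user input before sending. A hedged sketch (the server may also validate; this helper is illustrative, not repo code):

```python
def clamp_prosody(value, lo=0.5, hi=2.0):
    """Clamp a speed/pitch/energy value into the documented 0.5–2.0 range."""
    return max(lo, min(hi, float(value)))

params = {k: clamp_prosody(v)
          for k, v in {"speed": 3.0, "pitch": 0.1, "energy": 1.2}.items()}
# → {"speed": 2.0, "pitch": 0.5, "energy": 1.2}
```

Clamping client-side gives friendlier behaviour than surfacing a validation error from the API.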
{
"text": "নমস্কার, আপনি কেমন আছেন?",
"voice": "bn_female",
"speed": 1.0,
"pitch": 1.0,
"energy": 1.0,
"style": "calm",
"normalize": true
}

# With voice cloning (XTTS-supported language).
# Note: curl's -G forces GET and conflicts with -F (multipart POST);
# --url-query (curl ≥ 7.87) percent-encodes the text while leaving -F intact.
curl -X POST "http://localhost:8000/Get_Inference" \
  --url-query "text=नमस्ते" \
  --url-query "lang=hindi" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav

# Non-XTTS language (uses pre-trained VITS, speaker_wav accepted but not used)
curl -X POST "http://localhost:8000/Get_Inference" \
  --url-query "text=का बा?" \
  --url-query "lang=bhojpuri" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav

| Language | Voice Keys | Clone Support | Notes |
|---|---|---|---|
| Hindi | `hi_male`, `hi_female` | ✅ XTTS | SYSPIN VITS JIT |
| Bengali | `bn_male`, `bn_female` | ✅ XTTS | SYSPIN VITS JIT |
| Marathi | `mr_male`, `mr_female` | ✅ XTTS | SYSPIN VITS JIT |
| Telugu | `te_male`, `te_female` | ✅ XTTS | SYSPIN VITS JIT |
| Kannada | `kn_male`, `kn_female` | ✅ XTTS | SYSPIN VITS JIT |
| English | `en_male`, `en_female` | ✅ XTTS | text must be lowercase |
| Gujarati | `gu_mms` | ✅ XTTS | Facebook MMS (auto-downloads) |
| Bhojpuri | `bho_male`, `bho_female` | ❌ (VITS only) | Coqui `.pth` checkpoint |
| Chhattisgarhi | `hne_male`, `hne_female` | ❌ (VITS only) | SYSPIN VITS JIT |
| Maithili | `mai_male`, `mai_female` | ❌ (VITS only) | SYSPIN VITS JIT |
| Magahi | `mag_male`, `mag_female` | ❌ (VITS only) | SYSPIN VITS JIT |
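A client that mixes cloning and standard synthesis needs to know which path a language supports. One way to encode the table above (illustrative, derived from this README rather than from the codebase):

```python
# Languages with XTTS v2 clone support, per the language table above.
CLONE_LANGS = {"english", "hindi", "bengali", "gujarati",
               "marathi", "telugu", "kannada"}

def endpoint_for(lang: str) -> str:
    """Pick /clone when the language supports XTTS cloning, else /synthesize."""
    return "/clone" if lang.lower() in CLONE_LANGS else "/synthesize"
```

This mirrors what `/Get_Inference` does server-side: clone when possible, fall back to pre-trained VITS otherwise.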
# List all voices and download status
python -m src.cli list
# Download a specific voice
python -m src.cli download --voice hi_male
# Download all voices for a language
python -m src.cli download --lang bn
# Download everything
python -m src.cli download --all
# Synthesise from the command line
python -m src.cli synthesize \
--text "नमस्ते दोस्तों" \
--voice hi_female \
--output hello.wav
# Start API server (equivalent to start_api.py)
python -m src.cli serve --port 8000 --reload

VoiceAPI/
├── src/
│ ├── api.py # FastAPI REST server (local, no cloud deps)
│ ├── engine.py # Unified TTS inference engine
│ ├── tokenizer.py # Indic script tokenisation (VITS-compatible)
│ ├── config.py # Language / voice / style configurations
│ ├── downloader.py # HuggingFace model downloader
│ └── cli.py # Command-line interface
│
├── models/ # All model weights (local)
│ ├── hi_female/ # hi_female_vits_30hrs.pt + chars.txt
│ ├── bn_female/ # bn_female_vits_30hrs.pt + chars.txt
│ ├── bho_female/ # checkpoint_340000.pth + config.json
│ ├── gu_mms/ # Facebook MMS tokeniser (weights auto-downloaded)
│ ├── ... # (all 20 SYSPIN voices)
│ └── tts_models--multilingual--multi-dataset--xtts_v2/
│ # XTTS v2 weights (auto-downloaded on first clone)
│
├── web/ # Next.js frontend
│ ├── app/
│ │ ├── page.js # Main UI (clone + standard synthesis modes)
│ │ ├── layout.js
│ │ └── globals.css
│ ├── .env.local # NEXT_PUBLIC_API_BASE=http://localhost:8000
│ └── package.json
│
├── local_tests/ # Integration test suite
├── training/ # Training scripts (VITS fine-tuning)
│
├── start.sh # One-command launcher (API + Web UI)
├── start_api.py # API-only launcher with CLI flags
└── requirements.txt # Python dependencies
┌─────────────────────────────────────────────────────────────┐
│ Next.js Web UI (:3000) │
│ Clone Mode ──► POST /clone (XTTS v2 voice clone) │
│ Standard ──► POST /synthesize (VITS / Coqui / MMS) │
└──────────────────────────┬──────────────────────────────────┘
│ HTTP (localhost)
┌──────────────────────────▼──────────────────────────────────┐
│ FastAPI Server (:8000) │
│ src/api.py │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────┐
│ TTSEngine (src/engine.py) │
│ │
│ JIT .pt models Coqui .pth models Facebook MMS │
│ (19 SYSPIN (Bhojpuri via (Gujarati via │
│ voices) TTS.Synthesizer) transformers) │
│ │
│ XTTS v2 ──────────────────────────────────────────────► │
│ (voice cloning — weights cached in models/ after 1st use) │
└──────────────────────────────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────┐
│ models/ (local disk) │
│ SYSPIN VITS .pt · Bhojpuri .pth · MMS config │
│ XTTS v2 weights (~1.8 GB, downloaded once) │
└──────────────────────────────────────────────────────────────┘
| Type | Format | Loader | Languages |
|---|---|---|---|
| SYSPIN VITS JIT | `.pt` + `chars.txt` | `torch.jit.load` | Hindi, Bengali, Marathi, Telugu, Kannada, English, Chhattisgarhi, Maithili, Magahi |
| Coqui Checkpoint | `.pth` + `config.json` | `TTS.Synthesizer` | Bhojpuri |
| Facebook MMS | HF `VitsModel` | `transformers` | Gujarati |
| XTTS v2 | HF cached weights | `TTS.api.TTS` | Voice cloning (EN, HI, BN, GU, MR, TE, KN) |
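The engine's format dispatch can be pictured as a mapping from voice key to loader family. This is a sketch inferred from the table above, not the actual `src/engine.py` code:

```python
# Voice-key prefix → loader family, mirroring the model-types table above.
JIT_PREFIXES = ("hi_", "bn_", "mr_", "te_", "kn_", "en_",
                "hne_", "mai_", "mag_")

def loader_for(voice_key: str) -> str:
    """Return the loader family a given voice key would use."""
    if voice_key.startswith(JIT_PREFIXES):
        return "torch.jit.load"          # SYSPIN VITS JIT (.pt + chars.txt)
    if voice_key.startswith("bho_"):
        return "TTS.Synthesizer"         # Coqui checkpoint (.pth + config.json)
    if voice_key == "gu_mms":
        return "transformers VitsModel"  # Facebook MMS
    raise ValueError(f"unknown voice key: {voice_key}")
```

XTTS v2 sits alongside these: it is loaded on demand for `/clone` rather than selected by voice key.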
# Via environment variable
API_PORT=8001 WEB_PORT=3001 ./start.sh
# Or directly
python start_api.py --port 8001

Then update `web/.env.local`:
NEXT_PUBLIC_API_BASE=http://localhost:8001
python start_api.py --preload hi_female en_female bn_female

Install the CUDA PyTorch wheel and the engine will auto-detect the GPU:
pip install torch --index-url https://download.pytorch.org/whl/cu121
python start_api.py # device: cuda

# Make sure the API is running first
python start_api.py &
# Run the local integration test suite
cd local_tests
python test_local_api.py
# Test voice cloning across all supported languages
python test_live_clone_all_languages.py

| Metric | Value |
|---|---|
| Languages | 11 Indian languages |
| Voice variants | 21 (male + female) |
| Inference time | 0.3–0.9 s per utterance (CPU) |
| Sample rate | 22 050 Hz (VITS), 16 000 Hz (MMS), 24 000 Hz (XTTS) |
| XTTS first load | ~15–30 s (subsequent: ~5 s) |
| XTTS model size | ~1.8 GB (downloaded once, cached in models/) |
| SYSPIN model size | ~320 MB per voice |
| Package | Purpose |
|---|---|
| `torch` | Neural network inference |
| `TTS` | Coqui TTS — Bhojpuri checkpoints + XTTS v2 voice cloning |
| `transformers` | Facebook MMS Gujarati model |
| `huggingface-hub` | Model snapshot downloads |
| `soundfile` | WAV I/O |
| `librosa` | Pitch shift + time stretch |
| `fastapi` + `uvicorn` | REST API server |
| `next` (Node) | Web UI |
- SYSPIN — VITS model weights for 10 Indian languages
- Meta AI — MMS multilingual speech model (Gujarati)
- Coqui TTS — XTTS v2 multilingual voice cloning
- OpenSLR / Common Voice / IndicTTS — Training datasets
- Code: MIT License
- SYSPIN Models: CC BY 4.0
- MMS Models: CC BY-NC 4.0
- XTTS v2: Coqui Public Model License
Built by Team VoiceAPI — CHARUSAT University
- Harshil Patel
- Harnish Patel
- Aman Paya