🔤 tewtoken

A Byte Pair Encoding (BPE) tokeniser built from scratch in pure Python — no HuggingFace, no PyTorch, no magic. Trained on real YouTube transcripts in English + Hindi, making it one of the few open-source bilingual BPE tokenisers built entirely from the ground up.

Built as a learning project to deeply understand how tokenisation works under the hood in LLMs like GPT, LLaMA, and Gemini.


📸 Demo

Screenshot coming soon


✨ Features

  • ✅ Built from scratch — zero ML libraries used
  • ✅ Bilingual — trained on English + Hindi text
  • ✅ 8,000 BPE merge rules learned from real YouTube transcripts
  • ✅ Vocabulary of ~7,900 tokens
  • ✅ Importable as a Python package
  • ✅ 13 utility functions (encode, decode, batch encode/decode, truncate, and more)
  • ✅ Interactive CLI demo

📦 Installation

Option 1 — Install directly from GitHub (recommended):

pip install git+https://github.com/tusharinqueue/tewtoken.git

Option 2 — Clone and install locally:

git clone https://github.com/tusharinqueue/tewtoken.git
cd tewtoken
pip install .

🚀 Quick Start

from tewtoken import encode, decode, tokenize, count_tokens

# Encode text to token IDs
ids = encode("machine learning is amazing")
print(ids)
# → [2839, 2700, 2506, 368]

# Decode back to text
text = decode(ids)
print(text)
# → "machine learning is amazing"

# See the actual token strings
tokens = tokenize("machine learning is amazing")
print(tokens)
# → ['machine</w>', 'learning</w>', 'is</w>', 'amazing</w>']

# Count tokens
print(count_tokens("machine learning is amazing"))
# → 4

🌐 Bilingual Support (English + Hindi)

from tewtoken import encode, decode

# Hindi works too!
ids = encode("यह एक परीक्षण है")
print(decode(ids))
# → "यह एक परीक्षण है"

Screenshot coming soon


📚 Full API Reference

Function                      Description
encode(text)                  Convert text → list of token IDs
decode(ids)                   Convert token IDs → text
tokenize(text)                Convert text → list of token strings
count_tokens(text)            Count the number of tokens in text
encode_batch(texts)           Encode a list of texts at once
decode_batch(ids_list)        Decode a list of token-ID lists at once
truncate(text, max_tokens)    Truncate text to a max token count
vocab_size()                  Return the total vocabulary size
get_vocab()                   Return the full vocabulary as a dict
get_token_id(token)           Get the ID for a single token string
get_id_token(id)              Get the token string for a single ID
is_known_token(token)         Check whether a token exists in the vocabulary
encoding_info()               Return a summary of the tokeniser
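One subtlety worth spelling out: truncate(text, max_tokens) cuts at token boundaries, not character boundaries, which is why it composes cleanly with encode and decode. A minimal sketch of that contract, using a hypothetical whitespace tokeniser as a stand-in for the real BPE vocabulary:

```python
# Stand-in tokeniser: whitespace splitting instead of real BPE merges.
def _encode(text):
    return text.split()

def _decode(tokens):
    return " ".join(tokens)

def truncate(text, max_tokens):
    """Keep at most max_tokens tokens, then decode back to text."""
    return _decode(_encode(text)[:max_tokens])

print(truncate("machine learning is amazing", 2))
# → machine learning
```

With the real vocabulary, the same idea applies, but the cut points are presumably subword boundaries rather than whole words.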

🧠 How BPE Works

BPE starts with individual characters and iteratively merges the most frequent adjacent pairs:

Step 0 (characters):   ['m', 'a', 'c', 'h', 'i', 'n', 'e', '</w>']
Step 1 (merge i+n):    ['m', 'a', 'c', 'h', 'in', 'e', '</w>']
Step 2 (merge in+e):   ['m', 'a', 'c', 'h', 'ine', '</w>']
...
Step N:                'machine</w>' is now a single token

After 8,000 merges, common words become single tokens and rare words are split into meaningful subwords. This is the same core algorithm behind the tokenisers of GPT-2, LLaMA, and most modern LLMs (they use byte-level or SentencePiece variants of BPE).
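The training loop is short enough to sketch in full. This is a self-contained toy version of the idea (not the package's actual bpe.py): count adjacent symbol pairs weighted by word frequency, merge the most frequent pair everywhere, and repeat.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, pre-split into characters plus '</w>'.
words = {tuple("the") + ("</w>",): 5, tuple("then") + ("</w>",): 2}
merges = []
for _ in range(3):
    best = get_pair_counts(words).most_common(1)[0][0]
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
# → [('t', 'h'), ('th', 'e'), ('the', '</w>')]
```

Each learned pair corresponds to one merge rule; in this repo the full run of 8,000 such rules lives in data/merges.txt.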


📁 Project Structure

tewtoken/
├── data/
│   ├── raw/              ← raw YouTube transcripts (.txt)
│   ├── corpus.txt        ← cleaned + merged training corpus
│   ├── vocab.json        ← learned vocabulary {token: id}
│   └── merges.txt        ← 8,000 BPE merge rules
├── tewtoken/
│   ├── __init__.py       ← package exports
│   ├── train.py          ← cleans + builds corpus
│   ├── vocab.py          ← character vocab + word frequencies
│   ├── bpe.py            ← BPE training loop
│   └── tokeniser.py      ← encode / decode + all utility functions
├── demo/
│   └── main.py           ← interactive CLI demo
├── setup.py
├── LICENSE
└── README.md

🏃 Run the Demo

python demo/main.py

Screenshot coming soon


📊 Training Details

Property               Value
Training data          YouTube transcripts
Languages              English + Hindi
Total transcripts      ~100 videos
BPE merges             8,000
Final vocab size       ~7,900 tokens
Training time          ~320 s (pure Python)
Library dependencies   None (stdlib only)

🔍 Interesting Observations

  • By merge #3, t + h → 'th' was learned
  • By merge #17, 'the' became a single token
  • By merge #101, स + ् → 'स्' was learned (स is a consonant, ् is the virama); BPE identified the most frequent Hindi character combination on its own
  • Common English words like you, the, is, and all became single tokens early
  • Hindi subwords like है, तो, में emerged naturally from frequency
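That early 't' + 'h' merge is exactly what raw pair frequencies predict. A quick self-contained check on a tiny English sample (illustrative text, not the actual training corpus):

```python
from collections import Counter

# Count adjacent character pairs within each word of a small sample.
sample = "the theory is that the thing then thrives"
pairs = Counter()
for word in sample.split():
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += 1

print(pairs.most_common(1)[0])
# → (('t', 'h'), 7)
```

The ('t', 'h') pair dominates even in a ten-word sample, so it is no surprise that it was among the very first merges learned from the full corpus.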

⚡ Why is it slow?

Pure Python BPE takes ~320 seconds for 8,000 merges. Production tokenisers such as HuggingFace tokenizers (Rust) or Google's SentencePiece (C++) do it in ~2 seconds. The core algorithm is the same; the implementation language and data structures are not. This is intentional: the goal of this project is understanding, not speed.
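The cost is easy to see: a naive trainer rescans the whole corpus on every merge step to recount pairs, so work grows roughly as merges × corpus size. A back-of-envelope illustration with hypothetical sizes (not measurements from this repo):

```python
# Naive BPE training cost: every merge step touches every symbol in the
# corpus to recount adjacent pairs. Sizes below are hypothetical.
corpus_symbols = 2_000_000          # symbols in the pre-tokenised corpus
merges = 8_000                      # merge rules to learn
ops = corpus_symbols * merges       # pair inspections across training
print(f"{ops:,} pair inspections")
# → 16,000,000,000 pair inspections
```

Compiled implementations also update pair counts incrementally instead of rescanning, so their speedup comes from data structures as well as from the language.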


🗺️ Roadmap

  • Add Streamlit web demo
  • Expand corpus (Wikipedia dump for English + Hindi)
  • Scale to 30k+ merges with larger dataset
  • Add special tokens ([PAD], [UNK], [BOS], [EOS])
  • Publish blog post walkthrough

📝 Blog Post

Blog post coming soon


🪪 License

MIT License — see LICENSE for details.


🙋 Author

Tushar — BTech CSE (AI/ML) student building in public.

GitHub Twitter LinkedIn
