Language Lexical Similarity
How similar is Spanish to Portuguese? Is Japanese closer to Korean or Chinese? This tool answers those questions using real subtitle data from 20 languages. Each score estimates how much vocabulary two languages share, as seen through their English translations. Toggle between the heatmap and the graph view to explore.
How This Works
Where the data comes from
The scores are calculated from the OpenSubtitles v2024 dataset, a large collection of subtitle translations from films and TV. We sampled 500,000 sentence pairs per language, covering 20 languages and 190 unique language pairs.
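The figure of 190 is the number of unordered pairs that can be formed from 20 languages. A quick check, using placeholder labels because the tool's language list is not given in this section:

```python
from itertools import combinations
from math import comb

# Placeholder labels (assumption): the actual 20 languages are not listed here.
languages = [f"lang{i:02d}" for i in range(20)]

# Unordered pairs: (a, b) and (b, a) count once, and no language pairs with itself.
pairs = list(combinations(languages, 2))

print(len(pairs))   # 190
print(comb(20, 2))  # 190, the same count from the closed form n * (n - 1) / 2
```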
How the scores are calculated
For every word in English, we look at which foreign words appear as its translation in each language. That gives each language a vocabulary fingerprint. The score for a pair of languages is the cosine similarity between their two fingerprints. The closer a score is to 1, the more alike the two languages look when translated through English.
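As a rough illustration of the last step only, here is a minimal cosine similarity sketch in Python. How the tool turns translations into a shared feature space is not described here, so the feature keys ("f1", "f2", ...) and weights below are invented placeholders, and the function name is hypothetical.

```python
import math
from typing import Mapping

def cosine_similarity(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a fingerprint is empty
    return dot / (norm_a * norm_b)

# Hypothetical fingerprints with made-up features and weights.
fingerprint_x = {"f1": 3.0, "f2": 4.0}
fingerprint_y = {"f1": 3.0, "f2": 4.0}   # identical to x
fingerprint_z = {"f3": 2.0}              # shares no features with x

same = cosine_similarity(fingerprint_x, fingerprint_y)
disjoint = cosine_similarity(fingerprint_x, fingerprint_z)
print(same, disjoint)
```

Identical fingerprints score 1 and fingerprints with no features in common score 0; real language pairs fall somewhere in between.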
How reliable are the scores?
Each score comes with a 95% bootstrap confidence interval, a margin of error calculated by re-running the measurement 1,000 times on random resamples of the data. The typical margin is ±0.015, so the overall rankings and clusters you see are stable; only pairs whose scores differ by less than the margin should be treated as roughly tied. Hover any cell to see the exact range.
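The resampling idea can be sketched as a percentile bootstrap. This is an assumption about the variant: the text does not say which bootstrap method the tool uses. In the tool, the measurement re-run on each resample would be the similarity score itself; the sketch below uses a plain mean over made-up numbers as a stand-in, and the function name is hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(data, statistic, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n items with replacement, then re-run the measurement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper

# Toy data: 500 made-up values, not real subtitle statistics.
toy_rng = random.Random(42)
samples = [toy_rng.uniform(0.4, 0.8) for _ in range(500)]

point = mean(samples)
lower, upper = bootstrap_ci(samples, mean)
print(f"{point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

The interval is read off the sorted resampled estimates: the 2.5th percentile gives the lower bound and the 97.5th percentile the upper bound.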
Reading the heatmap
Languages are ordered by default using hierarchical clustering, so languages that score highly with each other end up next to each other. Switch to Family sort to group them by linguistic family instead. Darker purple means higher similarity; lighter means lower.
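To show how clustering can produce a row order, here is a small agglomerative sketch in pure Python. It assumes average linkage, which is a choice made for this example; the linkage the tool actually uses is not stated. The labels A to D and their scores are invented.

```python
def cluster_order(labels, sim):
    """Order labels by repeatedly merging the two most similar clusters."""
    # Each cluster is a list of label indices, kept in display order.
    clusters = [[i] for i in range(len(labels))]

    def avg_similarity(a, b):
        # Average linkage: mean score over all cross-cluster pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        candidates = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        x, y = max(
            candidates,
            key=lambda p: avg_similarity(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return [labels[i] for i in clusters[0]]

# Made-up symmetric similarity matrix for four placeholder languages.
labels = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.2],
    [0.1, 0.8, 0.2, 1.0],
]

order = cluster_order(labels, sim)
print(order)  # ['A', 'C', 'B', 'D']
```

A and C merge first (0.9), then B and D (0.8), so each high-scoring pair sits side by side in the final order.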
Interacting with the Tool
- Hover a heatmap cell to see the score, confidence interval, and a short note on what the result means.
- Search a language name to highlight its full row and column.
- Drag the threshold slider to grey out cells below the selected similarity.
- Switch to Graph view for a force-directed layout; click any node to see its top-5 neighbours.
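The threshold and top-5 behaviours described in the list above come down to a filter and a sort over pair scores. A minimal sketch, with hypothetical function names and invented scores for placeholder languages A to D:

```python
def pairs_below_threshold(scores, threshold):
    """Pairs that would be greyed out at the given threshold."""
    return {pair for pair, score in scores.items() if score < threshold}

def top_neighbours(scores, language, k=5):
    """The k highest-scoring partners of a language, best first."""
    neighbours = []
    for (a, b), score in scores.items():
        if language == a:
            neighbours.append((b, score))
        elif language == b:
            neighbours.append((a, score))
    neighbours.sort(key=lambda item: item[1], reverse=True)
    return neighbours[:k]

# Made-up scores keyed by unordered language pair.
scores = {
    ("A", "B"): 0.2,
    ("A", "C"): 0.9,
    ("A", "D"): 0.1,
    ("B", "C"): 0.3,
    ("B", "D"): 0.8,
    ("C", "D"): 0.2,
}

greyed = pairs_below_threshold(scores, 0.25)
print(sorted(greyed))                    # [('A', 'B'), ('A', 'D'), ('C', 'D')]
print(top_neighbours(scores, "A", k=2))  # [('C', 0.9), ('B', 0.2)]
```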
Limitations
These scores measure vocabulary overlap through English translation. The cosine similarity calculation normalises for each language's overall relationship with English, so a language being "far from English" in absolute terms does not push its scores down. What remains is a genuine comparison of how similarly two languages carve up the English lexicon. The main limitation is that English vocabulary maps unevenly across language families: Romance and Germanic languages share many roots with English, which gives their translation vectors denser overlap. Semitic and East Asian languages have sparser mappings, so their scores carry slightly more uncertainty rather than being systematically wrong. Subtitle language is also informal and conversational, which affects which vocabulary gets counted.
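One way to read the normalisation point: cosine similarity depends only on the direction of a vector, not its length, so uniformly scaling a fingerprint leaves the score unchanged. The sketch below checks this on made-up vectors; whether magnitude is exactly what "overall relationship with English" refers to is an assumption of this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Made-up fingerprints, not real data.
u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 0.5]

# Scale u uniformly, as if every one of its weights were ten times larger.
u_scaled = [10.0 * x for x in u]

original = cosine(u, v)
rescaled = cosine(u_scaled, v)
print(math.isclose(original, rescaled))  # True
```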
Learn languages with real content
SubSmith lets you study any of these languages using films, TV and your own media. Generate transcripts, mine sentences and build vocabulary that sticks.