We built the full stack for a communication platform that translates voice and video in real time. Voice cloning, AI agents, video file processing. Six clients (web, desktop, iOS, Android, Slack, Teams) all running against one real-time backend.
Group audio and video calls where each person speaks their own language and hears everyone else in theirs. Speech recognition, translation, and synthesis all happen in the same pipeline. We got end-to-end latency under 500ms on production calls with speaker tracking across the full conversation.
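To make the shape of that pipeline concrete, here is a minimal Go sketch of the stage wiring: three channel-connected stages that overlap work, with latency measured from capture onward. The stage functions, segment fields, and the "[fr]" placeholder are illustrative stand-ins, not the production code.

```go
package main

import (
	"fmt"
	"time"
)

// A hypothetical transcript segment flowing through the pipeline.
// Speaker carries the diarization label so identity survives all stages.
type Segment struct {
	Speaker  string
	Text     string
	Captured time.Time // when the source audio was captured
}

// Each stage is a channel-to-channel transform, so stages overlap:
// while TTS synthesizes segment N, ASR is already decoding segment N+1.
func stage(in <-chan Segment, fn func(Segment) Segment) <-chan Segment {
	out := make(chan Segment)
	go func() {
		defer close(out)
		for seg := range in {
			out <- fn(seg)
		}
	}()
	return out
}

func main() {
	source := make(chan Segment)

	// asr -> translate -> synthesize, all hypothetical stand-ins.
	asr := stage(source, func(s Segment) Segment { return s })
	mt := stage(asr, func(s Segment) Segment {
		s.Text = "[fr] " + s.Text // placeholder for real machine translation
		return s
	})
	tts := stage(mt, func(s Segment) Segment { return s })

	go func() {
		source <- Segment{Speaker: "alice", Text: "hello", Captured: time.Now()}
		close(source)
	}()

	// Latency is measured from audio capture to the end of the pipeline.
	for seg := range tts {
		fmt.Printf("%s: %q after %v\n", seg.Speaker, seg.Text, time.Since(seg.Captured))
	}
}
```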
Users record a short voice sample and the system clones it. When someone speaks English and another participant hears it in French, it still sounds like the original person. We built per-user and per-project voice profiles with async processing and automatic cleanup of stale samples.
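A minimal sketch of what a per-user profile store with automatic stale-sample cleanup can look like, assuming a TTL and a background reaper; the types, field names, and cleanup interval here are hypothetical, not the production schema.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// A cloned-voice profile; SampleTakenAt drives stale-sample cleanup.
type VoiceProfile struct {
	UserID        string
	ModelRef      string // hypothetical handle into the TTS provider
	SampleTakenAt time.Time
}

type ProfileStore struct {
	mu       sync.Mutex
	profiles map[string]VoiceProfile
	maxAge   time.Duration
}

func NewProfileStore(maxAge time.Duration) *ProfileStore {
	s := &ProfileStore{profiles: map[string]VoiceProfile{}, maxAge: maxAge}
	go s.reap() // async cleanup runs alongside normal traffic
	return s
}

func (s *ProfileStore) Put(p VoiceProfile) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.profiles[p.UserID] = p
}

func (s *ProfileStore) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.profiles)
}

// reap periodically drops profiles whose source sample is older than
// maxAge, mirroring the automatic cleanup described above.
func (s *ProfileStore) reap() {
	for range time.Tick(time.Minute) {
		s.mu.Lock()
		for id, p := range s.profiles {
			if time.Since(p.SampleTakenAt) > s.maxAge {
				delete(s.profiles, id)
			}
		}
		s.mu.Unlock()
	}
}

func main() {
	store := NewProfileStore(30 * 24 * time.Hour)
	store.Put(VoiceProfile{UserID: "u1", ModelRef: "voice-abc", SampleTakenAt: time.Now()})
	fmt.Println("profiles registered:", store.Count())
}
```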
AI agents that join calls and conversations as regular participants. They listen, respond, and take actions based on configurable personalities and knowledge bases. For complex scenarios we built multi-agent handoff where one agent handles intake and another handles resolution. The system also tracks sentiment and mood across all participants in real time.
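The handoff mechanic is simpler than it sounds: the current agent can return a successor, and the successor owns the conversation from then on. A sketch with hypothetical agent types and a toy routing rule, not the actual agent framework:

```go
package main

import "fmt"

// Agent is the minimal contract for a bot participant.
// Respond returns a reply plus, optionally, the agent to hand off to.
type Agent interface {
	Name() string
	Respond(utterance string) (reply string, handoff Agent)
}

type intakeAgent struct{ resolver Agent }

func (a intakeAgent) Name() string { return "intake" }
func (a intakeAgent) Respond(u string) (string, Agent) {
	// Hypothetical routing rule: once intake has what it needs,
	// it hands the conversation to the resolution agent.
	return "Thanks, routing you now.", a.resolver
}

type resolverAgent struct{}

func (resolverAgent) Name() string { return "resolution" }
func (resolverAgent) Respond(u string) (string, Agent) {
	return "Here is your fix.", nil
}

func main() {
	var current Agent = intakeAgent{resolver: resolverAgent{}}
	for _, utterance := range []string{"my call drops", "still broken"} {
		reply, next := current.Respond(utterance)
		fmt.Printf("[%s] %s\n", current.Name(), reply)
		if next != nil {
			current = next // handoff: the new agent owns the thread
		}
	}
}
```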
Upload a video in any language and get it back translated with the original speakers' cloned voices. The pipeline extracts audio, separates speakers via diarization, translates each track independently, synthesizes speech using the cloned voice, and reassembles the final video. Subtitle generation included. All of it runs on distributed Celery task queues.
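In production this runs as distributed Celery tasks; the sketch below shows the same stage order in Go (the language used for the other examples on this page), with stub functions standing in for the real workers. Every name here is a placeholder.

```go
package main

import "fmt"

// One diarized track: a speaker's isolated audio plus its text.
type Track struct {
	Speaker string
	Text    string
}

// Hypothetical stand-ins for the real stages; each of these would be
// a distributed task in production rather than an in-process call.
func extractAudio(video string) string     { return video + ".wav" }
func diarize(audio string) []Track         { return []Track{{Speaker: "spk0", Text: "hola"}} }
func translate(t Track, lang string) Track { t.Text = "[" + lang + "] " + t.Text; return t }
func synthesize(t Track) string            { return t.Speaker + "-cloned.wav" }

// reassemble stands in for the final mux of video plus new audio tracks.
func reassemble(video string, tracks []string) string { return "translated-" + video }

func main() {
	video := "talk.mp4"
	audio := extractAudio(video)

	// Translate each speaker's track independently, then resynthesize
	// it with that speaker's cloned voice before the final mux.
	var synthesized []string
	for _, track := range diarize(audio) {
		synthesized = append(synthesized, synthesize(translate(track, "en")))
	}
	fmt.Println(reassemble(video, synthesized))
}
```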
Real-time chat with on-demand message translation, reactions, threading, and file sharing. The same feature set runs on web, Electron desktop, iOS, and Android. WebSocket channels handle instant delivery with read receipts and presence indicators. We also built quick chat rooms for throwaway conversations that don't require registration.
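At its core the delivery layer is a hub: join, leave, fan out, report presence. A stripped-down in-memory sketch, with buffered channels standing in for WebSocket connections and a made-up read-receipt format; the real system runs over AnyCable.

```go
package main

import "fmt"

// One connected client; Inbox stands in for its WebSocket send queue.
type Client struct {
	ID    string
	Inbox chan string
}

// Hub fans a message out to every connected client and tracks presence.
type Hub struct {
	clients map[string]*Client
}

func (h *Hub) Join(c *Client)  { h.clients[c.ID] = c }
func (h *Hub) Leave(id string) { delete(h.clients, id) }

// Presence is just the set of currently connected client IDs.
func (h *Hub) Presence() []string {
	ids := make([]string, 0, len(h.clients))
	for id := range h.clients {
		ids = append(ids, id)
	}
	return ids
}

func (h *Hub) Broadcast(msg string) {
	for _, c := range h.clients {
		c.Inbox <- msg
	}
}

func main() {
	hub := &Hub{clients: map[string]*Client{}}
	a := &Client{ID: "a", Inbox: make(chan string, 8)}
	b := &Client{ID: "b", Inbox: make(chan string, 8)}
	hub.Join(a)
	hub.Join(b)

	hub.Broadcast("hello room")
	fmt.Println("online:", hub.Presence())
	fmt.Println("a got:", <-a.Inbox)

	// A read receipt is just another broadcast, tagged with the reader.
	hub.Broadcast("read:a:msg-1")
	<-b.Inbox // b drains the original message first
	fmt.Println("b got:", <-b.Inbox)
}
```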
Scheduled meetings with auto-generated summaries and full transcript export. For large events there is a conference mode with speaker roles, registration with approval workflows, live polling, and recording. Calendar integration pulls from Microsoft and Google so scheduling stays in sync.
The API backend. 107 models, 40+ async job types, authentication, payments, and the WebSocket broadcast layer via AnyCable.
A separate Go service we wrote specifically for the translation pipeline. Handles dictation, conversation translation, and voice resolution at latencies Rails could not reach.
Agent framework, video translation pipeline, speaker diarization, and voice activity detection. Each service is containerized and deploys independently.
Web and desktop apps sharing one component library. The same codebase runs in the browser and as a native desktop application.
Native iOS (Swift) and Android apps with full feature parity. Calls, chat, translation, voice cloning, and agent interactions all work on mobile.
Video and audio infrastructure for multi-participant calls with real-time processing and recording.
In real-time voice translation, anything over half a second breaks the conversation. Rails could not hit the target, so we built a dedicated Go service for the translation pipeline. It was the right call. Pick the language that fits the constraint.
Web, desktop, iOS, Android, Slack, and Teams all needed the same features. We learned fast that the API contract is the actual product. Get it right and the clients are straightforward. Get it wrong and you end up maintaining six applications that slowly drift apart.
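A versioned envelope is one way to keep six clients honest: everything decodes the same outer shape before dispatching on a type tag. The sketch below uses hypothetical field names; the point is the unknown-type branch, because old clients must tolerate new message kinds or the apps drift apart.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A versioned envelope: every client, web to Slack, decodes this shape
// first and dispatches on Type. Field names here are hypothetical.
type Envelope struct {
	Version int             `json:"v"`
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

type ChatMessage struct {
	RoomID string `json:"room_id"`
	Body   string `json:"body"`
	Lang   string `json:"lang"` // source language; clients translate on demand
}

func main() {
	raw := []byte(`{"v":1,"type":"chat.message","payload":{"room_id":"r1","body":"hej","lang":"sv"}}`)

	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		panic(err)
	}

	// Unknown types are ignored, not errors: an old client must survive
	// a new message kind without crashing.
	switch env.Type {
	case "chat.message":
		var msg ChatMessage
		if err := json.Unmarshal(env.Payload, &msg); err != nil {
			panic(err)
		}
		fmt.Printf("room %s: %s (%s)\n", msg.RoomID, msg.Body, msg.Lang)
	default:
		fmt.Println("ignoring unknown type", env.Type)
	}
}
```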
Wiring up a speech model takes a day. Making it work at scale with speaker diarization, voice cloning, failure recovery, and cost monitoring across 70+ languages is where the real engineering lives. The models are interchangeable. The system around them is what matters.
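A concrete example of "the system around them": put every speech model behind one interface and wrap the providers in fallback, so the pipeline only fails when all of them do. The interface and provider types below are illustrative, not the actual abstraction; cost and error metrics would hang off the same seam.

```go
package main

import (
	"errors"
	"fmt"
)

// Recognizer is the seam that makes speech models interchangeable:
// the rest of the system codes against this, not a vendor SDK.
type Recognizer interface {
	Transcribe(audio []byte, lang string) (string, error)
}

// withFallback tries each provider in order and returns the first
// success; the request fails only when every provider does.
type withFallback struct{ providers []Recognizer }

func (w withFallback) Transcribe(audio []byte, lang string) (string, error) {
	var lastErr error
	for _, p := range w.providers {
		text, err := p.Transcribe(audio, lang)
		if err == nil {
			return text, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("all providers failed: %w", lastErr)
}

// Hypothetical providers standing in for real vendor clients.
type flaky struct{}

func (flaky) Transcribe([]byte, string) (string, error) { return "", errors.New("timeout") }

type stable struct{}

func (stable) Transcribe([]byte, string) (string, error) { return "hello world", nil }

func main() {
	asr := withFallback{providers: []Recognizer{flaky{}, stable{}}}
	text, err := asr.Transcribe([]byte{0x01}, "en")
	fmt.Println(text, err)
}
```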