appssemble
Case Study: AI Communication Platform

Real-time AI translation across every platform

We built the full stack for a communication platform that translates voice and video in real time. Voice cloning, AI agents, video file processing. Six clients (web, desktop, iOS, Android, Slack, Teams) all running against one real-time backend.

[Hero animation: a live call where Speaker A's English is transcribed, translated, and re-voiced (STT → TTS with voice clone) into French for Speaker B, under 500ms end-to-end, with an AI agent listening. Language pairs shown: EN → FR, DE → ES, JA → EN, ZH → PT. Platforms: web, desktop, iOS, Android, Slack, Teams.]
- <500ms voice translation latency
- 6 platforms shipped
- 70+ languages supported
What it does

The full system

01

Live Voice & Video Calls with Translation

Group audio and video calls where each person speaks their own language and hears everyone else in theirs. Speech recognition, translation, and synthesis all happen in the same pipeline. We got end-to-end latency under 500ms on production calls with speaker tracking across the full conversation.

WebRTC · Real-time STT · Live translation · Multi-participant
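The STT → translate → TTS flow can be sketched as a per-chunk pipeline. This is a minimal illustration, not the production code: the stage functions are stubs standing in for external STT, translation, and synthesis engines, and all names here are hypothetical.

```python
import time

def speech_to_text(audio_chunk: bytes, lang: str) -> str:
    # Stub: a real system streams partial transcripts from an STT engine.
    return f"transcript({lang})"

def translate(text: str, src: str, dst: str) -> str:
    # Stub: machine translation between the two call languages.
    return f"{text}->{dst}"

def text_to_speech(text: str, voice_id: str) -> bytes:
    # Stub: synthesis with the speaker's cloned voice.
    return text.encode()

def translate_chunk(audio: bytes, src: str, dst: str, voice_id: str) -> tuple[bytes, float]:
    """Run one audio chunk through STT -> MT -> TTS and report elapsed ms."""
    start = time.perf_counter()
    text = speech_to_text(audio, src)
    translated = translate(text, src, dst)
    out = text_to_speech(translated, voice_id)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return out, elapsed_ms

audio_out, ms = translate_chunk(b"\x00" * 320, "en", "fr", "speaker-a")
```

The key property of the real pipeline is that all three stages run per chunk inside a single latency budget, rather than waiting for full utterances.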
02

Voice Cloning & Synthesis

Users record a short voice sample and the system clones it. When someone speaks English and another participant hears it in French, it still sounds like the original person. We built per-user and per-project voice profiles with async processing and automatic cleanup of stale samples.

Voice cloning · TTS synthesis · Voice profiles · Async processing
03

AI Agents in Calls

AI agents that join calls and conversations as regular participants. They listen, respond, and take actions based on configurable personalities and knowledge bases. For complex scenarios we built multi-agent handoff where one agent handles intake and another handles resolution. The system also tracks sentiment and mood across all participants in real time.

AI agents · Multi-agent handoff · Sentiment tracking · Configurable
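The intake-to-resolution handoff can be sketched as a chain of agents where each one either finishes the conversation or passes it on. Agent behavior is stubbed; a real agent would call an LLM with its configured personality and knowledge base:

```python
class Agent:
    def __init__(self, name: str, can_resolve: bool):
        self.name = name
        self.can_resolve = can_resolve

    def handle(self, message: str) -> tuple[str, bool]:
        # Returns (reply, done). The intake agent never resolves;
        # it only triages and hands off to the next agent.
        if self.can_resolve:
            return f"{self.name}: resolved '{message}'", True
        return f"{self.name}: triaged '{message}'", False

def run_conversation(message: str, agents: list["Agent"]) -> list[str]:
    transcript = []
    for agent in agents:
        reply, done = agent.handle(message)
        transcript.append(reply)
        if done:
            break
    return transcript

log = run_conversation("refund request",
                       [Agent("intake", False), Agent("resolution", True)])
```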
04

Video File Translation

Upload a video in any language and get it back translated with the original speakers' cloned voices. The pipeline extracts audio, separates speakers via diarization, translates each track independently, synthesizes speech using the cloned voice, and reassembles the final video. Subtitle generation included. All of it runs on distributed Celery task queues.

Speaker diarization · Video processing · Subtitle generation · Distributed tasks
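The task chain reads naturally as a sequence of steps. In this sketch each Celery task is replaced by a plain function so the flow runs without a broker; step names mirror the description above, not the production code:

```python
def extract_audio(video: dict) -> dict:
    video["audio"] = "raw-audio"
    return video

def diarize(video: dict) -> dict:
    # Split the audio into per-speaker tracks.
    video["tracks"] = {"speaker_1": "track-a", "speaker_2": "track-b"}
    return video

def translate_tracks(video: dict, dst: str) -> dict:
    # Each speaker's track is translated independently.
    video["translated"] = {s: f"{t}:{dst}" for s, t in video["tracks"].items()}
    return video

def synthesize(video: dict) -> dict:
    # Re-voice each translated track with that speaker's cloned voice.
    video["voiced"] = {s: f"tts({t})" for s, t in video["translated"].items()}
    return video

def reassemble(video: dict) -> dict:
    video["output"] = "final.mp4"
    return video

def pipeline(video: dict, dst: str) -> dict:
    for step in (extract_audio, diarize,
                 lambda v: translate_tracks(v, dst),
                 synthesize, reassemble):
        video = step(video)
    return video

result = pipeline({"source": "input.mp4"}, "fr")
```

With Celery, the same shape would be expressed as a chain of task signatures, with the per-track work fanned out in parallel.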
05

Cross-Platform Messaging

Real-time chat with on-demand message translation, reactions, threading, and file sharing. The same feature set runs on web, Electron desktop, iOS, and Android. WebSocket channels handle instant delivery with read receipts and presence indicators. We also built quick chat rooms for throwaway conversations that don't require registration.

WebSocket · Cross-platform · Real-time · Quick chat
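The presence side of this can be sketched as the server-side bookkeeping behind the indicators, with the WebSocket channel layer abstracted away. Names are illustrative:

```python
class PresenceTracker:
    """Tracks which users are online per room; the transport layer
    (WebSocket join/leave events) would drive these calls."""

    def __init__(self):
        self._online: dict[str, set[str]] = {}  # room id -> user ids

    def join(self, room: str, user: str) -> None:
        self._online.setdefault(room, set()).add(user)

    def leave(self, room: str, user: str) -> None:
        self._online.get(room, set()).discard(user)

    def who_is_online(self, room: str) -> set[str]:
        # Return a copy so callers cannot mutate internal state.
        return set(self._online.get(room, set()))
```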
06

Meetings & Conferences

Scheduled meetings with auto-generated summaries and full transcript export. For large events there is a conference mode with speaker roles, registration with approval workflows, live polling, and recording. Calendar integration pulls from Microsoft and Google so scheduling stays in sync.

AI summaries · Conference mode · Polling · Calendar sync
Under the hood

What powers it

Ruby on Rails

The API backend. 107 models, 40+ async job types, authentication, payments, and the WebSocket broadcast layer via AnyCable.

Go WebSocket Server

A separate Go service we wrote specifically for the translation pipeline. Handles dictation, conversation translation, and voice resolution at latencies Rails could not hit.

Python ML Services

Agent framework, video translation pipeline, speaker diarization, and voice activity detection. Each service is containerized and deploys independently.
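Voice activity detection is the simplest of these to illustrate. The production services use trained models, but a toy energy-based detector shows the frame-and-threshold structure the pipeline is built around (all names here are illustrative):

```python
def frame_energy(samples: list[float]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / max(len(samples), 1)

def detect_speech(frames: list[list[float]], threshold: float = 0.01) -> list[bool]:
    """Mark each frame as speech (True) or silence (False)."""
    return [frame_energy(f) > threshold for f in frames]

silence = [0.0] * 160   # a 10ms frame of silence at 16kHz
speech = [0.5] * 160    # a frame with audible signal
flags = detect_speech([silence, speech, silence])
```

A model-based VAD replaces `frame_energy` with a learned score, but downstream consumers (diarization, STT gating) see the same per-frame booleans.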

React & Electron

Web and desktop apps sharing one component library. The same codebase runs in the browser and as a native desktop application.

Native iOS & Android

Native Swift (iOS) and Kotlin (Android) apps with full feature parity. Calls, chat, translation, voice cloning, and agent interactions all work on mobile.

LiveKit

Video and audio infrastructure for multi-participant calls with real-time processing and recording.

Process
Shape-up · 4-week cycles · Daily syncs
Technologies
Ruby on Rails · Go · Python · React · Electron · Swift · Kotlin · PostgreSQL · Redis · LiveKit · Docker
Takeaways

What we learned

01

Latency is the whole game

In real-time voice translation, anything over half a second breaks the conversation. Rails could not hit the target so we built a dedicated Go service for the translation pipeline. It was the right call. Pick the language that fits the constraint.

02

Six platforms, one truth

Web, desktop, iOS, Android, Slack, and Teams all needed the same features. We learned fast that the API contract is the actual product. Get it right and the clients are straightforward. Get it wrong and you end up maintaining six applications that slowly drift apart.
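"The API contract is the actual product" boils down to one wire schema that every client renders identically. A minimal sketch of what that might look like, with entirely hypothetical field names (the production schema is not shown in this case study):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChatMessage:
    """One message as every client sees it; frozen so no client
    mutates the contract locally and drifts from the others."""
    id: str
    sender_id: str
    body: str
    source_lang: str
    schema_version: int = 1

    def to_wire(self) -> dict:
        return asdict(self)

msg = ChatMessage("m1", "u1", "bonjour", "fr")
```

Versioning the schema explicitly is what lets six clients upgrade on their own release cadences without drifting apart.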

03

Models are the easy part

Wiring up a speech model takes a day. Making it work at scale with speaker diarization, voice cloning, failure recovery, and cost monitoring across 70+ languages is where the real engineering lives. The models are interchangeable. The system around them is what matters.

Case studies

More work

- Grovs: 10M+ events / day · Open Source · Attribution · Deep Linking
- Semaphr: 500K+ app sessions · SaaS · iOS SDK · Android SDK
- Incasez: <30s draft to e-Factura · Invoicing · e-Factura · SaaS
- HDR Plus+: 8 RAW frames merged · iOS · ML · TensorFlow
- Dezigner: 100% offline ML · iOS · AR · ML
Let's talk about your project
[email protected]
Offices
New York: 1740 Broadway, 15th Floor, 10019
London: Kemp House, 160 City Road, EC1V 2NX
Cluj-Napoca: Blvd. 21 Decembrie 1989, 95-97
Social: LinkedIn · GitHub
© 2026 appssemble. All rights reserved.