Feb 19, 2026

Eliminating the Mic Activation Delay in a Real-Time AI Interviewer

Every millisecond of latency in a voice-driven app erodes the user experience. At Intervu, the AI interviewer speaks and then the candidate responds, a natural back-and-forth that should feel like a real conversation. But we had a problem: after the AI finished speaking, there was a noticeable gap before the microphone went live. Candidates would start talking, only to realize the first syllable or two had been lost.

This is the story of how we tracked it down and cut the delay from roughly 850ms to about 400ms without sacrificing any UX affordances. Try the result at intervu.dev.

The Symptoms

In hands-free mode, Intervu automatically re-activates the microphone after each interviewer turn. Users reported that the mic felt “slow to respond” — they’d start answering and the first word would get clipped.

The debug logs confirmed it. The time between audio.onended (TTS finished) and setIsRecording(true) (mic hot) was consistently around 850ms or more.

That’s a long time in a conversation.
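For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a timing probe. The helper name createLatencyProbe and the mark labels are illustrative assumptions, not part of the Intervu code; the only idea taken from the logs above is timestamping the moment TTS ends and the moment the mic goes hot.

```javascript
// Hypothetical helper (not from the Intervu codebase): records named marks
// and reports the elapsed time between any two of them.
// `now` is injectable so the probe can be driven by a fake clock in tests.
function createLatencyProbe(now = () => performance.now()) {
    const marks = new Map();
    return {
        // Record the current time under a label, e.g. "tts-ended".
        mark(label) {
            marks.set(label, now());
        },
        // Milliseconds between two marks, or null if either is missing.
        between(startLabel, endLabel) {
            if (!marks.has(startLabel) || !marks.has(endLabel)) return null;
            return marks.get(endLabel) - marks.get(startLabel);
        },
    };
}

// Usage sketch (assumed call sites):
//   inside the audio.onended handler:   probe.mark("tts-ended");
//   right after setIsRecording(true):   probe.mark("mic-hot");
//   then: console.log(probe.between("tts-ended", "mic-hot"), "ms to mic-hot");
```

Keeping the clock injectable means the same probe can be unit-tested without real timers.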

Tracing the Delay Chain

The hands-free restart logic was straightforward:

// Auto-restart recording in hands-free mode
useEffect(() => {
    if (handsFreeMode && !loading && !isPlaying && !isRecording && ...) {
        const timer = setTimeout(() => {
            startRecording();
        }, 600); // "ensure audio playback is fully finished"
        return () => clearTimeout(timer);
    }
}, [isPlaying, ...]);

And inside startRecording:

// Play a "mic going live" beep, then open the WebSocket...
await playMicActivationCue();   // ~250ms (two-tone beep)

const ws = new WebSocket(wsUrl); // open a fresh connection
// The recording only starts when ws.onopen fires
ws.onopen = () => {
    mediaRecorder.start(80);
    setIsRecording(true);
};

The breakdown:

Stage	Time
`setTimeout` buffer	600ms
Activation cue (two beeps)	~250ms
WebSocket handshake	~20–100ms
Total	~850ms+

The 600ms timeout was added out of an abundance of caution (“let audio fully finish”) and then never questioned. The isPlaying state was already accurate: audio.onended fires when playback reaches the end of the audio, so there was no need for a 600ms insurance policy.

The less obvious culprit was the WebSocket. Every turn, we’d throw away the old connection and open a fresh one, waiting for the handshake to complete before the recorder could even start. Even on localhost this added latency; on a real internet connection it was noticeably worse.

The Fix: Pre-warm the WebSocket During TTS

The key insight: the WebSocket connection is needed after TTS ends, but there’s nothing stopping us from opening it during TTS. The interviewer is talking for a few seconds anyway — plenty of time for the handshake to complete in the background.

We added a prewarmSttSocket function:

const prewarmSttSocket = () => {
    if (!sttStreaming) return;

    // Don't double-open if already connecting/connected
    const existing = sttSocketRef.current;
    if (existing && (existing.readyState === WebSocket.OPEN 
                  || existing.readyState === WebSocket.CONNECTING)) return;

    const ws = new WebSocket(wsUrl);
    sttSocketRef.current = ws;

    ws.onopen = () => console.log("🎵 [STT PREWARM] Socket pre-warmed and ready");
    ws.onerror = () => {
        // A failed handshake leaves nothing to reuse; clear the ref
        if (sttSocketRef.current === ws) sttSocketRef.current = null;
    };
    ws.onclose = () => {
        // Safety: if socket dies before startRecording uses it,
        // clear the ref so the cold path kicks in
        if (sttSocketRef.current === ws) sttSocketRef.current = null;
    };
};

And a useEffect to trigger it as soon as TTS starts playing:

useEffect(() => {
    if (isPlaying && handsFreeMode && sttStreaming && audioConsentStatus === 'granted') {
        prewarmSttSocket();
    }
}, [isPlaying]);
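The double-open guard is the part of prewarmSttSocket most worth exercising in isolation. Below is a framework-free sketch of the same guard, assuming a hypothetical createPrewarmer factory that takes the socket constructor as a parameter so a fake can be substituted in tests; ref plays the role of sttSocketRef.

```javascript
// Framework-free sketch of the pre-warm guard. `createPrewarmer` and
// `openSocket` are hypothetical names; `ref` stands in for sttSocketRef.
const CONNECTING = 0; // same value as WebSocket.CONNECTING
const OPEN = 1;       // same value as WebSocket.OPEN

function createPrewarmer(openSocket, ref = { current: null }) {
    return {
        ref,
        prewarm() {
            const existing = ref.current;
            // Reuse an in-flight or open socket instead of opening another.
            if (existing &&
                (existing.readyState === OPEN || existing.readyState === CONNECTING)) {
                return existing;
            }

            const ws = openSocket();
            ref.current = ws;
            // Only clear the ref if it still points at this socket, so a stale
            // close event cannot wipe out a newer connection.
            const clearIfCurrent = () => {
                if (ref.current === ws) ref.current = null;
            };
            ws.onerror = clearIfCurrent;
            ws.onclose = clearIfCurrent;
            return ws;
        },
    };
}

// In the browser this would be wired up as (wsUrl as in the snippets above):
//   const prewarmer = createPrewarmer(() => new WebSocket(wsUrl));
```

The identity check in clearIfCurrent is what keeps a late close event from an old socket from discarding a newer one.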

Then in startRecording, we check if we can skip the handshake:

const existingWs = sttSocketRef.current;
const isPreWarmed = existingWs && existingWs.readyState === WebSocket.OPEN;

if (isPreWarmed) {
    // Fast path: socket already open, start recording immediately
    console.log("🎵 [STT STREAMING] [WS] Reusing pre-warmed socket");
    mediaRecorder.start(80);
    setIsRecording(true);
} else {
    // Cold path: open a new socket (same as before)
    const prewarmState = existingWs
        ? ['CONNECTING','OPEN','CLOSING','CLOSED'][existingWs.readyState]
        : 'null';
    console.warn(`🎵 [STT STREAMING] [WS] Cold path hit (pre-warm state: ${prewarmState})`);
    const ws = new WebSocket(wsUrl);
    // ... wait for onopen, then start
}

We also reduced the setTimeout from 600ms to 150ms — enough for React state to propagate, but no longer a full half-second insurance policy.

What The Console Tells You

Open DevTools and you’ll now see one of two things after each interviewer turn:

✅ Fast path (expected in hands-free mode):

DEBUG: 🎵 [STT PREWARM] Socket pre-warmed and ready
DEBUG: 🎵 [STT STREAMING] [WS] Reusing pre-warmed socket

⚠️ Cold path (TTS was too short for the handshake to complete):

DEBUG: 🎵 [STT STREAMING] [WS] Cold path hit (pre-warm state: CONNECTING)

The CONNECTING state tells you the pre-warm was triggered but the handshake did not finish in time, which is a useful signal for tuning if you see it frequently.

Results

	Before	After
Auto-restart timeout	600ms	150ms
WebSocket handshake	~20–100ms (critical path)	0ms (pre-warmed)
Activation cue	~250ms	~250ms (unchanged)
Total to mic-hot	~850ms+	~400ms

The activation cue beep was intentionally kept — it gives candidates a clear audio signal that the mic is live, which reduces anxiety in an already high-stakes setting.
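As a sanity check on the table, the totals can be recomputed from the per-stage numbers. The sketch below treats the handshake as a 20–100ms range and the cue as a flat 250ms (both approximations taken from the tables in this post); the variable and function names are illustrative.

```javascript
// Stage timings in ms as [min, max], taken from the tables in this post.
// The cue is listed as "~250ms", so these totals are approximate.
const beforeStages = { timeout: [600, 600], cue: [250, 250], handshake: [20, 100] };
const afterStages  = { timeout: [150, 150], cue: [250, 250], handshake: [0, 0] };

// Sum the lower and upper bounds of every stage on the critical path.
function totalRange(stages) {
    return Object.values(stages).reduce(
        ([lo, hi], [min, max]) => [lo + min, hi + max],
        [0, 0]
    );
}

console.log(totalRange(beforeStages)); // [ 870, 950 ]
console.log(totalRange(afterStages));  // [ 400, 400 ]
```

The “~850ms+” figure is the floor before the handshake is counted (600 + 250); adding the handshake puts the old path at 870 to 950ms, against about 400ms on the fast path.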

The Graceful Fallback

Pre-warming is purely additive. If the socket isn’t ready (network hiccup, very short TTS turn, first session turn), startRecording detects readyState !== OPEN and falls back to the cold path — identical to the original behavior. No regression, no error handling required beyond what was already there.

The pre-warm socket’s onclose handler also defensively nulls out sttSocketRef if the connection drops before it’s used, ensuring the cold path is taken cleanly.
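The fallback rule reduces to a single check on readyState, which makes it easy to pull out and test. Here is a minimal sketch, assuming a hypothetical chooseRecordingPath helper that is not part of the code shown earlier; it returns both the chosen path and the state name used in the cold-path log.

```javascript
// Hypothetical helper: decides which path startRecording would take.
// Index order matches the WebSocket readyState constants (0 through 3).
const READY_STATE_NAMES = ["CONNECTING", "OPEN", "CLOSING", "CLOSED"];

function chooseRecordingPath(socket) {
    // A missing socket is reported as "null", mirroring the cold-path log.
    const state = socket ? READY_STATE_NAMES[socket.readyState] : "null";
    // Only a fully open socket can accept audio right away.
    return { path: state === "OPEN" ? "fast" : "cold", state };
}

console.log(chooseRecordingPath({ readyState: 1 })); // { path: 'fast', state: 'OPEN' }
console.log(chooseRecordingPath({ readyState: 0 })); // { path: 'cold', state: 'CONNECTING' }
console.log(chooseRecordingPath(null));              // { path: 'cold', state: 'null' }
```

Anything other than OPEN, including a handshake still in flight, takes the cold path.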

Lessons Learned

Parallelize what you can: If you know you’ll need a resource soon, start acquiring it early. The interviewer speaking is dead time for the mic — use it.
State is your source of truth: The 600ms buffer existed because someone wasn’t confident isPlaying was accurate. Trusting the state made the unnecessary buffer obvious.
Log the slow path: console.warn on the cold path has already surfaced a few edge cases in testing that would have been invisible otherwise.

Happy interviewing — and may your first syllable never be clipped again.