Engineering · April 2026

Three models, one pipeline.

A short engineering note on how Whisper, pyannote, and DeBERTa compose in the Felarity pipeline — and why the order is load-bearing.

Most "AI meeting tools" are wrappers around a single transcription model. We wrote Felarity differently because we are not building a transcription tool — we are building a forensic record of what was said, who said it, and where it contradicts itself or the prior record. That requires three specialist models doing three different jobs, in a specific order, with explicit handoffs between them. This note describes how they compose.

Setup

The three models, and where each runs in production:

Whisper large-v3 — speech-to-text. Runs on an RTX 3060 on our audio host. VAD-aware chunked decoding with leading-frame echo handling.
pyannote/speaker-diarization-3.1 — segment boundaries and speaker embeddings. Same physical host, loaded lazily on first diarize=true call; ~10 GB VRAM resident once loaded.
DeBERTa-v3 NLI cross-encoder — a calibrator for contradiction candidates. Runs on CPU on the application host. ~200 ms per (premise, hypothesis) pair; that's fast enough that we don't need a GPU on the application host at all.

There is also a council of larger generative models that does the analytical work — contradiction detection during the meeting, deep re-analysis after — but the three above are the ones that turn raw audio into the structured evidence the council reasons over. The handoff between them is the part people usually get wrong.

1. Whisper: text from audio

We use Whisper large-v3 because it is, empirically, the best openly available transcription model for English meeting audio with cross-talk. The decoding is not vanilla. Two things matter operationally:

VAD-aware chunking. The browser sends a 5-second WebM slice every 5 seconds via MediaRecorder. Only the first slice carries the container header (the init segment) that a decoder needs to interpret the stream, so we prepend that init segment to every later chunk on the way in. Whisper then sees a valid file and decodes it with its own internal VAD, which means the chunk boundary is not the segment boundary: Whisper will collapse a 4.3-second silence at the start of a chunk into nothing, and split a single long utterance that straddles two chunks into two attributed pieces.
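The init-segment prepend can be sketched roughly as follows. This is an illustration, not the production code: the function names are hypothetical, the init segment is assumed to be everything before the first Cluster element of the session's first slice, and later slices are assumed to begin at a Cluster boundary. A real implementation would parse the EBML structure instead of scanning for ID bytes.

```python
EBML_MAGIC = b"\x1a\x45\xdf\xa3"   # EBML header ID: first bytes of a WebM file
CLUSTER_ID = b"\x1f\x43\xb6\x75"   # Matroska Cluster element ID: media data starts here


def extract_init_segment(first_chunk: bytes) -> bytes:
    """Return everything before the first Cluster in the session's first chunk.

    Heuristic (assumption): scans for the Cluster ID bytes rather than
    parsing the EBML element tree.
    """
    if not first_chunk.startswith(EBML_MAGIC):
        raise ValueError("first chunk does not start with an EBML header")
    cluster_at = first_chunk.find(CLUSTER_ID)
    if cluster_at == -1:
        raise ValueError("no Cluster element found in first chunk")
    return first_chunk[:cluster_at]


def make_standalone(init_segment: bytes, chunk: bytes) -> bytes:
    """Return bytes that a decoder can open without any earlier chunk."""
    if chunk.startswith(EBML_MAGIC):
        return chunk  # already carries its own header
    return init_segment + chunk
```

The header check keeps the function safe to call on every slice, including the first, so the caller does not need to track which slice it is handling.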

Leading-frame echo handling. Because chunks overlap slightly in the encoder's view of audio context, Whisper occasionally repeats the last sentence of the previous chunk at the start of the next one. We deduplicate two ways: a session-wide seenSentences set keyed by normalised text, and an explicit "strip leading repeats against the previous chunk's tail" pass. Without both, the live transcript drifts about 8% long on a 45-minute meeting, which corrupts everything downstream.
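The two deduplication passes might look something like the sketch below. The class name, the word-level overlap match, the 30-word tail, and the 3-word minimum overlap are all assumptions made for illustration; only the idea of a session-wide set keyed by normalised text plus a strip-against-previous-tail pass comes from the description above.

```python
import re


def normalise(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def split_sentences(text: str) -> list:
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


class ChunkDeduper:
    def __init__(self, tail_words: int = 30, min_overlap: int = 3):
        self.seen_sentences = set()   # session-wide, keyed by normalised text
        self.prev_tail = []           # normalised words ending the previous chunk
        self.tail_words = tail_words
        self.min_overlap = min_overlap

    def _strip_leading_repeat(self, words: list) -> list:
        """Drop the longest prefix of `words` that repeats the previous tail."""
        keys = [normalise(w) for w in words]
        longest = min(len(self.prev_tail), len(keys))
        for k in range(longest, self.min_overlap - 1, -1):
            if self.prev_tail[-k:] == keys[:k]:
                return words[k:]
        return words

    def process(self, chunk_text: str) -> str:
        raw_words = chunk_text.split()
        # Pass 1: strip a leading echo of the previous chunk's tail.
        words = self._strip_leading_repeat(raw_words)
        # Pass 2: drop sentences already seen anywhere in the session.
        kept = []
        for sentence in split_sentences(" ".join(words)):
            key = normalise(sentence)
            if key and key not in self.seen_sentences:
                self.seen_sentences.add(key)
                kept.append(sentence)
        self.prev_tail = [normalise(w) for w in raw_words][-self.tail_words:]
        return " ".join(kept)
```

The tail pass catches echoes that begin mid-sentence, which an exact sentence match would treat as new text; the session-wide set catches whole-sentence repeats that are no longer adjacent to their original.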

Whisper's output at this stage is text plus timestamps. Crucially, it has no idea who said anything. That's the next model's job, and the gap between "we have words with timestamps" and "we have words with speakers" is where most pipelines lose forensic value.

2. pyannote: segments and embeddings

pyannote-3.1 does two things for us, and we use both:

Segment boundaries. The diarizer returns a list of (start, end, speaker_label) tuples. The labels are local to the session — speaker SPK_00 in one meeting is not the same person as SPK_00 in another, and we never pretend otherwise. What matters is that within a session, the same label consistently refers to the same voice.

Speaker embeddings. pyannote internally produces speaker embeddings to do the clustering. We extract one clean 4-second WAV per detected speaker — chosen for SNR, not for content — and persist it. That sample is what the user hears when they click "play sample" on a credibility card. It is also what lets a workspace owner attach a real human name to SPK_00 in post-session review, without us ever doing identification automatically.

Live diarization runs against the 5-second chunks for fast attribution during the meeting; the post-session pipeline re-runs diarization against the concatenated full-session audio, which is meaningfully more accurate (longer context, more clustering data, full silence model). The two passes occasionally disagree — when they do, the full-session pass wins and the live attribution is corrected in the saved record.
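To make the (start, end, speaker_label) tuples concrete, here is one plausible way to attach them to timestamped transcript segments: give each segment the label with the greatest total time overlap. The matching rule and the type names are assumptions for illustration; the actual binding logic is not described in this note.

```python
from typing import List, NamedTuple, Optional, Tuple


class Turn(NamedTuple):
    start: float
    end: float
    speaker: str   # session-local label such as "SPK_00"


class Segment(NamedTuple):
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def bind(segments: List[Segment], turns: List[Turn]) -> List[Tuple[Segment, Optional[str]]]:
    """Pair each segment with the speaker label that overlaps it the most."""
    bound = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        # No overlapping turn at all: leave the segment unattributed.
        speaker = max(totals, key=totals.get) if totals else None
        bound.append((seg, speaker))
    return bound
```

Returning None instead of guessing keeps unattributed text visibly unattributed, which matters more in a forensic record than full coverage.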

3. DeBERTa: calibrating the council

Live contradiction detection uses our 27B-parameter council model on a GPU server. It is good. It is not perfect. Generative models will sometimes flag stylistic disagreements as contradictions, or miss a subtle one because the contradicted statement was 30 minutes earlier in context.

DeBERTa-v3 NLI is a much smaller cross-encoder fine-tuned specifically on the natural-language-inference task: given a premise and a hypothesis, output entailment, neutral, or contradiction with calibrated probabilities. It runs on CPU at about 200 ms per pair. In the post-session pipeline, every contradiction candidate the council surfaced is re-scored as a (premise, hypothesis) pair. Candidates that DeBERTa rates below threshold are demoted; candidates the council missed but the topology pass surfaces as suspicious are also passed through DeBERTa before being added.
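The re-scoring step could be sketched as below. The scorer is injected as a plain callable returning three logits, so nothing here depends on a specific model API. The label order, the 0.5 threshold, and every name in the block are assumptions, not values from the production pipeline.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Assumed label order of the scorer's logits; check the model card of the
# cross-encoder actually in use before relying on this.
LABELS = ("contradiction", "entailment", "neutral")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


@dataclass
class Candidate:
    premise: str
    hypothesis: str
    nli_contradiction: Optional[float] = None
    demoted: bool = False


def recalibrate(
    candidates: List[Candidate],
    score_logits: Callable[[str, str], Sequence[float]],
    threshold: float = 0.5,   # hypothetical cut-off
) -> List[Candidate]:
    """Attach an NLI contradiction probability and demote weak candidates."""
    for cand in candidates:
        probs = softmax(score_logits(cand.premise, cand.hypothesis))
        cand.nli_contradiction = probs[LABELS.index("contradiction")]
        cand.demoted = cand.nli_contradiction < threshold
    return candidates
```

Demoted candidates are flagged, not deleted, so a reviewer can still see what the council raised and why the second witness disagreed.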

The point is not that DeBERTa is smarter than the council. It isn't. The point is that it is different — a different training objective, a different architecture, a different parameter scale — so when both agree a contradiction exists, you have two independent witnesses. That is the forensic standard we want.

The handoff order

The order matters and is asymmetric. We do contradiction detection before attribution binding, and we sign the record at both points. The reason is exactly what you'd expect from anyone who has handled disputed records before: if a contradiction is only ever observed after a name is attached to it, an adversary can argue the attribution biased the detection. By detecting contradictions on un-attributed text and only then binding them to a speaker, we can attest cryptographically that the contradiction existed in the transcript independent of who said it.

The eight-node attestation chain (audio capture → transcription → contradiction detection (pre-attribution) → diarization → attribution binding → acoustic analysis → topology → final report) reflects this directly. Each node is hashed with the previous node's hash; the whole chain is signed with our Ed25519 production key. The pre-attribution detection node is node 3. The attribution binding node is node 5. You can verify the order from any saved report via POST /api/verify.
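A stdlib-only sketch of a chain in which each node's hash covers the previous node's hash is shown below. The snake_case node names, the JSON canonicalisation, the all-zero genesis value, and the record layout are assumptions for illustration. Ed25519 signing of the final hash is deliberately left out, since it needs a third-party library.

```python
import hashlib
import json

# Node order taken from the chain described above; the identifiers are made up.
NODE_NAMES = [
    "audio_capture",
    "transcription",
    "contradiction_detection_pre_attribution",
    "diarization",
    "attribution_binding",
    "acoustic_analysis",
    "topology",
    "final_report",
]
GENESIS = "0" * 64  # assumed placeholder for "no previous node"


def content_hash(payload) -> str:
    """SHA-256 over a canonical JSON encoding of a node's output."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def link_hash(prev_hash: str, name: str, c_hash: str) -> str:
    """Hash that commits to the previous node, this node's name and content."""
    return hashlib.sha256(f"{prev_hash}:{name}:{c_hash}".encode("ascii")).hexdigest()


def build_chain(payloads: dict) -> list:
    chain, prev_hash = [], GENESIS
    for name in NODE_NAMES:
        c_hash = content_hash(payloads[name])
        node_hash = link_hash(prev_hash, name, c_hash)
        chain.append({"name": name, "content_hash": c_hash,
                      "prev_hash": prev_hash, "hash": node_hash})
        prev_hash = node_hash
    return chain  # chain[-1]["hash"] is the value a signature would cover


def verify_chain(chain: list) -> bool:
    """Check node order and recompute every link."""
    if [node["name"] for node in chain] != NODE_NAMES:
        return False
    prev_hash = GENESIS
    for node in chain:
        expected = link_hash(prev_hash, node["name"], node["content_hash"])
        if node["prev_hash"] != prev_hash or node["hash"] != expected:
            return False
        prev_hash = node["hash"]
    return True
```

Because each link includes the previous hash, moving pre-attribution detection after attribution binding changes every hash from that point on, which is what makes the order checkable.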

Pseudocode

From core/post_session.py, condensed:

def run_post_session(session_id):
    audio = concat_chunks(session_id)                       # node 1
    text  = whisper.transcribe(audio, vad=True)             # node 2
    # Detection runs on un-attributed text; keep this list as-is for attestation.
    contradictions_pre = council.detect(text)               # node 3 (pre-attribution)
    segments = pyannote.diarize(audio)                      # node 4
    text_with_speakers = bind(text, segments)               # node 5
    acoustics = analyze_acoustics(audio, segments)          # node 6
    graph = topology.build(contradictions_pre, text_with_speakers)  # node 7
    # Re-score the candidates with the NLI cross-encoder, then run the deep review.
    contradictions = deberta.recalibrate(contradictions_pre, text_with_speakers)
    contradictions = council.deep_review(
        contradictions, text_with_speakers, acoustics, graph,
    )
    report = assemble(text_with_speakers, contradictions, acoustics, graph)
    chain = attest(  # node 8: 8-node SHA-256 Merkle, Ed25519 signature
        audio, text, contradictions_pre, segments,
        text_with_speakers, acoustics, graph, report,
    )
    return report, chain

The contradictions_pre variable in the attestation call is the contradiction list as it stood at node 3, before binding. We persist both states so the chain can be reconstructed.

What about end-to-end models?

We evaluated several. Joint speech-to-text-plus-diarization models exist and are improving fast. They are not currently a fit for what Felarity is trying to be, for one reason: observability is the product. When a customer asks why a contradiction was flagged, we need to be able to point at the specific node in the chain where the determination was made, show them the inputs, and let them re-run that node independently. A single end-to-end model whose internals are opaque cannot meet that bar, even if its raw accuracy is higher. We will reconsider when an end-to-end model ships with first-class per-stage attestation hooks. We are not holding our breath.

Composition over consolidation

The temptation in 2026 is to put one large model in the middle of every product and stop thinking about it. We don't think that's right for evidentiary work. Three specialised models, in the right order, with explicit handoffs and cryptographic attestation at each boundary, is more code to maintain — and it is exactly the architecture we'd want if we were on the other side of the table trying to verify someone else's transcript. Composition over consolidation. That's the rule.

If you want to see what a signed report looks like, the verifier is open at /trust/verify/ and the Ed25519 public key is at /.well-known/felarity-signing-key.pem.