Essay · Engineering

Why we build forensic, not transcription.

Most "meeting AI" tools land on a transcript and stop. Felarity starts where transcription ends.

There is a scene that recurs in every demo we've ever sat through with a meeting-AI vendor. The salesperson plays a recording. The transcript scrolls down the right side of the screen, color-coded by speaker. At the bottom of the call, a panel appears called "summary" or "key takeaways" or "action items." Everyone nods. The product is doing what it says it does.

The problem is that the transcript is right and the summary is wrong.

Not catastrophically wrong. Wrong in the way that matters. The summary collapses three contradictory statements into one bullet. It assigns a commitment to the wrong person because the speaker labels drifted by one turn for ninety seconds in the middle of the call. It elides the moment where one party asked a direct question and another party answered an adjacent question instead of the one asked. The transcript captured every word. The summary lost everything you would have actually wanted to know.

This is the category problem with meeting AI as it is currently sold. The product surface is "we will turn your spoken words into written words, and then we will write a tidy paragraph about it." That is a productivity tool. It is genuinely useful for a project manager who could not attend, or for a salesperson who needs a CRM note. It is not the tool you want when the question you are about to ask of the recording is adversarial.

A different question shape

The questions we built Felarity to answer have a different shape:

Did this witness say something on Tuesday that contradicts what they said on Friday, in a way that survives challenge on the underlying audio?
Which of the four people in this room walked back a position, and at what point in the conversation did the walk-back begin?
When the regulator asked "have you tested this in the last six months," what was the response — not the paraphrase, the response — and did the speaker hesitate before giving it?
Of the seventeen "commitments" the summary tool extracted, how many were actually made by the person they were attributed to, and how many were made by someone else and misattributed because the diarization slipped?

None of these questions are answerable by a transcript. They are answerable only by something we started calling, internally, a forensic intelligence layer — a stack whose primary outputs are not text but findings, and whose findings come with the kind of provenance you can defend in a setting where someone is going to push back.

What "forensic" actually means in our build

The word "forensic" gets thrown around in marketing copy until it stops meaning anything. So we'll be specific. When we say Felarity is a forensic tool, we mean four concrete engineering decisions, each of which costs us something we could otherwise sell as a feature.

Chain of custody on the audio itself. Every five-second chunk of audio that arrives at our session endpoint is hashed as it lands, before any model touches it. Those hashes are linked. If a chunk is altered, dropped, or reordered between capture and the final report, the chain breaks and the report says so. A transcription tool does not need this; an evidence-grade tool cannot ship without it.
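
The linking described above can be illustrated with a minimal sketch. It is not Felarity's implementation: SHA-256, the all-zero starting link, and the function names are assumptions made for the example. Each link commits to the chunk's own hash and to the previous link, so altering, dropping, or reordering a chunk changes every link after it.

```python
import hashlib
from typing import Optional

# Hypothetical starting link; the essay does not specify one.
GENESIS = b"\x00" * 32


def link(prev_digest: bytes, chunk: bytes) -> bytes:
    """Hash one audio chunk together with the previous link's digest."""
    chunk_digest = hashlib.sha256(chunk).digest()
    return hashlib.sha256(prev_digest + chunk_digest).digest()


def build_chain(chunks: list) -> list:
    """Return one digest per chunk, each committing to everything before it."""
    digests = []
    prev = GENESIS
    for chunk in chunks:
        prev = link(prev, chunk)
        digests.append(prev)
    return digests


def first_break(chunks: list, recorded: list) -> Optional[int]:
    """Index of the first position where the chain no longer matches, else None."""
    prev = GENESIS
    for i, chunk in enumerate(chunks):
        prev = link(prev, chunk)
        if i >= len(recorded) or prev != recorded[i]:
            return i
    if len(chunks) < len(recorded):
        return len(chunks)  # trailing chunks were dropped
    return None
```

Calling first_break with the chunks as received and the digests recorded at capture returns None when the chain is intact, and otherwise the position a report would need to flag.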

Ordering matters, and ordering is preserved. The eight stages of our post-session pipeline — capture, transcription, contradiction detection, diarization, attribution, acoustics, topology, final report — are written into the attestation chain in that order. You can re-derive any later stage from any earlier one. You cannot rewrite history by re-running a single stage in isolation. The chain says what was computed, when, and from what input.
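
One way to picture an ordered attestation chain is sketched below. The stage names come from the list above; the entry fields, the JSON serialization, and SHA-256 are assumptions for illustration, not a description of the production format. Each entry records what was computed, when, and from what input, and refuses to be appended out of order.

```python
import hashlib
import json

# Stage order as listed in the essay; identifiers are illustrative.
STAGES = [
    "capture", "transcription", "contradiction_detection", "diarization",
    "attribution", "acoustics", "topology", "final_report",
]


def entry_digest(entry: dict) -> str:
    """Digest over every field of an entry except its own digest."""
    body = {k: v for k, v in entry.items() if k != "digest"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_stage(chain: list, stage: str, input_digest: str,
                 output_digest: str, at: str) -> None:
    """Append the next stage; reject anything that is not next in order."""
    if len(chain) >= len(STAGES):
        raise ValueError("chain is already complete")
    expected = STAGES[len(chain)]
    if stage != expected:
        raise ValueError(f"expected stage {expected!r}, got {stage!r}")
    entry = {
        "stage": stage,
        "input": input_digest,
        "output": output_digest,
        "at": at,
        "prev": chain[-1]["digest"] if chain else None,
    }
    entry["digest"] = entry_digest(entry)
    chain.append(entry)


def verify(chain: list) -> bool:
    """Check stage order, back-links, and per-entry digests."""
    prev = None
    for i, entry in enumerate(chain):
        if i >= len(STAGES) or entry["stage"] != STAGES[i]:
            return False
        if entry["prev"] != prev or entry["digest"] != entry_digest(entry):
            return False
        prev = entry["digest"]
    return True
```

Because every entry includes the digest of the one before it, re-running a single stage in isolation and swapping in its output would invalidate that entry and every later one.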

Detection happens before attribution. This one matters more than it sounds. When our pipeline notices a contradiction, it notices the contradiction in the raw transcript first, while speakers are still anonymous tokens (SPK_A, SPK_B). Attribution — binding the contradiction to a named human — happens later, as a separate, auditable step. If the diarizer makes a mistake, the contradiction does not disappear; it gets re-bound. The finding survives the error. We will come back to why this matters in a moment.

Signed, with a published key. The final report is signed with an Ed25519 key. The public half of that key is published at /.well-known/felarity-signing-key.pem and at /api/verify/public-key. Anyone — including someone who has never heard of us — can independently verify that a report came from us, unaltered, by fetching the public key and checking the signature against the report bytes. There is no "trust us." There is a key, in the open, and math.
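
A verification check along these lines could look like the sketch below. It assumes the third-party Python package cryptography, a PEM-encoded public key, and a detached signature available as raw bytes; how the signature is actually delivered alongside a report is not specified here, and the function name is hypothetical. Fetching the key from one of the two published locations is left to the caller.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519


def report_is_authentic(public_key_pem: bytes, report_bytes: bytes,
                        signature: bytes) -> bool:
    """Return True if `signature` is a valid Ed25519 signature over `report_bytes`."""
    key = serialization.load_pem_public_key(public_key_pem)
    if not isinstance(key, ed25519.Ed25519PublicKey):
        raise TypeError("expected an Ed25519 public key")
    try:
        # verify() returns None on success and raises on any mismatch.
        key.verify(signature, report_bytes)
    except InvalidSignature:
        return False
    return True
```

The check runs over the exact report bytes, so even a single changed byte, including whitespace, makes it fail.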

Why detection-before-attribution is the load-bearing decision

If we had to pick the one architectural choice that separates a forensic tool from a transcription tool, it would be this one.

The naive way to build "meeting AI that finds contradictions" is to diarize first, attach speaker labels to every utterance, and then ask a model "did any of these speakers contradict themselves?" That pipeline has a subtle, devastating failure mode. When the diarizer is wrong (and the diarizer is sometimes wrong, especially in three- and four-way conversations with overlap), the contradiction analysis runs over a corrupted input. The model doesn't see "Alice said X and Alice said not-X." It sees "Alice said X and Bob said not-X," which is not a contradiction at all. It is a disagreement, and disagreements are normal and uninteresting. The finding is lost.

Worse, the inverse failure happens too. Alice said X and Bob said not-X, an ordinary disagreement, but the diarizer assigns both utterances to Alice. The model sees "Alice said X and Alice said not-X" and reports a self-contradiction that never happened. The finding is fabricated, because it was computed on labels that lied.

By running contradiction detection on the raw transcript first, before any speaker label is attached, we get a finding that is robust to diarization error. The finding says: somewhere in this conversation, position X was asserted, and somewhere in this conversation, position not-X was asserted. That is true regardless of who said either part. Attribution is then a separate stage, with its own confidence score and its own audit trail. If the attribution is uncertain, the report says so. The finding does not get silently erased by a label mistake.
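
The separation can be sketched as a data structure: the finding holds only the two utterances, and attribution is an optional layer bound on top of it. The class names, the 0.8 confidence threshold, and the midpoint lookup are assumptions for illustration, not the production design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Utterance:
    start_s: float
    end_s: float
    text: str


@dataclass(frozen=True)
class Attribution:
    speaker: str       # an anonymous token such as "SPK_A", or a name
    confidence: float


@dataclass
class Contradiction:
    # Detected on the raw transcript; valid with no speaker labels at all.
    assertion: Utterance
    negation: Utterance
    assertion_by: Optional[Attribution] = None
    negation_by: Optional[Attribution] = None

    def kind(self, min_confidence: float = 0.8) -> str:
        a, b = self.assertion_by, self.negation_by
        if a is None or b is None:
            return "unattributed"
        if min(a.confidence, b.confidence) < min_confidence:
            return "attribution uncertain"
        return "self-contradiction" if a.speaker == b.speaker else "disagreement"


def bind(finding: Contradiction, segments: list) -> None:
    """(Re)bind both halves to the diarization segment covering each midpoint.

    `segments` is a list of (start_s, end_s, label, confidence) tuples.
    """
    def lookup(u: Utterance) -> Optional[Attribution]:
        mid = (u.start_s + u.end_s) / 2
        for start, end, label, conf in segments:
            if start <= mid < end:
                return Attribution(label, conf)
        return None

    finding.assertion_by = lookup(finding.assertion)
    finding.negation_by = lookup(finding.negation)
```

If the diarization is later corrected, calling bind again changes only the attribution fields. The two utterances, and therefore the finding itself, are untouched.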

This is the inverse of the architecture every transcription-first product ships. It costs us latency. It costs us simplicity. It is the reason the tool can actually be used in a setting where the output will be challenged.

What we don't do

It is probably also worth saying what Felarity is not.

We do not write a tidy summary for the project manager who missed the standup. There are good tools for that. We are not one of them, and trying to be one of them would compromise the things we are actually good at. A productivity-grade summarizer is allowed to elide, to round off, to "infer intent." A forensic tool is not. The two failure modes are incompatible.

We do not "extract action items." Action items are interpretive. The model deciding what counts as an action item is doing exactly the kind of lossy summarization that makes transcripts wrong. We surface contradictions, walk-backs, confrontation/response pairs, hedging clusters, and stress markers — phenomena that have operational definitions in the audio and the transcript, not in the model's guess about intent.

We do not autofile things to your CRM. We do not write follow-up emails on your behalf. We do not promise that the meeting will be "10x more productive." Other tools make those claims, and some deliver real value on them. They are not our claims.

Pick a tool by the question you'd ask of its outputs

If the question you are going to ask of a meeting-AI tool's output is "what did we agree to do next week," buy a productivity tool. Most of them are competent. The best ones will save your team real hours.

If the question you are going to ask is "where exactly in the audio did the witness change their answer, and can I prove the audio hasn't been touched between capture and now," buy a forensic one. There are fewer of those. We are one. We built it this way because the people who first asked us for this — investigators, compliance officers, legal teams, regulators, two journalists — could not get an answer they were willing to act on from anything else on the market.

The category test is simple. Look at the outputs. Are they prose summaries? Are they action-item bullets? Or are they findings — discrete, timestamped, attributed-with-confidence, signed, verifiable? The shape of the output tells you what kind of tool you are looking at, and what kind of question you are allowed to ask of it.

Felarity is the forensic kind. We will probably never be the productivity kind. That is the choice.

Read next. If the engineering choices in this piece are the kind of thing you want more of, the composition essay walks through how Whisper, pyannote, and DeBERTa fit together in the pipeline, and where each one is and isn't load-bearing.