Local Meeting Transcriber: Whisper, Ollama, Structured Notes

You can build a fully local meeting transcriber on Linux. Capture system audio with PipeWire. Transcribe with Faster-Whisper on your GPU. Pipe the transcript to a local LLM through Ollama for structured summaries with names, decisions, and action items. The pipeline runs on 16GB of RAM and a mid-range NVIDIA GPU, and produces notes within minutes of the call ending. No data leaves your network.

Commercial services like Otter.ai and Fireflies.ai route your audio through their servers. If your meetings cover sensitive topics like product plans, HR, or legal reviews, that’s a non-starter. A local pipeline gives you the same structured output, and nothing leaves your building.

Capturing System Audio on Linux

Clean audio is the first thing to sort out. On modern Linux distros (Fedora 41+, Ubuntu 24.04+, Arch), PipeWire is the default audio server, and it makes this easy.

The simplest tool is pw-record, which can target a specific app's audio output. You need the node ID of your meeting app; wpctl status lists every active node, with application audio under the Streams section:

wpctl status                        # note the meeting app's node ID under "Streams"
pw-record --target 55 meeting.wav   # replace 55 with that node ID

This grabs only the meeting app's audio. It ignores system notifications, music, and anything else on the machine. To capture both remote participants and your own mic in one stream, create a null sink, loop both into it, and record its monitor:

pactl load-module module-null-sink sink_name=meeting_capture
pactl load-module module-loopback source=@DEFAULT_SOURCE@ sink=meeting_capture
pactl load-module module-loopback source=@DEFAULT_SINK@.monitor sink=meeting_capture
parecord -d meeting_capture.monitor meeting.wav

Or use pw-loopback, which creates a loopback from the meeting app's output to a virtual source that a real-time transcription process can read from directly.

Audio format is important for accuracy. Whisper expects 16kHz mono WAV/PCM. If you record to a file first, convert it:

ffmpeg -i meeting_raw.wav -ar 16000 -ac 1 -f wav meeting_16k.wav

For real-time pipelines, pipe through sox to resample on the fly.

One common gotcha is Bluetooth headsets. The A2DP profile gives high-quality audio but the mic is off. You need the HFP/HSP profile, which drops audio to 16kHz but turns the mic on. PipeWire flips this for you if you set bluez5.autoswitch-profile = true in your WirePlumber config. You can also force it with bluetoothctl.

As a backup, OBS Studio with the PipeWire plugin records video and audio at the same time on Wayland. That gives you a screen capture next to the transcript if you need to check visuals later.

Real-Time Transcription with Whisper

Two Whisper implementations are worth considering for local speech-to-text: Whisper.cpp and Faster-Whisper.

[Figure: Whisper's encoder-decoder Transformer pipeline. Audio is split into 30-second chunks, converted to log-Mel spectrograms, and decoded into text tokens. Image: OpenAI Whisper, MIT License.]

Faster-Whisper (based on CTranslate2) is the better choice if you have a GPU. The large-v3-turbo model runs 5 to 8x real-time on an RTX 4060 with 8GB VRAM. So a 60-minute meeting transcribes in about 8 to 12 minutes. For short standups, results show up almost at once.

Whisper.cpp works better for CPU-only or low-VRAM boxes. The medium.en model (1.5GB) runs 2 to 3x real-time on a Ryzen 7 7700X with 8 threads. Accuracy is fine for English-only meetings with clear audio.

Here is a practical model selection guide:

Model            Parameters   VRAM     Speed (RTX 4060)   Best For
large-v3-turbo   809M         ~6GB     5-8x real-time     Maximum accuracy, multilingual
medium.en        769M         ~3GB     10-15x real-time   English-only, balanced
small.en         244M         ~1.5GB   20-30x real-time   Fast results, clear audio

Word Error Rate (WER) rises by roughly five points with each step down the table. Pick based on your audio quality and the language you need.

Always turn on Voice Activity Detection (VAD). Faster-Whisper ships with Silero VAD. It skips silent segments, and cuts processing time by 30 to 50% for meetings with long pauses or muted spans:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "meeting_16k.wav",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

For real-time streaming, chunk audio into 30-second segments with 5-second overlap. Transcribe each chunk as it lands and merge the results. The overlap stops words from being cut at chunk edges.
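
The merge logic is the fiddly part. Here is a minimal sketch of the chunk-and-merge loop, assuming a prerecorded 16kHz mono file and the model object from above (a live pipeline would feed the same float32 arrays from a pipe instead):

import soundfile as sf
from faster_whisper import WhisperModel

SR, CHUNK_S, OVERLAP_S = 16000, 30, 5

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
audio, _ = sf.read("meeting_16k.wav", dtype="float32")

merged = []
step = (CHUNK_S - OVERLAP_S) * SR
for start in range(0, len(audio), step):
    chunk = audio[start : start + CHUNK_S * SR]
    segments, _ = model.transcribe(chunk, vad_filter=True)
    for seg in segments:
        abs_start = start / SR + seg.start
        # Drop segments the previous chunk's overlap already covered
        if merged and abs_start < merged[-1][1] - 0.1:
            continue
        merged.append((abs_start, start / SR + seg.end, seg.text.strip()))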

Speaker Diarization

A transcript without speaker labels is hard to follow. pyannote-audio v3.3 pairs with Whisper to tag who said what. It outputs timestamped segments labeled SPEAKER_00, SPEAKER_01, and so on. You map these to real names in post-processing. Do it by hand, or match them against a voice profile database you build over time.

import torch
from pyannote.audio import Pipeline

diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_hf_token"
)
diarization.to(torch.device("cuda"))  # pipelines run on CPU by default
result = diarization("meeting_16k.wav")

for turn, _, speaker in result.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s - {turn.end:.1f}s] {speaker}")

Note that pyannote requires a HuggingFace token and acceptance of the model's gating terms. The diarization step adds about 1 to 2 minutes for a 60-minute recording on GPU.
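
Merging the two outputs is an interval-overlap pass: give each Whisper segment the speaker whose turn overlaps it most. A minimal sketch, assuming whisper_segments is the materialized list from model.transcribe() and label_segments is a hypothetical helper:

def label_segments(whisper_segments, diarization):
    """Tag each Whisper segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in whisper_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg.end, turn.end) - max(seg.start, turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg.start, seg.end, best_speaker, seg.text))
    return labeled

for start, end, speaker, text in label_segments(whisper_segments, result):
    print(f"[{start:.1f}s] {speaker}: {text}")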

Summarization with a Local LLM

A raw transcript works as a reference. But scrolling 12,000 words to find one decision is painful. Feed the transcript to a local LLM and you get structured notes with decisions and action items pulled out for you.

Ollama serves local models with almost no setup. For meeting summaries, Llama 3.3 70B with Q4_K_M quantization (about 40GB RAM) gives the best results on dense technical discussion. If your box is tighter, Llama 3.1 8B with Q8_0 (about 9GB RAM) still gives solid summaries for routine meetings.

Summary quality leans heavily on the system prompt. A structured prompt that pins the output format works best:

You are a meeting note assistant. Given the following transcript,
produce a JSON object with these keys:
- title (string): descriptive meeting title
- date (string): ISO date
- attendees (list): participant names from the transcript
- summary (string): 3-5 sentence overview
- key_decisions (list): decisions made during the meeting
- action_items (list of objects): each with assignee, task, deadline
- follow_ups (list): items requiring future discussion

For models with big context windows (Llama 3.3 supports 128K tokens), you can pass the full transcript in one prompt. A 90-minute meeting runs about 15,000 to 20,000 tokens, which fits well inside the window. For older or smaller models, chunk the transcript into 4,000-token segments with 500-token overlap. Summarize each chunk on its own, then run a “summary of summaries” pass.
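
A sketch of the chunked path, where summarize() stands in for the Ollama call shown in the next section and whitespace-split words approximate tokens:

def chunk_words(text, size=4000, overlap=500):
    # Word count is a rough proxy for tokens; a real tokenizer is more precise
    words = text.split()
    step = size - overlap
    return [" ".join(words[i : i + size]) for i in range(0, len(words), step)]

partials = [summarize(chunk) for chunk in chunk_words(transcript_text)]
final_notes = summarize("\n\n".join(partials))  # summary-of-summaries pass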

Use the Instructor library with Ollama’s OpenAI-compatible endpoint to check LLM output against a Pydantic schema. It catches bad JSON and retries on its own:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class ActionItem(BaseModel):
    assignee: str
    task: str
    deadline: str | None

class MeetingNotes(BaseModel):
    title: str
    summary: str
    key_decisions: list[str]
    action_items: list[ActionItem]

client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,  # JSON mode is more reliable than tool calls against Ollama
)

notes = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript_text}
    ],
    response_model=MeetingNotes,
    temperature=0.1,
    top_p=0.9
)

Keep temperature at 0.1 and top_p at 0.9 for summaries. You want stable, factual output, not creative writing. Also leave repeat_penalty at its default. Meeting transcripts have repeated phrases by nature, and a penalty makes the model drop real content.

Building the End-to-End Pipeline

With each piece working on its own, the next step is to wire them into one script. You start it at the top of a meeting and forget it until notes show up later.

A Python script handles the full flow:

#!/usr/bin/env python3
"""meeting_notes.py - Local meeting transcription and summarization."""

import subprocess
import signal
import sys
from pathlib import Path
from datetime import datetime

NOTES_DIR = Path.home() / "meeting-notes"
NOTES_DIR.mkdir(exist_ok=True)

class MeetingRecorder:
    def __init__(self):
        self.recording_process = None
        self.audio_file = NOTES_DIR / f"raw_{datetime.now():%Y%m%d_%H%M}.wav"

    def start_recording(self, target_node=None):
        cmd = ["pw-record", "--rate", "16000", "--channels", "1"]
        if target_node:
            cmd.extend(["--target", str(target_node)])
        cmd.append(str(self.audio_file))
        self.recording_process = subprocess.Popen(cmd)
        subprocess.run(["notify-send", "Meeting Recorder", "Recording started"])

    def stop_recording(self):
        if self.recording_process:
            self.recording_process.send_signal(signal.SIGINT)
            self.recording_process.wait()
            subprocess.run(["notify-send", "Meeting Recorder", "Recording stopped"])

    def transcribe(self):
        # Run Faster-Whisper transcription
        ...

    def summarize(self, transcript):
        # Call local LLM via Ollama
        ...

    def save_notes(self, notes):
        slug = notes.title.lower().replace(" ", "-")[:50]
        output = NOTES_DIR / f"{datetime.now():%Y-%m-%d}-{slug}.md"
        # Write Markdown with YAML frontmatter
        ...

The script captures audio in a subprocess. When you press Ctrl+C (or send SIGINT), it stops recording, runs transcription, runs diarization, calls the LLM, and writes the Markdown file. The output file uses {date}-{title-slug}.md with YAML frontmatter that holds all the structured metadata.
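
The frontmatter write itself is short. A sketch of save_notes, assuming the MeetingNotes Pydantic model from the summarization section and PyYAML installed:

import yaml

def save_notes(self, notes):
    slug = notes.title.lower().replace(" ", "-")[:50]
    output = NOTES_DIR / f"{datetime.now():%Y-%m-%d}-{slug}.md"
    # Pydantic v2 model_dump() returns a plain dict that YAML can serialize
    frontmatter = yaml.safe_dump(notes.model_dump(), sort_keys=False)
    output.write_text(f"---\n{frontmatter}---\n\n# {notes.title}\n\n{notes.summary}\n")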

Desktop Integration

For daily use, make a .desktop file and bind it to a key like Super+M. Use a PID file to toggle recording on and off:

#!/bin/bash
PIDFILE="/tmp/meeting-recorder.pid"
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    kill -INT "$(cat "$PIDFILE")"
    rm "$PIDFILE"
else
    python3 ~/scripts/meeting_notes.py &
    echo $! > "$PIDFILE"
fi

A notify-send call confirms start and stop, so you get feedback without switching windows.

Systemd Service for Automatic Recording

For a hands-off setup, create a systemd user service that watches PipeWire for new audio streams from meeting apps:

[Unit]
Description=Meeting Transcriber Auto-Detect

[Service]
ExecStart=/usr/bin/python3 %h/scripts/meeting_monitor.py
Restart=always

[Install]
WantedBy=default.target

The monitor script watches for PipeWire nodes from Zoom, Teams, or Google Meet. It starts recording the moment it sees one. When the node goes away (meeting ends), it kicks off the transcription and summary pipeline.
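
A workable monitor polls pw-dump, which prints every PipeWire object as JSON, including each stream's application name. A minimal sketch, reusing the MeetingRecorder class from the pipeline script; the app-name substrings are examples to tune for your own clients:

import json
import subprocess
import time

MEETING_APPS = ("zoom", "teams", "meet")  # lowercase substrings to match

def meeting_stream_active():
    dump = json.loads(subprocess.run(["pw-dump"], capture_output=True, text=True).stdout)
    for obj in dump:
        props = (obj.get("info") or {}).get("props") or {}
        app = str(props.get("application.name", "")).lower()
        if props.get("media.class") == "Stream/Output/Audio" and any(m in app for m in MEETING_APPS):
            return True
    return False

recorder, recording = MeetingRecorder(), False
while True:
    active = meeting_stream_active()
    if active and not recording:
        recorder.start_recording()
        recording = True
    elif not active and recording:
        recorder.stop_recording()  # then run transcription and summarization
        recording = False
    time.sleep(5)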

Resource Management

Watch GPU VRAM during transcription with nvidia-smi. If VRAM runs low, the pipeline should fall back from large-v3-turbo to medium.en to dodge out-of-memory crashes. A quick check before loading the model:

import subprocess

def get_free_vram_mb():
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True
    )
    # nvidia-smi prints one line per GPU; take the first
    return int(result.stdout.strip().splitlines()[0])

model_name = "large-v3-turbo" if get_free_vram_mb() > 6000 else "medium.en"

Accuracy Optimization and Troubleshooting

Default Whisper settings work fine on clean audio. But meeting audio is rarely clean. Background noise, crosstalk, mixed mic quality, and tech jargon all hurt accuracy. A few tweaks go a long way.

The biggest win comes from seeding domain vocab. Pass an initial prompt with the technical terms your meetings use:

segments, info = model.transcribe(
    "meeting.wav",
    initial_prompt="Meeting about Kubernetes deployment, ArgoCD, Helm charts, gRPC services, PostgreSQL replication"
)

This biases Whisper toward those terms and keeps it from substituting close-sounding but wrong words.

Also set the language with language="en" (or your target language) instead of auto-detect. Auto-detect burns compute on the first 30 seconds of audio. It also misreads the language when speakers use jargon or code-switch for a moment.
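
Both tweaks go in the same transcribe call, reusing the model object from earlier:

segments, info = model.transcribe(
    "meeting_16k.wav",
    language="en",                   # skip auto-detect
    initial_prompt="Kubernetes, ArgoCD, Helm charts, gRPC, PostgreSQL",
    vad_filter=True,
)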

If your meetings have loud background noise, clean it up before transcription with the noisereduce Python library:

import noisereduce as nr
import soundfile as sf

audio, sr = sf.read("meeting_raw.wav")
reduced = nr.reduce_noise(y=audio, sr=sr)
sf.write("meeting_clean.wav", reduced, sr)

For meetings with echo (common on speakerphone), speexdsp does echo cancellation. It cleans up the audio before Whisper sees it.

A few common issues and how to handle them:

  • Hallucinated repeating phrases during silence: turn on VAD filtering. Without it, Whisper fills silent gaps with looped phrases or phantom text.
  • Garbled names and acronyms: add them to initial_prompt. Whisper handles “John” fine but struggles with “Janek” or “RBAC” without hints.
  • Missing punctuation: use large-v3-turbo. It has much better punctuation than smaller models. Or run a small punctuation restoration model as a post step.
  • Timestamp drift: cut chunk overlap if you use a streaming pipeline. Too much overlap causes duplicate segments that shift timestamps.

Benchmark your setup before you trust it for real meetings. Transcribe a known recording and compare it to a manual transcript. Compute Word Error Rate (WER) with the jiwer library. Aim for less than 10% WER on clean audio and less than 15% on noisy multi-speaker audio. If you’re above those numbers, work through the steps above until you hit the target.
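
The WER check itself is a few lines with jiwer, assuming both transcripts are saved as plain text:

from jiwer import wer

reference = open("manual_transcript.txt").read()
hypothesis = open("whisper_transcript.txt").read()
print(f"WER: {wer(reference, hypothesis):.1%}")  # target: <10% clean, <15% noisy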

Where to Go from Here

With PipeWire audio capture, Faster-Whisper transcription, pyannote diarization, and Ollama-served Llama summaries wired together, you have a full local stand-in for commercial transcription services. A 60-minute meeting yields structured notes within 15 minutes on mid-range hardware. Nothing leaves your machine.

Once the basic pipeline works, there are handy extensions. Pipe action items to a task manager like Vikunja via its API. Post the summary to a Slack channel via webhook. Or drop the Markdown file into an Obsidian vault so it joins your searchable knowledge base. If you have recurring meetings like standups or sprint reviews, the voice profile database you build for pyannote gets better over time, and a name-mapping dictionary can auto-label speakers without manual work.

Setup takes a few hours and a box with a decent GPU. After that, every meeting you take yields private, structured notes on its own.