Generating Court Transcriptions with Deepgram
Let's Build An AI Court Reporter with Nova-3
In the courtroom, every word matters. Transcripts form the official record of legal proceedings, capturing exactly what was said, by whom, and when. Traditionally, this task falls to highly trained court reporters, experts who transcribe speech in real time using specialized shorthand machines. But with rapid advances in speech recognition technology, one question naturally arises: can modern Speech-to-Text models shoulder some of this responsibility?
Classic Speech-to-Text systems simply convert spoken words into plain text. Today’s models, however, such as Deepgram’s Nova-3 and AssemblyAI’s Universal-2, go far beyond basic transcription.
With features like speaker diarization and timestamps, these models can produce metadata-rich transcripts that mimic the structure, clarity, and reliability of a human court reporter.
In this article, we’ll build an AI Court Reporter using Deepgram’s Nova-3 model, and explore:
How metadata transforms raw transcripts into legally useful records
How to work with the Deepgram API in Python
What does it take to build an AI Court Reporter?
Court proceedings, whether a trial, hearing, or deposition, are among the most critical processes in the legal system. During these events, a court reporter is responsible for capturing every spoken word verbatim. Skilled human reporters not only achieve high accuracy rates but also understand complex legal jargon.
Their role goes beyond simple transcription. They must record who spoke, what was said, and when it was said. To replicate this with an AI system, we need a Speech-to-Text model capable of three essential things:
Speaker diarization: Automatically identifies and labels different speakers in a conversation (e.g., Speaker 0, Speaker 1). This is crucial in courtroom settings where multiple participants, such as judges, lawyers, and witnesses, speak in turn.
Timestamps: Tags each word or sentence with its start and end times. This allows precise alignment with the original audio, enabling features like searchable playback, real-time synchronization, and legally verifiable transcripts (a sketch of this word-level metadata follows the list).
Low Word Error Rate (WER): Ensures the AI produces highly accurate transcripts that can be trusted in legal contexts, reducing the risk of misinterpretation or misquotation.
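To make this concrete, here is a rough sketch of the word-level metadata a diarization-capable model returns. The field names follow Deepgram's response format; the values are illustrative placeholders, not output from a real recording:

# Illustrative word entry (placeholder values, not real model output)
word_entry = {
    "word": "objection",
    "punctuated_word": "Objection,",
    "start": 12.34,      # seconds from the start of the audio
    "end": 12.81,
    "confidence": 0.97,
    "speaker": 1,        # diarization label, i.e. Speaker 1
}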
Building an AI Court Reporter with Deepgram
Now that we understand the key capabilities required for an AI court reporter, let’s put them into action using Deepgram’s Nova-3 model, which supports both speaker diarization and timestamps.
We’ll build a simple Python CLI tool that takes either a local audio file or a URL to a court proceeding, then transcribes the audio while labeling each speaker and attaching precise timestamps.
Step 1: Install Dependencies
Before we start coding, make sure you have the necessary Python packages installed. We’ll use the Deepgram SDK for transcription and Rich to create a clean, formatted terminal output:
pip install deepgram-sdk rich
Step 2: Import Required Libraries
Next, we’ll import all the libraries needed for our script. We’ll use argparse to handle command-line arguments, the Deepgram SDK classes to interact with the Deepgram API, and components from Rich to format and display results in the terminal.
import argparse
import json
from itertools import groupby
from operator import itemgetter
from pathlib import Path
from deepgram import (
    DeepgramClient,
    PrerecordedOptions,
    FileSource,
)
from rich.console import Console
from rich.table import Table
from rich.progress import Progress
Step 3: Define the Transcription Function
We’ll now create a function that sends audio to Deepgram for transcription. This function accepts either a local file path or a URL, then uses the Nova-3 model with speaker diarization and timestamps enabled. It returns Deepgram’s full response object, which we’ll format in the next step.
def transcribe_audio(audio_path, api_key):
    """Send a local audio file or a URL to Deepgram and return the full response."""
    dg = DeepgramClient(api_key)
    opts = PrerecordedOptions(
        model="nova-3",
        language="en",
        smart_format=True,  # adds punctuation and formatting to the transcript
        diarize=True,       # labels each word with a speaker index
    )
    if audio_path.startswith("http://") or audio_path.startswith("https://"):
        # Remote audio: Deepgram fetches the file from the URL
        source = {"url": audio_path}
        res = dg.listen.rest.v("1").transcribe_url(source, opts)
    else:
        # Local audio: read the file and upload its bytes
        with open(audio_path, "rb") as f:
            payload: FileSource = {"buffer": f.read()}
            res = dg.listen.rest.v("1").transcribe_file(payload, opts)
    return res
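As a quick sanity check, you can call the function directly from a Python shell. The file name below is a placeholder; use any recording you have on hand along with your own API key:

# Hypothetical quick test (sample.mp3 is a placeholder file name)
res = transcribe_audio("sample.mp3", "YOUR_DEEPGRAM_API_KEY")
# Print the plain transcript before any diarization formatting
print(res.results.channels[0].alternatives[0].transcript)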
Step 4: Format the Transcription
Once we receive the raw output from Deepgram, we use the build_diarized_transcript function to organize it into a clean, readable format. This involves grouping words by speaker, extracting their start and end timestamps, and combining them into speaker-specific segments.
def build_diarized_transcript(res):
    """Group consecutive words by speaker into timestamped segments."""
    words = res.results.channels[0].alternatives[0].words
    diarized_segments = []
    # groupby merges consecutive words from the same speaker into one turn
    for speaker, group in groupby(words, key=itemgetter("speaker")):
        group = list(group)
        start = group[0]["start"]
        end = group[-1]["end"]
        text = " ".join([w["punctuated_word"] for w in group])
        diarized_segments.append({
            "speaker": f"Speaker {speaker}",
            "start": start,
            "end": end,
            "text": text,
        })
    return diarized_segments
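The function returns a list of dictionaries, one per speaker turn. Illustratively (placeholder values, not real output):

# Example shape of the returned segments (placeholder values)
[
    {"speaker": "Speaker 0", "start": 0.08, "end": 3.52, "text": "Please state your name for the record."},
    {"speaker": "Speaker 1", "start": 3.80, "end": 5.10, "text": "John Doe."},
]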
Step 5: Display the Transcripts
We’ll use Rich to display the diarized segments in a neatly formatted table for better readability:
def print_diarized_table(diarized_segments):
    table = Table(show_header=True, header_style="bold magenta")
    table.add_column("Start", style="cyan")
    table.add_column("End", style="cyan")
    table.add_column("Speaker", style="green")
    table.add_column("Text", style="white", overflow="fold")
    for seg in diarized_segments:
        start_time = f"{seg['start']:.2f}s"
        end_time = f"{seg['end']:.2f}s"
        table.add_row(start_time, end_time, seg["speaker"], seg["text"])
    console = Console()
    console.print("\n[bold underline]Diarized Transcript[/bold underline]\n")
    console.print(table)
Step 6: Run the Program
Let’s make the script runnable so we can execute it directly from the command line. We’ll use argparse to accept the audio file path, the Deepgram API key, and an optional output path for saving the raw Deepgram JSON.
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Transcribe audio with Deepgram Nova-3 and diarization.")
    parser.add_argument("audio", help="Path or URL to the audio file")
    parser.add_argument("--api_key", required=True, help="Deepgram API key")
    parser.add_argument("--save_json", help="Optional path to save raw Deepgram JSON output")
    args = parser.parse_args()

    console = Console()
    with Progress() as progress:
        task = progress.add_task("[cyan]Transcribing audio...", total=None)
        res = transcribe_audio(args.audio, args.api_key)
        progress.update(task, completed=1)

    if args.save_json:
        with open(args.save_json, "w") as f:
            json.dump(res.to_dict(), f, indent=2)

    diarized_segments = build_diarized_transcript(res)
    print_diarized_table(diarized_segments)
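Because the script accepts a URL as well as a local path, you can also point it at a hosted recording and keep the raw Deepgram JSON for later auditing. The URL below is a placeholder:

python main.py "https://example.com/hearing_audio.mp3" --api_key YOUR_DEEPGRAM_API_KEY --save_json transcript.json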
Testing the AI Court Reporter
To test the AI Court Reporter, we will use a clip from Better Call Saul and see how well the model captures timestamps and labels speakers.
All we have to do is get an audio version of the clip and our Deepgram API key, then pass them to the application like this:
python main.py "Better_Call_Saul.mp3" --api_key YOUR_DEEPGRAM_API_KEY
This prints a diarized transcript table in the terminal, with each row showing a segment's start and end times, the speaker label, and the spoken text.
You can get the source code on GitHub: Neurl-LLC/Court-Transcripts-With-Nova
Drawbacks of Using AI for Court Reporting
AI-based transcription is not without limitations. Some of the most common challenges include:
Overlapping speech: When multiple speakers talk at the same time, AI often struggles to separate their voices accurately.
Specialized legal terminology: Court proceedings often contain Latin phrases, case law references, and technical legal terms that speech-to-text models may not recognize without domain-specific training (see the sketch after this list).
Contextual ambiguity: AI lacks the human judgment to interpret sarcasm, implied meaning, or nuanced tone shifts, which can sometimes be relevant in court.
Legal restrictions: Certain courts do not allow digital devices or audio recording, making AI transcription impossible in those settings.
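One practical mitigation for the terminology problem is keyterm prompting, a Nova-3 feature that biases the model toward phrases you supply. A minimal sketch, assuming your version of the Deepgram SDK exposes the keyterm option (the legal terms here are just examples):

# Sketch: bias Nova-3 toward legal phrases via keyterm prompting
# (assumes the installed deepgram-sdk supports the `keyterm` option)
opts = PrerecordedOptions(
    model="nova-3",
    smart_format=True,
    diarize=True,
    keyterm=["voir dire", "habeas corpus", "res judicata"],
)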
For now, these drawbacks mean AI is best used as a supplementary court reporter, working alongside humans to improve efficiency and accessibility without replacing the need for human oversight.
Conclusion
By using speech-to-text models that provide extra metadata such as timestamps and diarization, we unlock untapped potential not only in the legal industry but in any field that requires high-quality text data beyond plain transcription. When combined with other AI technologies, this enables capabilities such as:
Automated redaction using language models
Speaker identification through voice recognition
Advanced search systems for fast, precise retrieval
In the legal industry, these capabilities can streamline case preparation, improve evidence review, and ensure greater accuracy in legal documentation.