Preface — who this is for
You have a recording that sounds bad. There’s a hiss behind the voice, or a low hum, or it crackles, or someone’s “s” sounds slice your ears. You’ve heard that software can “fix” this, and maybe you’ve seen people open intimidating programs full of knobs and coloured graphs and somehow make it better. You’d like to understand what’s actually happening — without going back to school for it.
This book is for you. It assumes you know nothing about audio software (a “DAW”) or signal processing (“DSP”). There is no maths you need to follow. Every idea is explained the way you’d explain it to a curious friend, with pictures-in-words and everyday analogies.
It’s organised around a small open-source toolkit called cathar — a program that cleans up audio. Cathar is a good teaching companion for two reasons. First, it does one clear job per tool: one button removes hiss, another removes hum, another rebuilds clipped sound, and so on — so each chapter can be about one honest idea. Second, cathar is transparent: every step it takes is plain, inspectable arithmetic rather than a secret black box, which means we can actually say what it’s doing.
But this is not a manual for cathar. It’s a book about the concepts that all audio-restoration tools share. Whether you end up using cathar, or iZotope RX, or Adobe Audition, or the noise-reduction button in free Audacity, the underlying ideas are the same — and once you understand the ideas, every one of those tools stops being mysterious. So at the end of most chapters there’s a short section called “How the big tools do it” that lines cathar up against the professional and free software the rest of the world uses, and tells you, honestly, where cathar is comparable and where the expensive tools pull ahead.
A note on honesty: audio restoration is repair, not magic. Some damage can be made nearly invisible; some can only be softened; and a few things, once lost, are gone for good. A good engineer knows the difference, and by the end of this book so will you.
Let’s start with the most basic question of all: what is a sound, once it’s inside a computer?
What digital sound actually is
Sound is wiggling air
When something makes a sound — a voice, a guitar string, a slammed door — it pushes the air next to it, which pushes the air next to that, and so on, until a little wave of pressure reaches your eardrum and wiggles it. Your brain reads that wiggle as sound. That’s all sound is: changing air pressure over time.
If you could draw the pressure at your ear from one moment to the next, you’d get a wavy line: up when the air is squeezed, down when it’s thinned out. That wavy line is called a waveform, and it’s the single most important picture in this whole book. Loud sounds make tall wiggles; quiet sounds make small ones. Fast wiggles are high-pitched; slow wiggles are low-pitched.
Turning the wiggle into numbers
A computer can’t store a smooth wiggly line directly. Instead it does something clever and slightly brutal: many thousands of times per second, it measures how high the wave is right now and writes that height down as a number. Then it throws away everything in between.
Each measurement is called a sample. Think of it like a flipbook: a cartoon isn’t really moving, it’s just a stack of still drawings shown fast enough to fool your eye. Digital audio is the same trick for sound — a stack of still “heights,” played back fast enough to fool your ear.
height
+1 ┤ ● ● loud = tall wiggles
│ ● ● ● ● quiet = small wiggles
0 ┼──●───────────●─────●───────────●─────────► time fast = high pitch
│ ● ● slow = low pitch
-1 ┤ ● ● ●
└ each ● is ONE sample: a single measured height.
Join the dots and you get the "waveform".
Two numbers describe how finely the computer captured the sound:
- Sample rate: how many measurements per second. CD audio uses 44,100 per second (written 44.1 kHz); video and pro audio often use 48,000. The more samples per second, the higher the pitches you can capture. (There’s a famous rule: to capture a pitch, you need more than twice as many samples per second as the pitch’s frequency, which is why 44,100 samples per second reaches up to about 22,000 cycles per second. We’ll meet it again in the resampling chapter.)
- Bit depth: how finely each single measurement is written down (16 bits per sample for CDs, 24 bits for studios). More bits means a quieter “noise floor”, the faint background fuzz that any digital measurement carries.
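For readers who like to see the sums, the “twice as many samples” rule can be checked in a few lines of Python. This is only an illustrative sketch (the helper name `sample_tone` is made up here and is not part of cathar). It samples a 30,000-cycle tone at the CD rate and compares it with a 14,100-cycle tone. Because 30,000 is above half of 44,100, the two sets of measurements come out identical apart from being flipped upside down.

```python
import math

SAMPLE_RATE = 44_100  # CD rate, samples per second

def sample_tone(freq_hz, count, rate=SAMPLE_RATE):
    """Measure a pure tone `count` times, `rate` times per second."""
    return [math.sin(2 * math.pi * freq_hz * n / rate) for n in range(count)]

too_high = sample_tone(30_000, 200)                # above rate / 2 = 22,050
impostor = sample_tone(SAMPLE_RATE - 30_000, 200)  # 14,100 cycles per second

# Flip one list upside down and the two match sample for sample,
# so after sampling nobody can tell the two pitches apart.
worst_gap = max(abs(a + b) for a, b in zip(too_high, impostor))
print(f"largest difference: {worst_gap:.2e}")
```

The tiny number it prints is just rounding error. Once the heights are written down, the too-high tone has disguised itself as a lower one, which is why recorders filter out pitches above the limit before sampling.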
Inside cathar (and most modern tools), every sample is stored as a floating-point number between −1.0 and +1.0. −1.0 is the lowest the wave can go, +1.0 the highest, and 0.0 is silence (no pressure change). A whole second of mono CD audio is therefore just a list of 44,100 such numbers. A stereo recording is two such lists, one for the left ear and one for the right.
Why this matters for cleaning up sound
Every restoration tool in this book is, underneath, just arithmetic on that list of numbers. Removing hiss means nudging the numbers; removing a click means replacing a few of them; making something louder means multiplying them all. There is nothing else in the file. When cathar “denoises an interview,” it reads the list, does sums on it, and writes a new list. The art is entirely in which sums, and why.
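If you are curious what “a list of numbers” and “multiplying them all” look like in practice, here is a minimal Python sketch. The names `tone` and `change_level` are invented for illustration; nothing here is cathar’s own code.

```python
import math

SAMPLE_RATE = 44_100

# One second of a steady 440-cycle tone at half of full height:
# a plain list of 44,100 numbers between -1.0 and +1.0.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]

def change_level(samples, factor):
    """Make a sound louder (factor > 1) or quieter (factor < 1)."""
    return [s * factor for s in samples]

quieter = change_level(tone, 0.5)
print(len(tone), max(tone), max(quieter))
```

Every tool later in the book is a more thoughtful version of `change_level`: a function that takes one list of heights and returns another.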
There is a catch, though, and it sends us straight to the next chapter. Looking at the raw list of heights — the waveform — is a great way to see how loud something is from moment to moment, but a terrible way to see what’s in it. A hiss and a voice and a hum are all jumbled together in the same wiggly line, like three colours of paint stirred into one bucket. To pull them apart, we need a second way of looking at sound.
The two ways to look at sound
This is the most important chapter in the book. Once it clicks, almost every tool in every audio program will suddenly make sense.
The problem with the waveform
The waveform — that wiggly line of heights — tells you when things are loud, but not what they are. A hiss, a hum, a voice and a cymbal are all stirred into the same line. Trying to remove the hiss by editing the waveform is like trying to remove the salt from a soup with a fork.
What you really want is to separate the sound by pitch: put all the low rumble in one pile, the mid-range voice in another, the high hiss in a third. Then you could lower the hiss pile without touching the voice pile. That second view exists, and it’s called the frequency view, or the spectrum.
Splitting sound into pure tones
Here’s the deep idea, discovered by a mathematician named Fourier two centuries ago: any sound, however complicated, can be rebuilt by adding together a bunch of simple, pure tones — like the steady note of a tuning fork — each at its own pitch and its own loudness.
So a voice isn’t one thing; it’s a recipe: “a little bit of this low tone, a lot of this mid tone, a touch of that high tone…” Hiss is its own recipe: “a tiny, even sprinkle of every high tone at once.” A 60-cycle hum is the simplest recipe of all: “one specific low tone, and nothing else.”
The machine that takes a chunk of sound and reads off its recipe — how much of each pitch is present — is the Fourier transform, and the fast version every program uses is the FFT (Fast Fourier Transform). You will see “FFT” in the settings of every serious audio tool. Now you know what it means: split this sound into its ingredient pitches.
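To make “reading off the recipe” concrete, here is a small sketch that does it the slow, textbook way in plain Python. It is an assumption-laden toy, not how cathar or any real tool is written: the function name `recipe` is made up, and a real program would call a fast FFT routine that gives the same answer far more quickly. The test sound is a mix of a strong 200-cycle tone and a weaker 1,000-cycle tone.

```python
import cmath
import math

def recipe(samples):
    """Return the strength of each ingredient pitch in `samples`.

    This is the slow, textbook Fourier transform; an FFT gives the same
    answer much faster.
    """
    n = len(samples)
    strengths = []
    for k in range(n // 2):
        total = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        strengths.append(2 * abs(total) / n)
    return strengths

RATE = 8_000   # samples per second (kept small so this runs quickly)
COUNT = 400    # 400 samples -> one entry every 8000 / 400 = 20 cycles per second
chunk = [1.0 * math.sin(2 * math.pi * 200 * i / RATE)
         + 0.3 * math.sin(2 * math.pi * 1000 * i / RATE)
         for i in range(COUNT)]

strengths = recipe(chunk)
loudest = sorted(range(len(strengths)),
                 key=lambda k: strengths[k], reverse=True)[:2]
print([(k * RATE // COUNT, round(strengths[k], 2)) for k in loudest])
```

The two loudest entries land on the two pitches that went into the mix, with their original strengths, even though the waveform itself looks like one tangled wiggle.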
The spectrogram: the picture you’ll actually see
A single FFT reads the recipe of one short moment. But sounds change — a voice moves from word to word. So tools chop the audio into many short, overlapping slices (a few hundredths of a second each), take the FFT of every slice, and stack the results side by side. This sliding-window approach has a name — the Short-Time Fourier Transform, or STFT — and its picture is the spectrogram.
A spectrogram is a heat-map of sound: time runs left-to-right, pitch runs bottom (low) to top (high), and brightness shows how much of each pitch is present at each moment. On a spectrogram:
- A hum is a steady horizontal line low down — one pitch, always there.
- A voice is a shifting stack of bands in the middle that wobble as words change.
- Hiss is a faint, even haze across the entire top.
- A click is a thin vertical streak — a single instant where every pitch flares at once.
pitch
high ┤ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ← hiss: faint haze, everywhere, always
│ ░░░░░░░░░░░ █ ░░░░░░░░░░░░░░░░░░░
mid ┤ ▓▓▓▓ ▓▓▓ █ ▓▓▓▓▓ ▓▓▓▓▓▓ ← voice: bright bands that move
│ ▓▓▓ ▓▓▓▓ █ ▓▓▓ ▓▓▓ with the words
low ┤ ━━━━━━━━━━━━█━━━━━━━━━━━━━━━━━━━ ← hum: one steady line, low down
└──────────────────────────────► time
↑ click: a vertical streak (one instant, all pitches)
Suddenly the soup is unstirred. The hiss, the hum, the voice and the click are in visibly different places. That is why nearly every restoration tool works in the frequency view: in the spectrogram, the problem and the wanted sound usually sit in different spots, so you can lower one without harming the other.
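The claim that a hum is “one pitch” while a click is “every pitch at once” can be checked directly. The sketch below is illustrative only (the function name `pitch_strengths` and the 64-sample slice are arbitrary choices for the example). It takes the slow textbook Fourier transform of two tiny slices: one holding a steady pure tone, one holding silence with a single sample knocked out of line.

```python
import cmath
import math

def pitch_strengths(frame):
    """Slow textbook Fourier transform of one short slice."""
    n = len(frame)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(frame)))
            for k in range(n // 2)]

SIZE = 64

# A hum-like slice: one steady pure tone (4 full cycles in the slice).
hum = [math.sin(2 * math.pi * 4 * i / SIZE) for i in range(SIZE)]

# A click-like slice: silence with a single sample knocked out of line.
click = [0.0] * SIZE
click[20] = 1.0

hum_view = pitch_strengths(hum)
click_view = pitch_strengths(click)

# The hum piles up at one pitch; the click is spread evenly over all of them.
print("hum peak at entry", hum_view.index(max(hum_view)))
print("click spread:", round(min(click_view), 3), "to", round(max(click_view), 3))
```

Stack many such slices side by side and the hum becomes the horizontal line and the click the vertical streak in the picture above.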
The loop almost everything uses
Most of cathar’s “reduce” tools, and the equivalent tools in every professional program, run the same three-step loop, over and over, on each short slice:
- Analyse — FFT the slice into its recipe of pitches.
- Modify — turn down (or rebuild) the parts you don’t want.
- Resynthesise — add the pitches back together into a cleaned slice, and glue the slices back into a waveform (a careful blend called overlap-add).
flowchart LR
A["one short slice<br/>of the waveform"] --> B["FFT<br/><i>analyse</i><br/>split into pitches"]
B --> C["<b>modify</b><br/>turn down / rebuild<br/>the unwanted pitches"]
C --> D["inverse FFT<br/><i>resynthesise</i><br/>add pitches back"]
D --> E["overlap-add<br/>→ cleaned waveform"]
Cathar uses a 2,048-sample slice with a 75% overlap between neighbours and a gentle taper (a Hann window) so the joins are seamless. Those exact numbers don’t matter to you; the shape of the idea does. Analyse → modify → resynthesise. Hold onto it. Every “de-noise / de-hum / de-reverb / de-ess” tool in this book is a different idea for that middle modify step. The rest is plumbing.
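Why do those particular choices make the joins seamless? A short sketch shows the reason. It assumes the “periodic” form of the Hann taper, which is a common choice but not something the text above confirms for cathar. It lays eight tapered 2,048-sample slices over a stretch of audio, each starting 512 samples after the last (75% overlap), and adds up how much total weight each sample receives.

```python
import math

SLICE = 2048       # samples per slice
HOP = SLICE // 4   # 75% overlap: each slice starts 512 samples after the last

# Periodic Hann taper: fades in from 0, peaks at 1, fades back out.
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / SLICE) for n in range(SLICE)]

# Lay eight tapered slices over a stretch of audio and add up the tapers.
length = SLICE + 7 * HOP
coverage = [0.0] * length
for start in range(0, length - SLICE + 1, HOP):
    for n in range(SLICE):
        coverage[start + n] += hann[n]

# Away from the very ends, every sample is covered by the same total weight,
# so dividing by that constant glues the slices back together with no seams.
middle = coverage[SLICE - HOP : length - SLICE + HOP]
print(min(middle), max(middle))
```

Both printed numbers come out at 2.0 (give or take rounding), so every sample in the middle gets exactly the same total weight. That constant coverage is what lets overlap-add reassemble the slices without audible seams.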
How the big tools do it
Every audio program you’ve heard of lives in this same two-view world. The spectrogram in iZotope RX — the industry-standard restoration suite — is the centrepiece of its interface; you literally paint on it to fix problems. Adobe Audition has a “Spectral Frequency Display” that’s the same idea. Free Audacity shows a spectrogram view too. They all rely on the FFT/STFT loop above. The differences between cheap and expensive tools are almost never about this foundation — they’re about how cleverly the modify step decides what’s noise and what’s signal, which is exactly what the next chapters are about.
The repair toolbox: what breaks, and the “de-” family
Before we open up each tool, here’s the lay of the land — the common ways a recording goes wrong, and the tool that addresses each. You’ll notice almost everything starts with “de-”: de-noise, de-hum, de-click. That little prefix just means “take away.”
| What you hear | What it is | The tool |
|---|---|---|
| A steady background shhhhh | Noise / hiss — even, random energy across the high end | denoise |
| A low hummmm or buzz | Mains hum — electrical 50/60-cycle leakage and its echoes | de-hum |
| Sharp ticks and pops | Clicks — brief spikes (vinyl dust, bad digital edits) | de-click |
| A harsh, fuzzy, “broken speaker” tone on loud bits | Clipping — peaks chopped flat by overload | de-clip |
| It sounds like it’s in a bathroom | Reverb — the room’s echoes smearing the sound | de-reverb |
| Piercing ssss and sshhh | Sibilance — over-loud consonants | de-ess |
| A low whoomph on outdoor recordings | Wind — turbulence rumbling the mic | de-wind |
| A thump on every “p” and “b” | Plosives — breath bursts hitting the mic | de-plosive |
| Scratchy noise when someone moves | Rustle — clothing against a clip-on mic | de-rustle |
| Too quiet / too loud / inconsistent | Level — wrong loudness for delivery | normalize |
| Muffled, “telephone-y” | Lost highs — squashed by heavy compression | enhance |
Two big families
Look closely and the tools split into two families, and the split matters because it tells you what’s possible.
Reducers turn down something unwanted that’s mixed alongside the good sound: hiss, hum, reverb, sibilance, wind. These work in the frequency view from the last chapter — find the unwanted pitches, turn them down, leave the rest. The good news: the wanted sound is still there underneath, so a careful reduction can be nearly invisible. The catch: if the unwanted thing overlaps the wanted thing too much, turning one down dents the other (this is where the “underwater” artefact comes from when people over-do noise reduction).
Repairers rebuild sound that’s been destroyed — clicks that punched a hole in the waveform, or clipping that chopped the tops off. Here the original is genuinely gone, and the tool has to invent a plausible replacement from the surrounding good audio, like an art restorer repainting a scratched corner of a canvas. The good news: done well, you can’t tell. The honest catch: it’s a guess, and the bigger the hole, the more it’s guessing.
flowchart TD
P["a problem in the recording"] --> Q{"is the good sound <b>destroyed</b>,<br/>or just <b>mixed</b> with junk?"}
Q -->|"mixed alongside it"| R["<b>REDUCER</b><br/>find the unwanted pitches,<br/>turn them down<br/><br/>denoise · de-hum · de-reverb<br/>de-ess · de-wind"]
Q -->|"destroyed"| S["<b>REPAIRER</b><br/>invent a plausible<br/>replacement<br/><br/>de-click · de-clip"]
R --> RR["good sound still there<br/>underneath → can be<br/>nearly invisible"]
S --> SS["original is gone → it's an<br/>educated guess; bigger hole,<br/>more guessing"]
Keeping these two families straight saves you a lot of disappointment. Asking a reducer to remove hiss that’s quieter than the voice? Easy. Asking a repairer to perfectly rebuild a badly clipped scream? It’ll help, but don’t expect a miracle — there was no original left to recover.
A golden rule of order
When you chain several fixes, order matters. A good default, and roughly the order cathar’s chapters follow:
- Repair destruction first — de-click, de-clip. (You don’t want later tools analysing damaged samples.)
- Remove steady offenders — de-hum, then de-noise.
- Tame the spectral stuff — de-reverb, de-ess.
- Shape and deliver last — enhance, then set the loudness.
With the map in hand, let’s open the tools one at a time — starting with the most common complaint of all: hiss.
Hiss and noise — denoising
The steady shhhhh behind a recording — tape hiss, a noisy microphone preamp, an air-conditioner, the electrical fuzz of a cheap interface — is the most common complaint in all of audio. Removing it is called denoising, and it’s the clearest example of the “analyse → modify → resynthesise” loop from chapter 2.
The core idea: subtract the haze
Recall that on a spectrogram, hiss looks like a faint, even haze sitting under everything, across many pitches at once, all the time. The voice, by contrast, is bright bands that come and go.
So denoising asks a simple question, pitch by pitch: “how much faint, ever-present haze is at this pitch?” That amount is the noise profile, a measurement of the hiss’s recipe. Once you know it, you go through the sound and, for each pitch in each moment, subtract the haze amount. Where the voice is loud, subtracting a little haze barely changes it. Where there’s only haze (the silences between words), subtracting the haze amount leaves… almost nothing. Silence. That’s the trick, and it has a name: spectral subtraction.
Two questions remain: how do you measure the haze, and how hard do you subtract.
Measuring the haze (the noise profile)
There are two ways, and cathar offers both:
- Learn it from silence. If your recording has a moment of “room tone” — a patch with no voice, just the background — you can point the tool at it and say “this is the noise; memorise its recipe.” Cathar calls this a noiseprint. It’s by far the more accurate method, because you’re showing the tool a clean example of exactly what to remove.
- Guess it automatically. If there’s no clean silence, the tool assumes the quietest moments at each pitch are mostly haze, and builds the profile from those. Cathar does this with a method called minimum statistics. It’s convenient and needs no setup, but it’s a guess, so it’s a little gentler and less surgical than a real noiseprint.
In practice: if you can spare even half a second of “just the room,” learn a noiseprint from it. It is the single biggest quality lever in denoising, in any program.
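Both ways of measuring the haze can be sketched in a few lines. This is a deliberately crude illustration with made-up function names and toy numbers, not cathar’s implementation. In particular, the real minimum-statistics method is more careful than taking a bare minimum: it smooths over time and corrects for the fact that a minimum always underestimates the true noise level.

```python
def learn_noiseprint(room_tone_frames):
    """Average strength per pitch over frames known to contain only noise."""
    count = len(room_tone_frames)
    return [sum(frame[k] for frame in room_tone_frames) / count
            for k in range(len(room_tone_frames[0]))]

def guess_noise_floor(all_frames):
    """Crude automatic guess: the quietest each pitch ever gets."""
    return [min(frame[k] for frame in all_frames)
            for k in range(len(all_frames[0]))]

# Toy data: 3 pitches (low, mid, high); each row is one slice of time.
room_tone = [[0.02, 0.03, 0.10],
             [0.02, 0.05, 0.12],
             [0.02, 0.04, 0.08]]
speech = [[0.02, 0.90, 0.30],
          [0.03, 0.70, 0.11],
          [0.02, 0.04, 0.09]]   # the last row is a pause between words

learned = learn_noiseprint(room_tone)
guessed = guess_noise_floor(speech)
print(learned)
print(guessed)
```

In this toy case the automatic guess lands close to the learned profile only because the speech happens to contain a pause. With no pause at all, the guess would be too high at the pitches where the voice lives.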
How hard to subtract (the aggressiveness knob)
Subtract too little and hiss remains. Subtract too much and you start eating into
the voice — and you create a very recognisable artefact: a twinkly, watery,
“underwater” or “musical noise” sound. (It happens because subtraction can leave
isolated little flecks of pitch that warble.) So every denoiser has an
aggressiveness control. In cathar it’s --alpha (gentle around 1.5, strong
around 4–6) plus a floor (--beta) that refuses to ever fully silence a pitch,
which keeps the result natural instead of glassy.
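Here is a minimal sketch of how an aggressiveness setting and a floor might enter the subtraction. It follows a common textbook form and makes no claim about cathar’s exact formula: the function name `subtract_haze`, the way `alpha` and `beta` are applied, and the default values are all assumptions for illustration.

```python
def subtract_haze(strengths, noiseprint, alpha=2.0, beta=0.05):
    """Spectral subtraction on one slice.

    strengths  : how much of each pitch is in this slice
    noiseprint : how much haze is expected at each pitch
    alpha      : aggressiveness (how many "hazes" to subtract)
    beta       : floor, as a fraction of the original strength
    """
    cleaned = []
    for strength, haze in zip(strengths, noiseprint):
        reduced = strength - alpha * haze
        floor = beta * strength          # never go fully silent
        cleaned.append(max(reduced, floor))
    return cleaned

noiseprint = [0.02, 0.04, 0.10]
loud_word = [0.02, 0.90, 0.30]
pause = [0.02, 0.04, 0.09]

cleaned_word = subtract_haze(loud_word, noiseprint)
cleaned_pause = subtract_haze(pause, noiseprint)
print(cleaned_word)
print(cleaned_pause)
```

In the loud word, the strong middle pitch barely changes (0.90 becomes 0.82). In the pause, every pitch falls to its floor. Raising `alpha` bites harder into the high pitch of the word too, which is exactly the hiss-versus-artefacts trade-off.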
The whole game of denoising is this trade-off: hiss versus artefacts. There is no setting that removes all hiss and adds nothing; the skill is finding the spot where what’s left is less distracting than what you’ve added.
A gentler cousin: the Wiener filter
Instead of bluntly subtracting the haze, you can scale each pitch by how likely
it is to be real sound versus noise: pitches that tower over the haze are kept
almost fully, pitches barely above it are turned right down. This is the Wiener
filter (cathar’s --wiener option). On steady, gentle hiss it often sounds
smoother and less twinkly than plain subtraction. Same goal, slightly different
maths for the “modify” step.
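The “scale by how likely it is to be real sound” idea has a simple textbook form, sketched below. Treat it as an illustration under stated assumptions rather than cathar’s code: the function name `wiener_gain` is made up, and estimating the wanted share as “one minus the haze’s share of the energy” is one standard way to do it, not necessarily the one cathar uses.

```python
def wiener_gain(strength, haze):
    """Volume setting between 0 and 1 for one pitch in one slice."""
    if strength <= 0.0:
        return 0.0
    # Estimated share of the energy (strength squared) that is NOT haze.
    wanted_share = 1.0 - (haze * haze) / (strength * strength)
    return max(0.0, wanted_share)

haze = 0.10
for strength in (1.00, 0.30, 0.12, 0.10):
    gain = wiener_gain(strength, haze)
    print(f"strength {strength:.2f} -> keep {gain:.0%}")
```

A pitch that towers over the haze is kept almost entirely, one barely above it is turned most of the way down, and the change between the two is gradual. That gradual slope is where the smoother sound comes from.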
How the big tools do it
This is mature, well-understood territory, and the concepts are identical everywhere:
- Audacity (free) has “Noise Reduction,” which works exactly like cathar’s noiseprint method: you select a quiet bit, click “Get Noise Profile,” then apply. Same idea, same trade-off knobs.
- Adobe Audition offers both a learned-profile “Noise Reduction” and an adaptive “DeNoise” that guesses, like cathar’s two modes.
- iZotope RX is where the money shows. Its “Spectral De-noise” does classic profile-based subtraction very well, but its flagship “Voice De-noise” and the newer AI-powered modes use machine learning — models trained on thousands of hours of speech — to tell voice from noise far more cleverly than any “subtract the haze” rule. That’s the real frontier: the concept is the same, but a trained model makes a smarter decision in the modify step, pulling clean speech out of noise that classical subtraction would smear.
Cathar sits firmly in the classical camp: transparent, predictable, no trained model (no “weights”), genuinely good on steady hiss, and honestly outclassed by RX’s machine learning on the hardest cases (heavy, non-steady background noise like a busy café). Knowing which problem you have tells you which tool you need.
Hum — getting rid of the buzz
That low, steady hummmm under so many recordings has a single, boring villain: the electrical mains. Wall power doesn’t sit still. It alternates back and forth 50 times a second in most of the world, and 60 times a second in most of the Americas. Any nearby cable, cheap power supply, or poorly grounded microphone can leak a little of that alternation into the audio as a pure, relentless tone.
Why it’s actually the easy one
Go back to the spectrogram. Hum is the simplest possible picture: a single razor-thin horizontal line, sitting at exactly 50 (or 60) cycles per second, present from start to finish. It doesn’t move. It doesn’t overlap much with the important parts of a voice. That makes it a sitting duck.
The tool for a single unwanted pitch is a notch filter — think of it as a very narrow pair of scissors that snips out one exact frequency and leaves everything on either side untouched. Tell it “remove 60 cycles per second” and it carves a thin notch there, killing the hum while the voice just above and below sails through.
The harmonics catch
There’s one wrinkle that trips up beginners. Mains hum is rarely just the base tone. The same electrical leakage usually brings along faint copies at exact multiples: 120, 180, 240… (for a 60-cycle hum), or 100, 150, 200… (for 50). These copies are called harmonics, and they’re why hum often sounds more like a “buzz” than a pure tone — your ear hears the whole stack.
So a hum remover doesn’t place one notch; it places a comb of them — one at the base frequency and one at each harmonic up the spectrum.
loudness
│ voice and music live in the gaps — untouched
│ ▁▂▃▅▇█▇▅▃▂▁ ▁▂▃▅▇▇▅▃▂▁ ▁▂▃▅▅▃▂▁
└──┬─────┬─────┬─────┬─────┬─────┬──────► pitch
60 120 180 240 300 360 Hz
V V V V V ← one narrow "notch" snipped at the
hum +harmonics (exact multiples) hum and each of its harmonics
In cathar you say dehum --freq 60 --harmonics 5, and it snips 60, 120, 180,
240, and 300 cycles.
If a hum still buzzes after you remove the base tone, you simply haven’t notched
enough of its harmonics.
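A comb of notches can be sketched with a standard two-pole filter (the “biquad” notch recipe from the widely used Audio EQ Cookbook). This is an illustration, not cathar’s filter design: the function names, the narrowness setting `q=10`, and the test tones are assumptions chosen for the example.

```python
import math

def notch(samples, freq_hz, rate, q=10.0):
    """Apply one narrow notch (a standard two-pole 'biquad') at freq_hz."""
    w0 = 2 * math.pi * freq_hz / rate
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b0, b1, b2 = 1 / a0, -2 * cos_w0 / a0, 1 / a0
    a1, a2 = -2 * cos_w0 / a0, (1 - alpha) / a0
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out

def dehum(samples, rate, freq_hz=60.0, harmonics=5):
    """Snip freq_hz and its multiples: 60, 120, 180, ... for harmonics=5."""
    for k in range(1, harmonics + 1):
        samples = notch(samples, k * freq_hz, rate)
    return samples

def loudness(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

RATE = 8_000
hum = [math.sin(2 * math.pi * 60 * n / RATE) for n in range(2 * RATE)]
voice = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(2 * RATE)]

# Measure the last half second, after the filters have settled.
hum_left = loudness(dehum(hum, RATE)[-RATE // 2:])
voice_left = loudness(dehum(voice, RATE)[-RATE // 2:])
print(f"hum: {hum_left:.4f}   voice-range tone: {voice_left:.4f}")
```

The 60-cycle tone is reduced to practically nothing, while a 1,000-cycle tone standing in for the voice keeps essentially all of its loudness. A narrower notch (higher `q`) disturbs even less of the surrounding sound but takes longer to settle.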
The 50 vs 60 gotcha: if --freq 60 doesn’t help, try --freq 50. A recording made in Europe, most of Asia, Africa, or Australia will usually hum at 50; most of the Americas and parts of Japan at 60. Guessing wrong does almost nothing, because nearly every notch lands between the hum’s actual lines.
How the big tools do it
Once again the concept is universal — narrow notches at the fundamental and its harmonics — and the tools differ mostly in convenience:
- Audacity has a “Notch Filter” effect (you place them by hand) and the free “Hum Removal” and Nyquist plug-ins that automate the comb.
- Adobe Audition’s “DeHummer” gives you a tidy panel: pick 50 or 60, choose how many harmonics, done — exactly cathar’s two controls with a nicer face.
- iZotope RX’s “De-hum” adds a smart twist: real mains hum drifts a tiny bit as the power grid fluctuates, and the line isn’t perfectly stable. RX can track that drift and follow the hum, and can learn the exact harmonic fingerprint of a particular buzz. Cathar’s notches are fixed in place, which is perfectly fine for steady hum but can leave a little residue if the hum wanders.
This is the rare corner of audio where the cheap and free tools are genuinely close to the expensive ones, because the problem is so well-defined. Hum is the friendliest enemy in this whole book.
Clicks and clipping — repairing damaged samples
The last two chapters reduced unwanted sound that sat alongside the good stuff. This chapter is different: here the good sound has been destroyed in places, and the tool has to rebuild it. These are the “repairer” tools from chapter 3, and they behave more like an art restorer repainting a damaged canvas than like a filter.
Clicks — tiny holes in the wave
A click is a brief, violent spike: a speck of dust on a vinyl record, a scratch, or a sloppy digital edit that left a sudden jump. On the waveform it’s a single sample (or a few) shooting way out of line with its neighbours; on the spectrogram it’s a thin vertical streak, because an instantaneous spike contains a flash of every pitch at once.
The fix, de-click, is wonderfully intuitive:
- Find the spike. Compare each sample to the typical level of its neighbours. If one is wildly larger than the local average (say, ten times), it’s almost certainly a click, not real sound. (Cathar’s --threshold is exactly this “how many times louder than normal counts as a click” number.)
- Cut it out and redraw. Delete the offending samples, leaving a tiny gap, and draw a smooth curve across the hole that connects what came before to what comes after. Cathar uses a gentle curved line (a cubic interpolation) so the patch blends in.
Because a click is only a handful of samples — a fraction of a millisecond — the gap is tiny and the redraw is almost always invisible. De-click is one of restoration’s reliable wins.
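The find-and-redraw steps can be sketched as follows. Two simplifications are worth flagging loudly: the patch here is a straight line between the nearest good samples rather than the cubic curve cathar is described as using, and “typical level” is taken as the median of the 16 nearest samples, which is a choice made for this example. The function name `declick` and its parameters are hypothetical.

```python
import math
import statistics

def declick(samples, threshold=10.0, radius=8):
    """Find isolated spikes and redraw them from their neighbours."""
    n = len(samples)
    flagged = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        nearby = [abs(samples[j]) for j in range(lo, hi) if j != i]
        typical = max(statistics.median(nearby), 1e-6)
        if abs(samples[i]) > threshold * typical:
            flagged.append(i)

    repaired = list(samples)
    bad = set(flagged)
    for i in flagged:
        left = i - 1
        while left in bad:
            left -= 1
        right = i + 1
        while right in bad:
            right += 1
        if left < 0 or right >= n:
            continue  # no good sample on one side; leave it alone
        # Straight line from the last good sample to the next good one.
        t = (i - left) / (right - left)
        repaired[i] = samples[left] + t * (samples[right] - samples[left])
    return repaired, flagged

RATE = 8_000
clean = [0.05 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(400)]
damaged = list(clean)
damaged[100] = 0.9   # one sample knocked far out of line

repaired, flagged = declick(damaged)
print(flagged, round(repaired[100], 4), round(clean[100], 4))
```

On this toy signal the one damaged sample is found and the redrawn value lands within a thousandth of the original. A curved patch does better than a straight line when the gap is several samples wide.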
Clipping — when the tops get chopped off
Clipping is nastier. Every recording system has a ceiling: the loudest it can represent (that ±1.0 from chapter 1). Push a signal past the ceiling — record too hot, overdrive a preamp — and the system can’t go higher, so it just flattens the peak off. The rounded tops of the wave become flat plateaus. You hear it as a harsh, fuzzy, “broken speaker” distortion on the loud parts.
recorded fine: CLIPPED (overloaded):
ceiling ┄┄┏━━━━━┓┄┄ ← the top is chopped flat;
╭───╮ ┃ ┃ the real curve is GONE
╱ ╲ ╱ ╲
─╯ ╰─ ─╯ ╰─
de-clip's job: guess the missing dotted peak from the slopes either side
┄┄┄╭╴╴╮┄┄┄ ← an invented, plausible curve
╱ ╲ (a guess, not a recovery)
─╯ ╰─
Here’s the cruel part: when the top is flattened, the information about how high the wave really wanted to go is gone. Unlike a click (a brief spike you can delete), clipping erases whole stretches of the true waveform and leaves a flat line where a curve should be. De-clip has to guess the missing peak from the shape of the wave on either side.
How do you guess a peak you can’t see? You use the fact that real sound is predictable: the wiggle approaching the flat top was on a clear trajectory, and the part leaving it continues that trajectory, so you can extrapolate the curve that “should” have been there — rising above the ceiling and coming back down — instead of leaving a plateau. The better the prediction model, the more natural the rebuilt peak.
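The “continue the trajectory” idea can be shown with a very simple stand-in: a smooth curve (a cubic Hermite curve) that leaves the last good sample at the slope the wave was rising with, and arrives at the next good sample at the slope it falls away with. This is far cruder than the autoregressive or sparse-reconstruction methods real declippers use, and it is not cathar’s method. The function name `declip` and the test signal are assumptions for illustration.

```python
import math

def declip(samples, ceiling):
    """Rebuild flat-topped peaks with a smooth curve through their edges."""
    out = list(samples)
    n = len(out)
    i = 0
    while i < n:
        if abs(samples[i]) < ceiling - 1e-9:
            i += 1
            continue
        start = i                             # first flattened sample
        while i < n and abs(samples[i]) >= ceiling - 1e-9:
            i += 1
        end = i - 1                           # last flattened sample
        left, right = start - 1, end + 1      # nearest trustworthy samples
        if left < 1 or right > n - 2:
            continue                          # too close to the edge to judge
        span = right - left
        p0, p1 = samples[left], samples[right]
        m0 = (samples[left] - samples[left - 1]) * span    # slope going in
        m1 = (samples[right + 1] - samples[right]) * span  # slope coming out
        for j in range(start, end + 1):
            t = (j - left) / span
            h00 = 2 * t**3 - 3 * t**2 + 1
            h10 = t**3 - 2 * t**2 + t
            h01 = -2 * t**3 + 3 * t**2
            h11 = t**3 - t**2
            out[j] = h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1
    return out

original = [math.sin(2 * math.pi * n / 100) for n in range(300)]
clipped = [max(-0.8, min(0.8, s)) for s in original]
rebuilt = declip(clipped, ceiling=0.8)
worst = max(abs(r - o) for r, o in zip(rebuilt, original))
print(f"tallest rebuilt peak {max(rebuilt):.3f}, worst error {worst:.3f}")
```

On a pure, lightly clipped tone the rebuilt peaks climb back close to their true height. Real recordings are far less predictable than a pure tone, which is why serious tools need the heavier mathematics described below.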
An honest word about de-clip
De-clip is the hardest tool in this book, and it’s important to set expectations:
- A lightly clipped recording (a few peaks just kissing the ceiling) cleans up beautifully — there’s lots of surrounding curve to predict from, and the gaps are short.
- A badly clipped recording (long flat stretches, a distorted scream) can be softened but never truly restored. The original is gone; the tool is inventing, and across a long flat run even a clever guess drifts. Expect “less harsh,” not “as if it never happened.”
The professional state of the art here is genuinely sophisticated — it treats the missing samples as unknowns and solves for the values that best fit a model of the surrounding sound (an “autoregressive” prediction, the classic method) or that make the result as simple as possible in the frequency view (a modern “sparse reconstruction” approach). These are real mathematics, not a smooth line across the gap, and they’re why a top declipper can rebuild a peak so convincingly.
How the big tools do it
- Audacity has a “Clip Fix” effect that estimates the missing peaks from the surrounding slope — the same basic idea as a simple de-clip.
- Adobe Audition’s “DeClipper” and the click-focused “Automatic Click Remover” handle both problems with adjustable thresholds.
- iZotope RX is, again, the benchmark. Its “De-clip” and “De-click” modules use the advanced prediction/reconstruction methods above and apply them automatically across a whole file; “De-crackle” extends de-click to the dense, continuous crackle of old records. For serious restoration of damaged vinyl or badly clipped masters, RX is the tool the pros reach for.
Cathar’s de-click is solid and reliable, and its de-clip uses the modern “sparse reconstruction” method described above: A-SPADE (Kitić, Bertin & Gribonval, 2015), the same family iZotope-class tools use. It treats the clipped samples as unknowns and solves for the signal that is simplest in the frequency view (sparsest across a windowed, overlapping spectrum) while keeping every reliable sample exact and every clipped sample beyond the threshold, so a peak is rebuilt toward its true height rather than flattened to a plateau. It’s an iterative solve (a little slower than a one-shot fill, and worth it). Light-to-moderate clipping cleans up convincingly; it’s still not a substitute for RX on heavily distorted material, because across long flat runs any tool is guessing. As always: knowing how badly something is damaged tells you whether any tool can save it.
Rooms and reverb — taking the echo away
Record someone in a tiled bathroom and they sound like they’re in a tiled bathroom. Record them in a small carpeted booth and they sound “close” and “dry.” The difference is reverb — the thousands of tiny echoes a room adds as sound bounces off the walls, floor, and ceiling before it reaches the mic.
What reverb really is
When you speak, the mic hears two things. First the direct sound — your voice travelling straight to it. Then, a few thousandths of a second later, a flood of reflections — the same sound arriving again and again, having bounced around the room, each copy a little quieter and a little later than the last. That trail of fading echoes is reverb. A big stone hall has a long, obvious trail; a small treated studio has almost none.
Reverb is the trickiest “reducer” in this book, because the echoes are made of the exact same sound as the voice — they’re just delayed, quieter copies. You can’t separate them by pitch the way you separate hiss, because they share the voice’s pitches entirely.
The trick: watch how each pitch fades
So de-reverb uses timing instead of pitch. Here’s the insight. When you start a new word, the direct sound arrives as a sharp onset — a quick rise in energy. Then you stop, but the room keeps ringing: the energy at each pitch decays away in a smooth, tell-tale tail. That decaying tail is the reverb.
A de-reverb tool watches each pitch over time and learns the difference between the punchy onsets (keep these — they’re the real voice) and the lingering decay tails (turn these down — they’re the room). In effect it follows the energy at every pitch and, whenever the energy is just coasting downward toward the room’s background level, it gates it back. The direct, intentional sound survives; the ringing afterglow is suppressed.
Cathar does exactly this with a two-pass scan: first it measures how low each
pitch typically sinks (the “reverb floor”), then it gently gates anything sitting
near that floor. The --strength knob controls how aggressively it chases the
tails.
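The two-pass idea can be sketched for a single pitch. Everything specific here is an assumption made for illustration, not a description of cathar’s code: the function name `gate_tails`, using the 10th percentile as the “floor”, the `nearness` multiplier, and the simple on/off decision.

```python
def gate_tails(levels, strength=0.7, nearness=3.0):
    """Turn down one pitch wherever it is idling near its own floor.

    levels   : loudness of ONE pitch, slice by slice
    strength : 0.0 leaves tails alone, 1.0 removes them entirely
    nearness : how many times the floor still counts as "near the floor"
    """
    # Pass 1: how low does this pitch typically sink? (10th percentile)
    ranked = sorted(levels)
    floor = ranked[len(ranked) // 10]

    # Pass 2: punchy moments pass untouched; near-floor moments are gated.
    gated = []
    for level in levels:
        if level <= nearness * floor:
            gated.append(level * (1.0 - strength))
        else:
            gated.append(level)
    return gated, floor

# One pitch over 12 slices: a word (onset, then a fading room tail), twice.
levels = [0.90, 0.60, 0.20, 0.08, 0.04, 0.02,
          0.85, 0.55, 0.18, 0.07, 0.03, 0.02]
gated, floor = gate_tails(levels)
print(floor)
print([round(g, 3) for g in gated])
```

The onsets and the early part of each decay pass through untouched; only the last, lingering slices near the floor are pulled down. A real tool would fade the gate in and out gradually rather than switching it, to avoid the “gated” sound described in the next section.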
Why it’s never perfect
Two honest limitations:
- Onsets and tails overlap. Fast speech starts a new word before the previous one’s tail has died, so the tool is always making a judgement call, and pushed hard it can make a voice sound a bit hollow, gated, or “phasey.”
- You can dry a room but not delete it. De-reverb shortens and softens the trail; it can’t put you in a different room. Targeting a modest improvement — “less boomy,” “a bit closer” — gives far nicer results than chasing total removal.
How the big tools do it
- iZotope RX’s “De-reverb” is the leader, and the gap here is large. It uses a learned model of the reverb tail and, in recent versions, machine learning to separate dry voice from room — it can take a startling amount of reverb off a voice while keeping it natural. There’s a separate “Dialogue De-reverb” tuned for speech.
- Acon Digital and Accentize make well-regarded dedicated de-reverb plug-ins used in film post-production, several now ML-based.
- Audition has a “DeReverb” effect; Audacity has no real built-in de-reverb, which tells you how much harder this problem is than hum or hiss.
This is the area where classical, no-model tools like cathar are most outclassed by modern ML, because separating a sound from delayed copies of itself is exactly the kind of “needs a trained ear” task that a learned model does best. Cathar’s gate-the-tails approach gives a real, useful reduction on moderate reverb; for heavy, film-grade de-reverberation, RX is in a different league.
Harsh “S” sounds — de-essing
Some voices, some microphones, and a lot of close-up podcast and voiceover recording produce a piercing, splashy hiss on every “s,” “sh,” “ch,” and “t.” It’s called sibilance, and once you notice it you can’t un-notice it. Taming it is de-essing — and it’s a nice example of a tool that has to act only at certain moments, not all the time.
Why “s” sounds are special
Speech is mostly made down in the low and middle pitches — the body and warmth of a voice. But the sibilant consonants are different: an “s” is essentially a short burst of high-pitched noise, concentrated up near the top of the spectrum (very roughly 4,000–10,000 cycles per second). On a spectrogram, every “s” is a bright little cloud up high, separate from the vocal bands below.
That separation is the key. A de-esser is really just a volume control that only listens to the high end, and only turns down when that high end gets too loud. When you say a vowel, there’s little energy up top, so the de-esser does nothing. When you hit an “s,” the high end spikes, the de-esser notices, and it ducks just that burst by just enough — then lets go. The warmth of the voice below is never touched.
Two controls run the show, and they’re the same in every tool:
- A crossover frequency: the pitch above which the de-esser pays attention (cathar’s --freq, default 4,000). Set it where the harsh “ss” lives.
- A threshold: how loud the high end has to get before the tool reacts. Too sensitive and it dulls every consonant; too lax and the worst “s” sounds still cut through.
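For the curious, here is a minimal sketch of that "volume control that only listens to the high end", written in Python purely for illustration. It is not cathar's code (cathar is written in Rust), and the function name deess, the one-pole crossover, and the release_ms setting are all assumptions made for this example.

```python
import math

def deess(samples, sample_rate, freq=4000.0, threshold=0.05, release_ms=5.0):
    """Turn down only the band above `freq`, and only while it is too loud."""
    split = math.exp(-2.0 * math.pi * freq / sample_rate)          # one-pole crossover
    release = math.exp(-1.0 / (release_ms * 0.001 * sample_rate))  # envelope decay per sample
    low = 0.0   # running low-passed signal (the warmth of the voice)
    env = 0.0   # running loudness of the high band
    out = []
    for x in samples:
        low = (1.0 - split) * x + split * low
        high = x - low                        # whatever the low-pass removed
        env = max(abs(high), release * env)   # jump up instantly, fall back smoothly
        gain = 1.0 if env <= threshold else threshold / env
        out.append(low + gain * high)         # lows untouched, highs ducked
    return out
```

Feed it a low, vowel-like tone and it comes out essentially unchanged, because the high band never crosses the threshold; feed it a loud 8,000-cycle burst and only the high part is pulled back. A real de-esser uses a sharper crossover than this gentle one-pole split.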
Going multiband and adaptive
There are two refinements that separate a crude de-esser from a good one, and cathar offers both:
- Multiband. Sibilance isn’t one pitch: a sharp “s” and a softer “sh” peak in different places up top. A multiband de-esser splits the high end into several sub-bands and watches each one independently, so it can duck the exact region that’s offending without dulling the rest. (Cathar’s --bands 4 turns this on.)
- Adaptive. People get louder and quieter as they talk, so a fixed threshold is wrong half the time. An adaptive de-esser keeps a running sense of how loud each band normally is and reacts to sudden jumps above its own recent average, so it follows the speaker instead of needing constant babysitting.
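The adaptive idea can be sketched in a few lines of Python. This is an illustration under assumptions, not cathar's implementation: the name adaptive_gains and the ratio and memory settings are invented here, and the input is taken to be a list of loudness readings for one band, one reading per short slice of time.

```python
def adaptive_gains(band_levels, ratio=2.0, memory=0.95):
    """Return one gain per frame for a single band.

    A frame is ducked only when its level jumps above `ratio` times the
    band's own recent average; otherwise the gain is 1.0 (no change).
    """
    average = band_levels[0] if band_levels else 0.0
    gains = []
    for level in band_levels:
        limit = ratio * average
        if average > 0.0 and level > limit:
            gains.append(limit / level)   # pull the jump back down to the limit
        else:
            gains.append(1.0)
        # Slowly fold every frame into the running average, so a speaker
        # who gets louder for good becomes the new "normal".
        average = memory * average + (1.0 - memory) * level
    return gains
```

A multiband version would simply run this once per sub-band. Note the design choice in the last line: because the average keeps learning, a short burst gets ducked but a lasting change in level stops being treated as a problem after a few frames.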
How the big tools do it
- Every major DAW (Logic, Pro Tools, Ableton, Cubase) ships a de-esser plug-in, because sibilance is one of the most common vocal-mixing problems. They all work on the crossover-plus-threshold principle above; the better ones are multiband.
- FabFilter Pro-DS and Waves Sibilance are the plug-ins mixing engineers reach for; Pro-DS in particular is prized for sounding transparent because it’s cleverly adaptive and only touches the sibilant energy.
- iZotope RX’s “De-ess” adds spectral precision — it can attenuate the offending high-frequency cloud only where and when it occurs on the spectrogram, which is gentler than turning down a whole band.
De-essing is a place where cathar’s multiband, adaptive approach is genuinely competitive with the mainstream, because the problem is well-bounded and doesn’t need a trained model — it needs to listen to the right pitches at the right moments, which classical DSP does perfectly well.
Wind, pops, and rustle
Three more everyday nuisances, all caused by physical things hitting the microphone rather than by electronics or rooms. They share a theme, which is why they’re together: each is a burst of unwanted energy concentrated in a particular part of the spectrum, and each is removed by acting on that part — sometimes all the time, sometimes only during the burst.
Wind — the low rumble
Record outdoors without a foam or furry “dead-cat” cover and the breeze buffeting the mic produces a low, blustery rumble, sometimes a roar. Crucially, almost all of that energy sits very low, below the range where speech lives.
That makes the cure simple: a high-pass filter, a tool that lets the high stuff pass and blocks the low stuff. Set its cutoff at, say, 80 cycles per second and everything below (the wind rumble) is rolled off while the voice above is left essentially untouched. Cathar’s dewind --cutoff 80 is exactly this, built from a classic filter shape (a Butterworth), which stays flat and even across the pitches it keeps; stacked to a high enough order it also rolls off steeply, so the rumble drops away fast without disturbing the voice just above it. It’s the same high-pass you hear engineers reach for the instant an outdoor clip starts rumbling.
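If you'd like to see how small such a filter really is, here is a sketch of a second-order Butterworth high-pass in Python. The function name is made up for this example, and the filter order is an assumption: the text doesn't say what order cathar uses, and steeper filters are typically built by chaining several stages like this one.

```python
import math

def butterworth_highpass(samples, sample_rate, cutoff=80.0):
    """Second-order Butterworth high-pass (a standard bilinear-transform biquad)."""
    w0 = 2.0 * math.pi * cutoff / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)   # sin(w0) / (2Q) with Q = 1/sqrt(2)
    a0 = 1.0 + alpha
    b0 = (1.0 + cos_w0) / 2.0 / a0
    b1 = -(1.0 + cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0                 # the filter's short memory
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

With the cutoff at 80, a 20-cycle rumble comes out roughly 24 dB quieter while a 300-cycle tone in the voice range passes almost untouched. Each extra stage in a chain adds to the steepness below the cutoff.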
Plosives — the “p” thumps
Get close to a mic and say “peter piper” and each “p” and “b” fires a little puff of air straight at the capsule, producing a low thump — a plosive. Like wind, a plosive is mostly low-frequency energy — but unlike wind, it’s not constant: it’s a brief burst, only on the plosive consonants.
So instead of filtering all the time, de-plosive watches the low end and acts only when it suddenly thumps: it spots the short bursts of excess low energy and ducks just those moments, leaving the steady low warmth of the voice in between alone. (The physical prevention, by the way, is the round foam ball or mesh “pop filter” you’ve seen in front of studio mics — but when you’re handed a recording that already has the thumps, software has to clean up after the fact.)
Rustle — the clip-on-mic scratch
The little clip-on (lavalier) mics used in interviews and film sit against clothing, and every time the wearer shifts, the fabric scrapes the mic and makes a scratchy rustle. This one is sneakier: it’s a brief burst like a plosive, but it lands in the mid range, right among the consonants of speech, so you can’t just filter it away without dulling the voice.
De-rustle therefore does the same “act only during the burst” trick as de-plosive, but aimed at the mid-range: it watches a band roughly where rustle lives (around 1,500–6,000 cycles) and, when energy there spikes briefly above its normal level, it pulls just that fleeting spike back down, while sustained speech in the same band passes through. It’s the hardest of the three, because the rustle and the wanted consonants are near neighbours.
How the big tools do it
- The high-pass for wind is utterly universal — every DAW channel strip, every mixer, has a low-cut button. Nothing exotic here, in cathar or anywhere.
- For plosives, engineers often just automate a quick low-cut on the offending word, or use a dynamic filter; iZotope RX’s “De-plosive” automates exactly the spot-the-thump-and-duck-the-lows behaviour cathar uses.
- Rustle is genuinely hard, and it’s a showcase for ML: iZotope RX’s “De-rustle” was one of the first ML-driven restoration modules precisely because fabric noise overlaps speech so much that a learned model separates them far better than a rule about bands. Cathar’s transient-suppression approach gives a useful reduction on obvious rustles; deep, speech-tangled rustle is RX’s territory.
Notice the recurring pattern across this whole section: a steady offender (wind) gets a filter that’s always on; a bursty offender (plosive, rustle) gets a watcher that acts only during the burst. That single distinction — always on versus only-when-it-happens — explains an enormous amount of audio software.
How loud is loud? — loudness and LUFS
You’ve cleaned up the recording. Now you have to make it the right loudness to publish — for a podcast, a video, a broadcast, a music stream. This sounds trivial (“just turn it up”) and is secretly one of the most misunderstood topics in audio. Getting it right is normalization, and getting it right the modern way means understanding a unit called LUFS.
“Loud” is not “tall”
The naïve way to set level is to look at the peak — the single tallest sample in the file — and turn everything up until that peak just touches the ceiling. This is peak normalization, and it has a fatal flaw: it tells you nothing about how loud something sounds.
A sudden snare hit and a sustained shout can have the same peak height, yet the shout sounds far louder, because loudness is about how much energy there is over time, not how tall one instant is.
     a brief tick              a sustained tone
     (same PEAK height, but much quieter to the ear)

+1 ┤    █                 ████████████████████
 0 ┼────█──────────       ████████████████████
-1 ┤    █                 ████████████████████

     one tall spike       loud the whole time
     PEAK: maxed          PEAK: identical
     LOUDNESS: low  ◄──►  LOUDNESS: high
Peak-normalize a spiky, dynamic podcast and a dense, heavily compressed one to the same peak and the dense one will sound much louder. That’s why, for decades, some adverts felt like they were screaming at you between TV shows: everyone was peak-normalizing and then squashing their audio to be as dense as possible.
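The tick and the sustained tone from the picture can be measured in a few lines of Python. This sketch uses RMS (the plain average energy) as a crude stand-in for loudness; it is not yet LUFS, and the helper names are just for this example.

```python
import math

def peak(samples):
    """Height of the single tallest sample."""
    return max(abs(s) for s in samples)

def rms(samples):
    """Root-mean-square level: a simple measure of energy over time."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

rate = 48000
tick = [0.0] * rate
tick[100] = 1.0   # one full-height sample, silence otherwise
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(rate)]

for name, clip in (("tick", tick), ("tone", tone)):
    print(f"{name}: peak {peak(clip):.3f}, rms {rms(clip):.4f}")
```

Both clips report the same peak, yet the tone's average energy is more than a hundred times the tick's (about 44 dB). A peak meter cannot tell them apart; anything that averages over time can.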
LUFS: measuring perceived loudness
The fix was an international standard (its name is ITU-R BS.1770, adopted for broadcast as EBU R128) that measures loudness the way ears experience it, not the way a ruler does. The unit is the LUFS — “Loudness Units relative to Full Scale.” Bigger negative number = quieter. Three ideas make it match hearing:
- It averages energy over time, so a sustained sound reads louder than a brief spike of the same height — exactly as you hear it.
- It weights pitches roughly like your ear does. Your hearing pays little attention to very deep bass and is most sensitive in the upper-mid range, so the meter turns the deepest lows down and gives the upper pitches a little more say. (This pitch-weighting is called K-weighting.)
- It ignores the silences. Long gaps shouldn’t drag the average down, so the measurement “gates out” the quiet bits and only averages the parts that are actually playing.
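Here is a simplified Python sketch of the first and third ideas, averaging over time and gating out the quiet bits. It follows the block sizes and gate levels laid down in BS.1770 (400 ms blocks stepped every 100 ms, an absolute gate at -70, a relative gate 10 units below the average), but it deliberately leaves out the K-weighting filter and handles one channel only, so its numbers are not true LUFS. The function names are invented for this example.

```python
import math

def block_energies(samples, sample_rate):
    """Mean square of 400 ms blocks, stepping 100 ms at a time."""
    size = int(0.4 * sample_rate)
    hop = int(0.1 * sample_rate)
    return [sum(s * s for s in samples[i:i + size]) / size
            for i in range(0, len(samples) - size + 1, hop)]

def to_loudness(mean_square):
    """Convert an average energy to the loudness scale used by the standard."""
    if mean_square <= 0.0:
        return float("-inf")
    return -0.691 + 10.0 * math.log10(mean_square)

def gated_loudness(samples, sample_rate):
    energies = block_energies(samples, sample_rate)
    # Absolute gate: ignore blocks that are essentially silence.
    kept = [e for e in energies if to_loudness(e) > -70.0]
    if not kept:
        return float("-inf")
    # Relative gate: ignore blocks far quieter than the average of what is left.
    relative = to_loudness(sum(kept) / len(kept)) - 10.0
    kept = [e for e in kept if to_loudness(e) > relative]
    return to_loudness(sum(kept) / len(kept))
```

Measure two seconds of tone, then the same tone followed by eight seconds of silence: the gated figure barely moves, whereas a plain average over the whole file would drop by several units just because of the gap.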
The upshot: two pieces of audio at the same LUFS sound equally loud, even if one is a whisper-and-shout drama and the other a steady narrator. That’s why the whole delivery world now specifies loudness in LUFS. Common targets:
| Where it’s going | Target |
|---|---|
| Broadcast TV / radio (EBU R128) | −23 LUFS |
| Podcasts (Apple/Spotify spoken) | −16 LUFS |
| Music streaming (Spotify, YouTube) | −14 LUFS |
Cathar measures true, gated, K-weighted LUFS and turns the whole file up or down
by one amount to hit your target: normalize --target -16.
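The "one amount" is simple arithmetic, sketched here in Python with made-up function names. The gain in decibels is just the target minus the measurement, and the standard conversion 10^(dB/20) turns that into the number each sample is multiplied by.

```python
def normalization_gain(measured_lufs, target_lufs):
    """Return (gain in dB, multiplier) needed to move from measured to target."""
    gain_db = target_lufs - measured_lufs
    return gain_db, 10.0 ** (gain_db / 20.0)

def apply_gain(samples, multiplier):
    """Turn the whole file up or down by one fixed amount."""
    return [s * multiplier for s in samples]

# A file measured at -22 LUFS, aiming for -16:
gain_db, multiplier = normalization_gain(-22.0, -16.0)
```

In that example the file needs 6 dB of gain, which means multiplying every sample by a little under 2. Because every sample gets the same multiplier, the balance between loud and quiet moments inside the recording is untouched.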
The true-peak safety net
There’s one last trap. When digital audio is turned back into sound, the player draws a smooth curve through the samples — and that curve can briefly poke higher than any actual sample, between the dots. These hidden overshoots are true peaks (“inter-sample peaks”), and if they cross the ceiling they cause nasty distortion on some devices even though no stored sample looked too loud.
So a proper loudness normalizer doesn’t just hit the LUFS target — it also keeps
the true peak under a safe ceiling (commonly −1 dBTP). Cathar holds the gain
back if pushing for the loudness target would breach that ceiling
(--true-peak -1), trading a hair of loudness for a guarantee it never clips on
playback.
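To make the hidden overshoot concrete, here is a Python sketch that estimates a true peak by reading the smooth curve at points between the samples, then caps the loudness gain so the ceiling is respected. It is an illustration only: the names, the 4x oversampling, and the short Hann-tapered interpolation curve are choices made for this example, not a description of cathar's or the standard's exact meter.

```python
import math

def true_peak(samples, oversample=4, reach=16):
    """Estimate the inter-sample peak by interpolating between samples."""
    peak = max(abs(s) for s in samples)            # never below the sample peak
    for i in range(reach, len(samples) - reach):   # edges skipped for simplicity
        for k in range(1, oversample):
            t = i + k / oversample                 # a point between sample i and i+1
            value = 0.0
            for j in range(i - reach + 1, i + reach + 1):
                d = t - j
                taper = 0.5 + 0.5 * math.cos(math.pi * d / reach)   # Hann taper
                value += samples[j] * taper * math.sin(math.pi * d) / (math.pi * d)
            peak = max(peak, abs(value))
    return peak

def capped_gain_db(measured_lufs, target_lufs, peak_linear, ceiling_dbtp=-1.0):
    """Gain toward the loudness target, held back if the true peak would breach the ceiling."""
    wanted = target_lufs - measured_lufs
    headroom = ceiling_dbtp - 20.0 * math.log10(peak_linear)
    return min(wanted, headroom)

# A tone whose samples all land at about 0.707 while the wave between them reaches 1.0:
wave = [math.sin(math.pi / 2 * n + math.pi / 4) for n in range(200)]
```

For that test tone the tallest stored sample is about 0.707, but the estimated true peak comes out close to 1.0, roughly 3 dB higher. That gap is exactly what a sample-only peak meter would miss.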
How the big tools do it
- Nearly every broadcast and streaming workflow now runs on LUFS, and loudness rules are written into TV regulations in much of the world. Loudness meters are everywhere: the free Youlean Loudness Meter, Waves WLM, Nugen VisLM, and the meters built into Pro Tools, Logic, and Audition.
- iZotope RX and Ozone include a “Loudness” module that does exactly what cathar does — measure integrated LUFS, normalize to a target, respect a true-peak ceiling — with presets for every platform.
- FFmpeg’s loudnorm filter is the command-line workhorse the whole web uses for batch-normalizing video and podcast audio; it implements the same BS.1770 standard.
This is a corner where cathar is doing the exact same standardized maths as the professional tools — there’s no ML and no secret sauce in loudness, just a well-defined international measurement. If cathar says −16 LUFS, it means the same −16 that RX, FFmpeg, and a broadcast meter mean.
Sample rate and resampling
Back in chapter 1 we met the sample rate — how many times per second the computer measured the wave. Sometimes you need to change it: a podcast host records at 48,000 samples per second but the platform wants 44,100; an old clip is at 22,050 and you’re mixing it into a 48,000 project. Converting from one rate to another is resampling, and doing it well is subtler than it looks.
Why you can’t just drop or copy samples
The lazy way to go from 48,000 to 24,000 would be to throw away every other sample. The lazy way up would be to duplicate samples. Both wreck the sound, and the reason why is one of the most important rules in all of digital audio.
Remember the rule from chapter 1: to capture a pitch correctly, you need at least twice as many samples per second as that pitch’s frequency. The highest pitch a given rate can hold is therefore half the sample rate — a limit called the Nyquist frequency. At 48,000, you can hold pitches up to 24,000; at 24,000, only up to 12,000.
Now the trap. If you crudely halve the rate to 24,000 but the original still contained, say, a 15,000-cycle pitch — above the new 12,000 ceiling — that pitch doesn’t just disappear. It folds back down and reappears as a wrong, lower pitch, a ghostly tone that was never in the music. This folding is called aliasing, and it sounds like metallic, gritty, “digital” nastiness. (It’s the audio version of why wagon wheels seem to spin backwards in old films — too few “samples” per rotation.)
A fast wiggle, measured too rarely, masquerades as a SLOW one:

  the real (fast) wave:   /\  /\  /\  /\  /\  /\
  we only sample here:    ●          ●          ●
                           ╲        ╱ ╲        ╱
  so we "see" this:         ╲______╱   ╲______╱    ← a wrong, low tone
                                                     that was never there
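You can watch the fold happen with a few lines of Python. This sketch (helper names invented for the example) builds the 15,000-cycle tone at 48,000 samples per second, throws away every other sample, and compares what is left with a genuine 9,000-cycle tone at the new rate.

```python
import math

def alias_frequency(freq, sample_rate):
    """Where a pure tone of `freq` lands once sampled at `sample_rate`."""
    folded = freq % sample_rate
    return min(folded, sample_rate - folded)

def tone(freq, rate, count):
    return [math.sin(2 * math.pi * freq * n / rate) for n in range(count)]

original = tone(15000, 48000, 480)   # fine at 48,000: below the 24,000 ceiling
naive = original[::2]                # crude halving to 24,000, no filtering first
ghost = tone(9000, 24000, 240)       # a real 9,000-cycle tone at the new rate

# The crudely halved samples match the 9,000-cycle tone, just flipped upside down.
worst_mismatch = max(abs(a + b) for a, b in zip(naive, ghost))
```

The mismatch is zero apart from rounding: after the crude conversion the samples are indistinguishable from a 9,000-cycle tone, which is where 15,000 folds to when the ceiling is 12,000 (24,000 minus 15,000).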
The fix: filter, then convert
So correct downsampling has two parts: first remove every pitch above the new ceiling (so there’s nothing left to fold), then drop to the new rate. And to invent the new in-between sample values smoothly — whether converting up or down — you don’t copy the nearest old sample; you draw the ideal smooth curve through the existing samples and read the new values off it.
The “ideal smooth curve” has a known best shape (mathematicians call the perfect one a sinc function), and a good resampler uses a careful, tapered approximation of it — cathar uses a Kaiser-windowed sinc — that both interpolates cleanly and kills the aliasing in a single pass. You don’t need the maths; you need the moral: good resampling is a smart filter, not a copy-paste, which is why “just change the number” in cheap software can sound worse than the original.
Cathar’s resample --rate 44100 does this properly in both directions, with the
anti-alias filter tracking whichever rate is lower.
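For readers who want to see the shape of such a resampler, here is a compact and deliberately slow Python sketch of a Kaiser-windowed sinc. It illustrates the technique rather than cathar's actual code; the function names and the half_width and beta settings are assumptions chosen for the example.

```python
import math

def bessel_i0(x):
    """Zeroth-order modified Bessel function, summed from its power series."""
    total, term, k = 1.0, 1.0, 1
    while term > 1e-12 * total:
        term *= (x / (2.0 * k)) ** 2
        total += term
        k += 1
    return total

def kaiser(x, half_width, beta):
    """Kaiser taper centred on 0, falling to nothing at +/- half_width."""
    if abs(x) >= half_width:
        return 0.0
    r = x / half_width
    return bessel_i0(beta * math.sqrt(1.0 - r * r)) / bessel_i0(beta)

def resample(samples, src_rate, dst_rate, half_width=16, beta=8.0):
    ratio = dst_rate / src_rate
    cutoff = min(1.0, ratio)        # anti-alias cutoff follows whichever rate is lower
    reach = half_width / cutoff     # how far the curve looks, in source samples
    out = []
    for m in range(int(len(samples) * ratio)):
        t = m / ratio               # where this new sample falls among the old ones
        first = max(0, math.ceil(t - reach))
        last = min(len(samples) - 1, math.floor(t + reach))
        value = 0.0
        for j in range(first, last + 1):
            x = cutoff * (t - j)
            sinc = 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)
            value += samples[j] * cutoff * sinc * kaiser(x, half_width, beta)
        out.append(value)
    return out
```

Convert a 1,000-cycle tone from 48,000 to 44,100 and the output tracks a freshly generated 1,000-cycle tone at the new rate; convert a 15,000-cycle tone down to 24,000 and it comes out as near-silence instead of folding to 9,000. One simplification to be aware of: placing the cutoff exactly at the new ceiling lets a sliver of content just above it through, so production resamplers usually set it slightly lower.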
A cousin: bandwidth extension
The same family of ideas powers cathar’s enhance tool, which tackles the
opposite problem — sound that’s missing its highs (muffled, “telephone-y,”
because heavy MP3 compression or a low recording rate threw the top away). You
can’t recover what was deleted, but you can synthesize plausible new highs by
taking the texture of the existing upper range and extending it upward, so the
result sounds brighter and more open. It’s an educated fabrication, not a
recovery — useful for rescuing dull material, but it’s adding an informed guess,
not restoring lost detail.
How the big tools do it
- SoX (“Sound eXchange”), the venerable command-line audio swiss-army knife, has a famously high-quality rate effect; its resampler is a reference others are measured against.
- libsamplerate (a.k.a. “Secret Rabbit Code”) is the open-source resampling library quietly embedded in countless audio apps; r8brain and iZotope’s resamplers are studio-grade options.
- Every DAW resamples automatically when you drop a 44,100 file into a 48,000 project — usually invisibly and well.
Resampling, like loudness, is settled science: there’s a known-best approach, and the difference between tools is how closely they approximate it and how fast. Cathar’s Kaiser-windowed sinc is a solid, standard implementation in the same family as the references above — no model, no magic, just a well-built filter.
Stereo, channels, and phase
So far we’ve mostly imagined a single stream of samples — mono, one microphone’s worth of sound. But most recordings you meet are stereo: two streams, one for the left ear and one for the right. A couple of ideas about how those two streams relate will save you from some surprisingly common mistakes.
Two channels make a space
Your brain locates sounds partly by comparing what your two ears hear. A sound a little louder and a hair earlier in the left ear is heard as “over to the left.” Stereo recording recreates this: by capturing two slightly different versions of the scene, it lets your ears reconstruct a stereo image — a sense of width and placement, of instruments spread across a stage.
A mono file is just one channel; a player sends it equally to both speakers, so it sits dead centre. A stereo file is two channels, and the difference between them is what creates the width. Keep that word — difference — in mind; it’s the whole point of the next two sections.
A small trap: mono tagged as “left”
Here’s a real-world gotcha that bites people constantly. A mono file is supposed to play equally from both speakers. But the file format has a little label saying which speaker each channel belongs to, and if a tool mislabels a mono file’s one channel as “front-left” instead of “centre/mono,” some players will dutifully send it only to the left speaker — and you’ll swear something is broken, even though the sound itself is perfectly fine and centred.
The audio is balanced; only the label is wrong. (Cathar had exactly this bug once: its mono files were tagged “front-left” and played one-sided until the label was corrected to “centre.”) The lesson for you: if a mono file suddenly plays out of one speaker, suspect the channel label, not the audio — it’s a metadata problem, not a damaged recording.
Why phase matters when you process stereo
Now the subtle one. Suppose you run a reducer — say a denoiser — on a stereo file. The obvious approach is to clean the left channel and the right channel separately. The hidden danger: the tool might decide a faint pitch is “noise” in the left channel but “keep it” in the right, on the very same instant. Now the two channels disagree about that pitch — and remember, the stereo image is the difference between the channels. So the background, the room, the “air” of the recording starts to wander and smear between the speakers as the tool makes different choices left and right. Engineers call this losing phase coherence, and it makes a cleaned stereo recording sound oddly unstable and “swirly” even when each channel sounds fine on its own.
The cure is to make the decision once, jointly, and apply it to both channels
identically — so the channels always agree about what to keep and what to remove,
and the stereo image stays rock-solid. Cathar offers this as a phase-coherent
mode (denoise --coherent): it works out one cleaning decision from the combined
(“mid”) signal and applies that single decision to left and right together. The
image stops wandering.
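A toy Python sketch shows the difference between deciding separately and deciding once. The "cleaning decision" here is a crude stand-in (keep loud moments, turn quiet ones down); a real denoiser decides per pitch and per slice of time, and all the names below are invented for the example.

```python
def gate_gain(level, floor):
    """Toy cleaning decision: keep loud moments, turn quiet ones down."""
    return 1.0 if level >= floor else 0.25

def clean_independent(left, right, floor):
    """Each channel decides for itself, so the two can disagree."""
    return ([l * gate_gain(abs(l), floor) for l in left],
            [r * gate_gain(abs(r), floor) for r in right])

def clean_coherent(left, right, floor):
    """One decision from the combined (mid) signal, applied to both channels."""
    out_left, out_right = [], []
    for l, r in zip(left, right):
        mid = 0.5 * (l + r)
        gain = gate_gain(abs(mid), floor)
        out_left.append(l * gain)
        out_right.append(r * gain)
    return out_left, out_right

left = [0.30, 0.02, 0.50]
right = [0.10, 0.01, 0.40]
```

With a floor of 0.15, the independent version keeps the first left sample but turns the first right sample down, so the left-to-right balance at that instant jumps from 3:1 to 12:1 and the sound lurches sideways. The coherent version always scales both channels by the same amount, so the balance never changes.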
How the big tools do it
- Serious restoration and mastering tools are careful about stereo by default. iZotope RX processes with stereo coherence in mind and offers mid/side and linked-channel options throughout; mastering suites like Ozone and FabFilter plug-ins expose mid/side processing explicitly.
- The mid/side concept — treating a stereo signal as its “centre” (mid) and its “difference” (side) rather than as left/right — is a standard professional technique for exactly the reason above: it lets you process the shared centre and the stereo width separately and coherently.
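The mid/side conversion itself is only addition and subtraction, as this small Python sketch shows (function names are just for illustration). Halving on the way in is one common convention; it makes the round trip come back exactly.

```python
def to_mid_side(left, right):
    """Split stereo into what the channels share (mid) and how they differ (side)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def to_left_right(mid, side):
    """Rebuild ordinary left/right from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

If the two channels are identical, the side signal is all zeros: there is no difference, so there is no width. That is the earlier point about stereo in code form, and it is why processing applied to the mid alone leaves the width untouched.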
Stereo handling is one of those quiet quality markers that separates a tool that “works” from one that’s trustworthy on real material. The concepts — width lives in the difference, and processing should keep the channels agreeing — are the same whether you’re in cathar, RX, or a full mastering chain.
How cathar compares to the big tools
You now understand the concepts. This chapter steps back and places cathar — and the ideas in this book — next to the software the rest of the world uses, so you know which tool fits which job. The goal isn’t to crown a winner; it’s to make you a clear-eyed chooser.
The landscape, in plain terms
Audio software for cleanup falls into a few camps:
- The restoration specialist — iZotope RX. The industry standard for cleaning up dialogue, podcasts, music, and archival audio. Spectrogram-centred, deep, increasingly powered by machine learning. Expensive, and worth it for people who do this for a living.
- The all-rounders — Adobe Audition, and DAWs (Pro Tools, Logic, Ableton, Reaper, Cubase). Audio editors and studios that include restoration tools alongside everything else (recording, mixing, effects). Good, not always best-in-class for restoration.
- The free editor — Audacity. Free and open-source, with genuinely useful noise reduction, click removal, and filters. The place millions of people first clean up a recording.
- The command-line workhorses — SoX and FFmpeg. No window, no buttons — you type a command. Beloved for batch work and automation: converting, resampling, loudness-normalizing thousands of files. FFmpeg in particular quietly powers a huge fraction of the internet’s media processing.
- cathar. A small, open-source, command-line-and-library toolkit in pure Rust — squarely in the SoX/FFmpeg “workhorse” camp by form, but focused on restoration like RX by intent.
What makes cathar different
Three deliberate choices define it:
- Pure, self-contained, no dependencies on the usual giants. Most audio tools lean on big C/C++ libraries (often FFmpeg) under the hood. Cathar is written entirely in Rust and carries its own decoding, maths, and encoding — one build, one self-contained program, nothing to install alongside it.
- No black boxes. Every stage is plain, inspectable arithmetic — the exact methods this book describes. There are no trained neural-network weights making unexplainable decisions. If you don’t like a result, it’s a knob you can turn, with an understandable reason, rather than a model you have to re-roll and hope.
- One clear job per tool, scriptable. Like SoX, it’s built to be driven from the command line and dropped into automated pipelines — clean a thousand files the same way, reproducibly.
What that buys you — and what it costs
Be honest about both sides:
Where cathar holds its own. The settled-science tasks — loudness (LUFS / EBU R128 / true-peak), resampling, de-hum, de-essing, and steady-state hiss reduction — are well-defined problems with known-best classical methods, and cathar implements them properly. For these, it’s genuinely comparable to the big tools, with the bonus of being transparent and automatable. If your job is “batch-normalize 500 podcast episodes to −16 LUFS and notch out a 60-cycle hum,” cathar is exactly the right shape of tool.
Where the expensive tools pull ahead. The hard, perceptual tasks — heavy or non-steady noise (a busy café behind a voice), strong de-reverberation, fabric rustle tangled in speech, badly clipped material — are where machine learning has changed the game. iZotope RX’s learned models separate sound from noise in ways no “subtract the haze” or “gate the tails” rule can match. For professional film, broadcast, and archival restoration of difficult material, RX is the tool, and it isn’t close. Cathar’s classical methods give a real, useful improvement on moderate problems; they are not a substitute for a trained model on the worst cases.
A cheat-sheet for choosing
| If you need to… | Reach for |
|---|---|
| Clean difficult dialogue for film/broadcast | iZotope RX |
| Clean up a recording inside a project you’re already editing | your DAW or Audition |
| Quickly de-noise/de-click one file, for free, with a GUI | Audacity |
| Batch-convert, resample, or loudness-normalize many files | FFmpeg / SoX / cathar |
| Batch restoration (de-hum, de-noise, de-ess, loudness) in a script or pipeline, transparently, in pure Rust | cathar |
| Understand, embed, or extend the actual DSP in your own program | cathar (it’s a library too) |
The real takeaway
The most valuable thing this book gives you isn’t a verdict on cathar — it’s that every one of these tools runs on the same handful of ideas. Analyse into pitches, modify, resynthesise. Subtract the haze. Notch the hum. Redraw the click. Predict the clipped peak. Gate the reverb tail. Measure loudness the way ears hear. Filter, don’t copy, when you resample. Keep the stereo channels agreeing.
Once those ideas are yours, no audio program is a black box anymore — including the ones that cost a fortune. You’ll open RX or Audacity or a DAW, see a panel of knobs, and know what they must be doing, because there are only so many honest ways to clean up a sound. That understanding outlasts any one tool.
Glossary in plain language
Every term you met in this book, defined the way a friend would explain it.
Aliasing — The ghostly, gritty wrong-pitch tones you get when audio is converted to a lower sample rate without first removing the pitches that are too high for the new rate to hold. The audio version of wagon wheels spinning backwards in films.
Bit depth — How finely each single sample is written down (16-bit for CDs, 24-bit for studios). More bits = a quieter background fuzz floor.
Clipping — Distortion caused by a signal trying to go louder than the maximum the system can store, so its peaks get chopped flat. Sounds harsh and “broken.”
DAW (Digital Audio Workstation) — A full audio-editing program like Pro Tools, Logic, Ableton, Reaper, or Audition. The “Photoshop of sound.”
de- (prefix) — Just means “remove”: de-noise, de-hum, de-click.
DSP (Digital Signal Processing) — The umbrella term for doing maths on digital audio (or any signal) to change it: filtering, denoising, all of it.
EBU R128 — The European broadcast loudness standard built on BS.1770; the reason broadcast audio targets −23 LUFS.
FFT (Fast Fourier Transform) — The fast machine that takes a chunk of sound and reads off its “recipe” of pitches. The workhorse behind the frequency view.
Filter — A tool that turns some pitches up or down. A high-pass filter keeps the highs and blocks the lows; a low-pass does the reverse; a notch removes one narrow band.
Frequency — How fast the wave wiggles; what you hear as pitch. Measured in cycles per second, or hertz (Hz). 1,000 Hz = 1 kHz.
Harmonics — Faint copies of a tone at exact whole-number multiples of its pitch. Why hum “buzzes” instead of being a pure tone.
Hum — Low, steady tone leaking in from the electrical mains (50 or 60 cycles per second, plus harmonics).
LUFS — “Loudness Units relative to Full Scale.” The modern unit for perceived loudness — two files at the same LUFS sound equally loud. Targets: −23 broadcast, −16 podcast, −14 streaming.
Mono — A single channel of audio; plays equally from both speakers.
Noise / hiss — Steady, random background energy spread across the high frequencies — the shhhh behind a recording.
Noiseprint / noise profile — A measurement of the recipe of a recording’s background noise, learned from a quiet patch, so a denoiser knows exactly what to subtract.
Normalization — Setting a recording to a target level. Peak normalization aims at the tallest sample (crude); loudness (LUFS) normalization aims at how loud it actually sounds (correct for delivery).
Nyquist frequency — The highest pitch a given sample rate can hold: exactly half the sample rate. Go above it and you get aliasing.
Overlap-add — The careful blending technique that glues the processed short slices of audio back into one seamless waveform.
Phase coherence — Keeping a stereo file’s two channels “agreeing” when you process them, so the stereo image stays stable instead of wandering.
Plosive — The low thump on “p” and “b” sounds when a puff of breath hits the mic.
Resampling — Converting audio from one sample rate to another (e.g. 48,000 → 44,100). Done well, it’s a smart filter, not a copy.
Reverb — The trail of fading echoes a room adds as sound bounces off its surfaces. Makes recordings sound “roomy” or “boxy.”
Rustle — Scratchy mid-range noise from clothing brushing a clip-on (lavalier) microphone.
Sample — One single measurement of the wave’s height. Audio is a long list of these.
Sample rate — How many samples are taken per second (44,100 for CD, 48,000 for video/pro). Higher = can capture higher pitches.
Sibilance — Over-loud, piercing “s,” “sh,” and “ch” sounds; removed by de-essing.
Spectral subtraction — The core denoising method: measure the background haze at each pitch and subtract that amount.
Spectrogram — A heat-map picture of sound: time left-to-right, pitch bottom-to-top, brightness = how much of each pitch is present. Where most restoration tools “see.”
Stereo — Two channels (left and right) whose difference creates a sense of width and placement.
STFT (Short-Time Fourier Transform) — Taking an FFT of many short, overlapping slices in a row, to track how a sound’s pitches change over time. The engine behind the spectrogram.
Threshold — A “how much counts” cutoff: how loud a spike must be to count as a click, or how loud sibilance must get before a de-esser reacts.
True peak (inter-sample peak) — A hidden overshoot in the smooth curve drawn between samples on playback, which can distort even when no stored sample looked too loud. Why loudness tools keep a true-peak safety ceiling (e.g. −1 dBTP).
Waveform — The wiggly line of the wave’s height over time. Great for seeing how loud, poor for seeing what’s in it.
Wiener filter — A gentler denoising method: instead of subtracting the haze, scale each pitch by how likely it is to be real sound versus noise.
Window (Hann window) — The gentle taper applied to each short slice of audio before its FFT, so the slices blend together without clicks at the seams.