Sometimes people in the audio world use "frame" to mean "set of PCM samples per channel". So a typical stereo audio signal at 44100hz would be 44100 "frames" per second, even though it's really 88200 samples per second. The author seems to be using it that way, but it's confusing because he's also talking about sliding a FFT window over by frames.
> If the raw data is just measures of amplitude at a set frequency
LPCM is literally amplitude samples at a rate. Thinking of the sample rate as a 'set frequency' will lead to confusion (even though it obviously is a frequency). When you're thinking of samples as sequential amplitudes, you're thinking in the "time domain". When you're thinking in terms of the signal's oscillations, that's the "frequency domain". The Fourier transform is how you convert from the time domain to the frequency domain.
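A quick numpy sketch of that conversion, if it helps (the 440hz tone and the 1-second buffer are just example choices, not anything from the article):

```python
import numpy as np

rate = 44100                       # example sample rate
t = np.arange(rate) / rate         # 1 second of time axis
signal = np.sin(2 * np.pi * 440 * t)  # time domain: one amplitude per sample

spectrum = np.abs(np.fft.rfft(signal))        # frequency domain: energy per bin
freqs = np.fft.rfftfreq(len(signal), d=1 / rate)

print(freqs[np.argmax(spectrum)])  # -> 440.0 (the peak lands on the tone)
```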
> don't you have to pick an arbitrary time slice to examine (and so risk losing lower-frequency sounds)?
You need at least 2 samples to make an audible frequency. If you only had 1 sample, you wouldn't hear anything, because nothing would be moving. So at 44100hz of sampling frequency, you can capture 0hz to 22050hz of audio frequency. That's called the Nyquist frequency, and it's always half of the sample rate.
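You can see the Nyquist limit in action with a few lines of numpy. This is just a sketch: the 23000hz tone is an arbitrary example of a frequency above Nyquist, and sampling it folds it back below 22050hz:

```python
import numpy as np

rate = 44100
nyquist = rate / 2  # 22050.0: the highest audio frequency the rate can capture

t = np.arange(rate) / rate
# A 23000 Hz tone is above Nyquist, so sampling aliases (folds) it:
above = np.sin(2 * np.pi * 23000 * t)
freqs = np.fft.rfftfreq(rate, d=1 / rate)
peak = freqs[np.argmax(np.abs(np.fft.rfft(above)))]
print(peak)  # -> 21100.0, i.e. 44100 - 23000, mirrored about 22050
```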
> You need at least 2 samples to make an audible frequency.
That's not strictly true. An audio file with a single non-zero sample (usually set to full amplitude) is often used for testing -- usually called a Dirac impulse or similar.
That impulse will be (necessarily) band-passed by the playback hardware and put out filtered "white" noise.
That impulse can be recovered by a mic to show e.g., pre-ringing caused by (FIR) filters. An FFT of that impulse will show the playback hardware's response in the frequency domain vs full bandwidth.
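For what it's worth, the "full bandwidth" property is easy to check numerically. A single non-zero sample has a perfectly flat magnitude spectrum (buffer length here is an arbitrary choice):

```python
import numpy as np

n = 1024  # one short test buffer
impulse = np.zeros(n)
impulse[0] = 1.0  # a single full-amplitude sample: a discrete Dirac impulse

magnitudes = np.abs(np.fft.rfft(impulse))
# Every frequency bin carries identical energy, so whatever spectrum you
# measure after playback is the hardware's own frequency response.
print(magnitudes.min(), magnitudes.max())  # -> 1.0 1.0
```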
> Thinking of the sample rate as a 'set frequency'
For e.g., a WAV file, that's a fixed number of samples per second (a frame being 1 sample x n channels). That is a set frequency, and deviating from it will alter the pitch of the music.
There really is no case where sample rate varies, unless we're talking about minute variations between the clock signals of different hardware, which requires the use of sample rate conversion to match.
A related concept is the bits-per-second of lossy formats (e.g., AAC) which may vary from frame to frame (and that frame will mean something different from a WAV frame).
> You need at least 2 samples to make an audible frequency.
I think you're confusing this with Nyquist being 1/2 of the sampling frequency. You can very much capture an audible signal with a single sample, but that signal will be limited (by hardware, by Nyquist, etc).
[Edit]
I should say that this single sample has to be non-zero and the playback system has to have a DC-offset that isn't equal to that sample's amplitude.
> That's not strictly true. An audio file with a single non-zero sample (usually set to full amplitude) is often used for testing -- usually called a Dirac impulse or similar.
Is this a PCM type sample or a frequency-domain sample? If the former, how frequently does this impulse get repeated in order to turn into white noise after going through the playback hardware? It sounds like if it's not repeated it should just make a nasty 'pop'.
> I should say that this single sample has to be non-zero and the playback system has to have a DC-offset that isn't equal to that sample's amplitude.
As I understand it, if you try to play a PCM audio file with a uniform value, you're effectively putting DC through the speakers, driving them to a particular offset where they'll stay until the end of the track. Is that not the case?
I don't think your comment is in conflict with the parent comment.
Yes, a single sample file can produce a sound — a tiny impulse spike "tick" — but that sound doesn't have any audible frequency or pitch because there's no oscillation or tone.
I think my confusion is around the article's 'frames' and your use of a FFT 'window' - again, what size should be used for the frame/window?
If you have just two samples of audio at 40khz, I understand that the max frequency you can capture is 20khz. But, given just two samples, I can't see how you can separate out the multiple lower frequencies that could be present. To do so, you would need more samples (i.e. for a longer period of time, not a higher frequency of samples). So my question is, how many samples do you pick to do the FFT on?
Oh, I see what you're asking now. This might get a bit mathy, so I won't be offended if I get something wrong and someone corrects me :)
I think where you're lost is that the FFT sample size determines the number (and therefore size) of the bins, while Nyquist determines the maximum frequency that can be binned.
If you have a window of 8192 samples, for example, that gets you 4096 bins. If you're sampling at 44100hz, those 4096 bins get you bin sizes of around ~5.38hz per bin. So the lowest frequency range you can identify would be 0-5.38hz, then 5.38hz-10.76hz, etc, up until 22044hz-22050hz. And if you're sliding a window of 8192 samples at a sample rate of 44100, that's about 186ms per FFT window. That means you can draw a spectrum with a resolution of 4096 bins per 186ms of audio. If you want to plot a spectrum with finer time resolution, you have to give up some frequency resolution by using a smaller window, which means fewer, wider bins.
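The arithmetic, spelled out (same 8192-sample window and 44100hz rate as above):

```python
# Back-of-envelope check of the bin-size numbers.
rate = 44100
window = 8192

bins = window // 2                 # 4096 usable frequency bins
bin_width = rate / window          # ~5.38 Hz per bin
window_ms = window / rate * 1000   # ~185.8 ms of audio per FFT window

print(bins, round(bin_width, 2), round(window_ms, 1))  # -> 4096 5.38 185.8
```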
The actual results vary by FFT implementation, and it's common to have windowing inside the FFT that I don't really understand, but has something to do with the accuracy of those bins by preventing leakage of one bin into another. "Hamming Window" is probably the most common and usually happens by default, but that's a different window than the window you're sliding through the time domain to take FFTs on and plot.
As a user of FFT, at least for audio stuff (RF might be different), you mostly just think in terms of bin size.
> If you're sampling at 44100hz, those 4096 bins gets you bin sizes of around ~5.38Hz per bin. So your lowest frequency range that you can identify would be 0-5.38hz, then 5.38hz-10.76hz, etc, up until 22044hz-22050hz.
I was going to say that the bins are distributed logarithmically, so they aren't uniformly-sized like this. But I did some research and I guess I was wrong and they are uniformly sized, so I learned something.
> it's common to have windowing inside the FFT that I don't really understand
The FFT turns a signal from the time domain to the frequency domain. To do that, the math assumes that the signal is unchanged for all time. In other words, it treats that chunk of samples you give it as looping infinitely backwards and forwards in time.
But the set of samples you gave it are a segment from a signal that does change over time. So when you loop it, you'll get discontinuities.
For example, let's say your signal is a single sine wave whose period is twelve samples:
But that loop introduces a sharp discontinuity. The FFT doesn't realize that discontinuity is not part of the original signal, so it will go ahead and analyze it. In order to produce a jump like that, you need a lot of high-frequency components, so the analysis will give you all of these extra high-frequency results that aren't part of the original signal but are merely artifacts of chopping the signal into pieces.
Windowing basically fades out the edges of each segment to reduce those discontinuities so that you don't get the artifacts in the results. There are a bunch of different ways to do it because they're all sort of hacks that balance suppressing bogus artifacts against not masking actual signal that happens to occur near the edge of the segment.
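You can demonstrate the effect with numpy. A sketch, assuming a 1000hz tone that doesn't complete a whole number of cycles in a 1024-sample chunk (both numbers picked just so the looped chunk has a discontinuity), and a Hann window as the fade:

```python
import numpy as np

rate = 44100
n = 1024
t = np.arange(n) / rate
# 1000 Hz doesn't fit a whole number of cycles into 1024 samples at 44100 Hz,
# so treating the chunk as a loop introduces a discontinuity.
tone = np.sin(2 * np.pi * 1000 * t)

raw = np.abs(np.fft.rfft(tone))                     # no fade: leakage everywhere
windowed = np.abs(np.fft.rfft(tone * np.hanning(n)))  # edges faded out

# Measure energy far away from the tone's bin -- that's the bogus leakage:
peak = np.argmax(raw)
leak_raw = raw[peak + 50:].sum()
leak_win = windowed[peak + 50:].sum()
print(leak_win < leak_raw)  # -> True: windowing suppresses the artifacts
```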
> I was going to say that the bins are distributed logarithmically, so they aren't uniformly-sized like this. But I did some research and I guess I was wrong and they are uniformly sized, so I learned something.
I thought that too until I looked it up. It's because we're used to seeing spectra drawn on a logarithmic scale on spectrum analyzers.
Btw, those are some neat ascii graphs. What tool do you use for that?