How to extract audio from a video

You have a video file and you want the audio from it. Maybe it’s a conference talk you recorded on your phone and you want a podcast version. Maybe you filmed a live performance and the audio is the part worth keeping. Maybe you’re pulling dialogue from a video clip for a transcription service that only accepts audio files. Whatever the reason, the process is the same: strip the video track and keep the sound.

This guide covers how audio extraction works, how to pick the right output format and quality settings, what to expect in terms of file size, and how to do it with MakeMySounds’s video-to-audio extractor.

What happens during extraction

A video file is a container — MP4, MOV, WebM, MKV — that holds separate streams for video and audio (and sometimes subtitles, chapters, or metadata). The audio stream is already encoded inside the container, typically as AAC in MP4/MOV files, Opus in WebM files, or various codecs in MKV files.
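To see which audio codec a given video actually carries, you can ask ffprobe (it ships with ffmpeg). The sketch below is only an illustration: the helper names are hypothetical, the file name is a placeholder, and it assumes ffprobe is installed and on your PATH.

```python
import json
import subprocess


def ffprobe_audio_command(path: str) -> list[str]:
    """Build an ffprobe call that lists the audio streams in a file as JSON."""
    return [
        "ffprobe",
        "-v", "error",            # suppress the banner and info-level logging
        "-select_streams", "a",   # audio streams only
        "-show_entries", "stream=index,codec_name,channels,sample_rate,bit_rate",
        "-of", "json",
        path,
    ]


def parse_audio_streams(ffprobe_json: str) -> list[dict]:
    """Return the 'streams' array from ffprobe's JSON output (empty if absent)."""
    return json.loads(ffprobe_json).get("streams", [])


def probe_audio(path: str) -> list[dict]:
    """Run ffprobe against a real file. Requires ffprobe on PATH."""
    result = subprocess.run(
        ffprobe_audio_command(path), capture_output=True, text=True, check=True
    )
    return parse_audio_streams(result.stdout)
```

Calling probe_audio("talk.mp4") on a typical phone recording would usually report a single stream whose codec_name is "aac", but the exact fields present depend on the file.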

Extraction can work two ways. The fast path is stream copying: pull the existing audio stream out of the container without re-encoding it. This is lossless and nearly instant because no decoding or encoding happens — the audio bytes pass through unchanged. The catch is that the output format is whatever the video already contained. If the video has AAC audio and you want AAC output, stream copying works perfectly. If you want MP3 instead, the audio needs to be decoded and re-encoded into MP3, which takes longer and introduces a generation of lossy compression.

MakeMySounds’s extractor handles both cases. When you choose an output format that matches the source audio codec, the server stream-copies where possible. When you choose a different format, it re-encodes cleanly via ffmpeg.
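The copy-or-re-encode decision can be sketched as a small function that builds an ffmpeg command line. This is not MakeMySounds’s actual server code, which is not shown here; the function name, the lookup tables, and the choice of .m4a as the container for AAC are assumptions made for illustration. The source_codec argument is the codec name as a probing tool such as ffprobe would report it.

```python
# Encoder ffmpeg would use when re-encoding to each target format.
ENCODERS = {
    "mp3": "libmp3lame",
    "aac": "aac",
    "ogg": "libvorbis",
    "flac": "flac",
    "wav": "pcm_s16le",
}
# Output file extension per target format (AAC goes into an .m4a container here).
EXTENSIONS = {"mp3": ".mp3", "aac": ".m4a", "ogg": ".ogg", "flac": ".flac", "wav": ".wav"}
# Source codec name that makes a stream copy possible for each target.
COPYABLE_FROM = {"mp3": "mp3", "aac": "aac", "ogg": "vorbis", "flac": "flac"}
LOSSY = {"mp3", "aac", "ogg"}


def build_extract_command(src, dst_stem, target, source_codec, bitrate_kbps=192):
    """Return an ffmpeg argument list that extracts the first audio stream."""
    cmd = ["ffmpeg", "-i", src, "-vn", "-map", "0:a:0"]  # drop video, keep audio 0
    if COPYABLE_FROM.get(target) == source_codec:
        # Fast path: pass the encoded audio through untouched.
        cmd += ["-c:a", "copy"]
    else:
        # Slow path: decode and re-encode into the requested format.
        cmd += ["-c:a", ENCODERS[target]]
        if target in LOSSY:
            cmd += ["-b:a", f"{bitrate_kbps}k"]
    cmd.append(dst_stem + EXTENSIONS[target])
    return cmd
```

With an AAC source, asking for "aac" yields a command containing -c:a copy, while asking for "mp3" yields one that re-encodes with libmp3lame at the chosen bitrate. The list can be passed to subprocess.run if ffmpeg is installed.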

Choosing an output format

The right format depends on what you plan to do with the audio.

MP3: The safe default. Plays everywhere: phones, cars, old media players, podcast apps, every browser. Good for spoken word, sharing with others, or uploading to platforms that expect MP3. Use 128 kbps for voice content, 256–320 kbps for music.
WAV: Uncompressed PCM audio. Choose this when you need to edit the audio afterward in a DAW (Audacity, Logic, Ableton, etc.) and want to avoid any lossy re-encoding. The files are large, roughly 10 MB per minute of CD-quality stereo audio (44.1 kHz, 16-bit), but the format itself adds no loss.
FLAC: Lossless compression. Same quality as WAV but roughly half the file size. Good for archiving recordings you might edit later. Not universally supported on older devices, but all modern players handle it fine.
OGG Vorbis: Open-source lossy format. Similar quality to MP3 at the same bitrate, sometimes slightly better. Works well if your target is web playback or a platform that supports it, but MP3 is a safer bet for general compatibility.
AAC: The codec most MP4 videos already use for audio. If the source video is MP4 and you just want the audio track as-is, AAC extraction is the fastest option since it can stream copy without re-encoding. Also a solid lossy format in its own right, slightly more efficient than MP3 at the same bitrate.

Quality settings and bitrate

When you extract to a lossy format (MP3, OGG, AAC), you pick a bitrate that controls the quality-to-size tradeoff. The rules are the same as compressing any audio file:

64–96 kbps: Voice only. Thin and audibly compressed on music, but perfectly fine for speech: meetings, lectures, dictation, phone recordings.
128 kbps: Good enough for most purposes. Podcasts, interviews, casual music sharing. On typical earbuds or laptop speakers, most people won’t notice artifacts.
192–256 kbps: Clean enough for music on decent headphones. A safe choice when you’re not sure about the listener’s setup.
320 kbps: Maximum quality for MP3. Virtually transparent. Use this when extracting music from a concert recording or any video where audio quality matters.

For lossless formats (WAV, FLAC), there’s no bitrate choice because nothing gets discarded. The quality matches whatever was in the video file. If the source video had 128 kbps AAC, extracting to WAV gives you a bigger file but not better audio — those frequencies were already thrown away when the video was encoded. You only benefit from lossless extraction when the source audio is high quality to begin with.

File size expectations

Knowing roughly how big the output will be helps you pick the right settings before waiting for the extraction to finish.

For a 10-minute video:

MP3 at 128 kbps: ~9.4 MB
MP3 at 320 kbps: ~23.4 MB
AAC at 128 kbps: ~9.4 MB
FLAC: ~40–60 MB (depends on audio complexity; sparse speech can come in lower)
WAV: ~100 MB (44.1 kHz, 16-bit stereo)

The formula for lossy formats is straightforward: bitrate in kbps × duration in seconds ÷ 8 = file size in kilobytes. Divide by 1,024 to get megabytes, which is how the figures above were calculated. A 60-minute lecture at 128 kbps MP3 comes out to about 56 MB. The same lecture as 44.1 kHz, 16-bit stereo WAV would be around 600 MB.
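The arithmetic is easy to check in a few lines of Python. This is a rough estimator only: it ignores container overhead and tags, it treats 1 MB as 1,024 KB, and the WAV defaults (44.1 kHz, 16-bit, stereo) are assumptions that may not match your source.

```python
def lossy_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Estimated size of a constant-bitrate lossy file, in megabytes."""
    kilobytes = bitrate_kbps * duration_s / 8   # kilobits -> kilobytes
    return kilobytes / 1024                     # kilobytes -> megabytes


def wav_size_mb(duration_s: float, sample_rate: int = 44100,
                channels: int = 2, bits_per_sample: int = 16) -> float:
    """Estimated size of uncompressed PCM audio, in megabytes."""
    bytes_per_second = sample_rate * channels * (bits_per_sample // 8)
    return bytes_per_second * duration_s / (1024 * 1024)


ten_minutes = 10 * 60
one_hour = 60 * 60

mp3_128 = lossy_size_mb(128, ten_minutes)      # about 9.4
mp3_320 = lossy_size_mb(320, ten_minutes)      # about 23.4
wav_10 = wav_size_mb(ten_minutes)              # about 101
lecture_mp3 = lossy_size_mb(128, one_hour)     # about 56
lecture_wav = wav_size_mb(one_hour)            # about 606
```

Variable-bitrate encodes will land somewhat above or below these estimates depending on the material.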

Common use cases

Conference talks and lectures

You recorded a presentation on your phone or laptop webcam. The video is shaky and the slides are unreadable, but the speaker sounds fine. Extract as MP3 at 128 kbps mono. The file will be small enough to email and clear enough for every word. If you plan to edit the recording (trimming dead air, cutting Q&A), extract as WAV first, edit in Audacity or another DAW, then export to MP3 from there.

Live music recordings

Phone recordings of live shows, rehearsals, or jam sessions. The audio is the point — the video is just a phone propped against a water bottle. Extract at 320 kbps MP3 or, better yet, FLAC if you want to keep full quality for later editing. The original audio in the video is typically AAC at 128–256 kbps (phone cameras vary), so extracting at a higher lossy bitrate won’t improve quality — it just avoids adding more compression on top of what’s already there.

Transcription prep

Most transcription services (automated and human) accept audio but not video. Even when they do accept video, uploading a 2 GB video file when only the 50 MB audio track matters wastes time and bandwidth. Extract as MP3 at 128 kbps — that’s more than enough for speech recognition. Some services work better with WAV; if accuracy matters more than upload speed, use WAV.

Podcast episodes from video interviews

You recorded a video podcast and want to distribute the audio version separately. Extract as MP3 at 128 kbps stereo (or 96 kbps mono if it’s a single speaker and file size matters for your RSS feed). Then use the audio cutter to trim intros and outros, and the loudness normalizer to hit -16 LUFS for Apple Podcasts compliance.
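If you would rather script this step yourself, a similar result can be approximated with ffmpeg’s loudnorm filter. The sketch below only builds the command; the function name is hypothetical, the true-peak (TP) and loudness-range (LRA) values are assumptions rather than anything the tools above are known to use, and single-pass loudnorm is less precise than a two-pass measurement.

```python
def podcast_command(src, dst, bitrate_kbps=128, mono=False, target_lufs=-16.0):
    """Build an ffmpeg command: video in, loudness-normalized MP3 out."""
    cmd = ["ffmpeg", "-i", src, "-vn"]          # -vn drops the video stream
    if mono:
        cmd += ["-ac", "1"]                     # downmix to a single channel
    cmd += [
        # EBU R128 loudness normalization toward the target integrated loudness.
        "-af", f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
        # loudnorm resamples internally, so pin the output sample rate.
        "-ar", "44100",
        "-c:a", "libmp3lame",
        "-b:a", f"{bitrate_kbps}k",
        dst,
    ]
    return cmd


# Single-speaker episode: 96 kbps mono, normalized toward -16 LUFS.
solo = podcast_command("interview.mp4", "episode.mp3", bitrate_kbps=96, mono=True)
```

The resulting list can be handed to subprocess.run on a machine with ffmpeg installed.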

Sample ripping

You found a sound effect or music clip in a video and want to use it in a project. Extract the full audio, then use the audio cutter to isolate the exact segment you need. For sample work, extract as WAV so you’re working with uncompressed audio that won’t accumulate artifacts through further editing.
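For readers who prefer the command line, cutting a segment out of an already extracted WAV file looks roughly like this with ffmpeg. The helper name and file names are placeholders, and the start and end times are whatever you measured by ear or in an editor.

```python
def cut_sample_command(src, dst, start_s, end_s):
    """Build an ffmpeg command that copies [start_s, end_s) of src into a WAV file."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    return [
        "ffmpeg", "-i", src,
        "-ss", f"{start_s:.3f}",           # where the segment begins, in seconds
        "-t", f"{end_s - start_s:.3f}",    # how long the segment lasts, in seconds
        "-c:a", "pcm_s16le",               # keep the result uncompressed 16-bit PCM
        dst,
    ]


# A 2.5-second hit that starts 12.5 seconds into the extracted audio.
cmd = cut_sample_command("full.wav", "hit.wav", 12.5, 15.0)
```

Placing -ss after -i makes ffmpeg decode up to the start point, which is slower on long files but sample-accurate for short cuts like this.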

Step-by-step: extracting audio with MakeMySounds

The video-to-audio extractor runs server-side via ffmpeg, so it handles virtually every video format.

Go to the video-to-audio page and drop your video file (MP4, MOV, WebM, or MKV).
Choose your target audio format. MP3 for broad compatibility, WAV or FLAC if you need lossless, AAC for compact files that match what most videos already contain.
If you picked a lossy format, select a bitrate. The default of 192 kbps is a good middle ground. Bump to 320 for music, drop to 128 for speech.
Click extract. The file uploads to the server, ffmpeg processes it, and the audio file downloads automatically. Processing time depends on the video length and whether re-encoding is needed — stream copying takes seconds, re-encoding a long video might take a minute.

One thing to note: this tool uploads the video to our server for processing. Video files can’t be reliably demuxed in the browser, so server-side processing is necessary here. The file is deleted after processing completes — it’s not stored.

Tips for better results

Check the source quality first. If the video was recorded on a phone at 128 kbps AAC, extracting at 320 kbps MP3 will not make it sound better than the source. A higher bitrate cannot restore detail that was never recorded; it only limits further loss and makes the file bigger. Match the source bitrate or go modestly above it, but don’t expect miracles from low-quality source material.
Extract lossless for editing, lossy for distribution. If you’re going to trim, normalize, or otherwise process the audio, extract as WAV first. Do all your edits, then export to MP3 as the final step. Each round of lossy encoding degrades the audio slightly, so you want to encode to the delivery format exactly once.
Trim after extracting, not before. Don’t try to trim the video first just to make the extraction faster. Extract the full audio, then use the cutter or server-side trimmer to cut it down. Audio files are much smaller than video, so trimming them is faster than trimming video.
Mono for single-speaker voice. If the audio is one person talking, mono needs roughly half the bitrate of stereo for comparable quality, so a 64 kbps mono file is about half the size of a 128 kbps stereo one with no perceptible difference. Most video recordings capture voice in effectively mono anyway.

When extraction isn’t enough

Sometimes the extracted audio needs further work: background noise from the camera’s location, uneven volume levels, or long silences before and after the content. MakeMySounds has tools for each of these:

Silence remover: strips dead air from the beginning, end, and middle of the recording.
Volume normalizer: brings the overall level to a consistent target without clipping.
Loudness normalizer: EBU R128 compliant loudness normalization for podcast and streaming platforms.
Fade tool: adds clean fade-ins and fade-outs to avoid abrupt starts and endings.

Extract first, then chain the tools you need. Starting from the extracted audio keeps each step simple and avoids re-processing the full video every time you want to adjust something.
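As a rough picture of what chaining means, here is how two of these steps might look as separate ffmpeg passes over an extracted WAV file, each reading the previous step’s output. This is a sketch, not how the MakeMySounds tools are implemented: the function names are hypothetical, and the -50 dB threshold, 1-second gap, and 2-second fade are assumed values you would tune by ear.

```python
def remove_silence_command(src, dst, threshold_db=-50, min_gap_s=1.0):
    """Build an ffmpeg command that strips leading silence and long quiet gaps."""
    audio_filter = (
        f"silenceremove=start_periods=1:start_threshold={threshold_db}dB:"
        f"stop_periods=-1:stop_duration={min_gap_s}:stop_threshold={threshold_db}dB"
    )
    return ["ffmpeg", "-i", src, "-af", audio_filter, dst]


def fade_command(src, dst, duration_s, fade_s=2.0):
    """Build an ffmpeg command that fades in at the start and out at the end.

    duration_s must be the length of src itself, measured after any silence
    removal, because the fade-out start is computed from it.
    """
    fade_out_start = max(duration_s - fade_s, 0.0)
    audio_filter = (
        f"afade=t=in:st=0:d={fade_s},"
        f"afade=t=out:st={fade_out_start}:d={fade_s}"
    )
    return ["ffmpeg", "-i", src, "-af", audio_filter, dst]


# Step 1 writes an intermediate WAV; step 2 reads it. Staying in WAV between
# steps avoids stacking lossy encodes.
pipeline = [
    remove_silence_command("extracted.wav", "trimmed.wav"),
    fade_command("trimmed.wav", "final.wav", duration_s=60.0),
]
```

The 60-second duration in the example is a placeholder; in practice you would measure the trimmed file before building the fade command.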