Waveform Studio Workbench


Table of Contents

nGene Waveform Studio

Development Consultation

  1. Media Format and Codec Overview
  2. Meta Information Extraction (Audio and Video)
  3. Design and UX Improvements for Desktop

Analytical consultation for nWS v3.3.5 waveform processing (Written November 13, 2025)

Heart Sound Analysis with Audio-Only Data and Synthetic Recordings (Written November 14, 2025)


Script


Meta Information

Python Script for BPM & Tempo Extraction from Multiple M4A Files (Written May 18, 2025)

Python Script for BPM & Tempo Extraction from Multiple Media Files (Written June 21, 2025)


Mathematical Models

Summing Audio Tracks in Logic Pro (Written May 31, 2025)

Digital waveform amplitude & bidirectional dynamics (Written May 31, 2025)

Perceptual loudness normalization for multitrack mixing (Written June 7, 2025)

Bit depth and sample rate in digital audio (Written June 7, 2025)

Logarithmic perception of pitch and loudness in human hearing (Written June 7, 2025)

The mathematical foundations of musical harmony (Written June 8, 2025)


Waveform Analysis of Sound by Mikio Tohyama

[Chapter 2] Discrete sequences and their Fourier transform (Written January 25, 2026)



Guide to nGene Waveform Studio v 3.3.5

Topic Details
Purpose Two-column HTML5 studio for audio/video playback, live signal visualization, lightweight tempo analysis, and simple source–mixture experiments.
Pure vanilla JS; SVG-only waveforms; no frameworks or Canvas.
New in v 3.3.5 line (3.3.1–3.3.5): Trim (loop-based clip creation with auto-download), matrix-based Mix of the last two items, and ICA separation of stereo mixes into two mono sources.
Layout Left column: Player (seek/loop, playhead cursor, volume, speed, transport controls, playlist, uploads).
Right column: Trim · Mix · ICA toolbar, Tempo details panel, and Signal views (Overview, Mid, Micro, and band rows: Low/Mid/High).
File locations Place nws.html anywhere.
Primary playlist: ./playlist.json (same folder as nws.html).
Legacy/fallback playlist and tempo meta: optional sibling folder /media/ containing playlist.json and tempo_meta.json. Files should be world-readable (e.g., chmod 644 *).
Playlist On load, the player first attempts ./playlist.json (array of media entries, order preserved); if unavailable, a legacy /media/playlist.json is attempted.
Absent JSON → starts empty and awaits uploads (drag-&-drop or picker). Uploaded files are referenced via blob-URLs only (no disk writes).
Playlist ordering Each row contains a dedicated button that sends that item directly to the bottom of the playlist while preserving the order of all others.
The currently selected row remains highlighted; index bookkeeping is adjusted so that the audible selection is preserved when possible.
Trim Trim cuts the current loop range of the selected item into a new media item and appends it to the playlist, then immediately plays it.
Audio items: decoded into an AudioBuffer, sliced in the loop interval, given short fade-in/fade-out ramps, encoded as 16-bit PCM WAV, and added as a new playlist entry.
Video items: preferred path uses MediaRecorder on a captureStream() of the element over the loop range, targeting MP4 when supported and falling back to WebM; a pure audio WAV fallback is used when capturing A/V is not possible.
New in v 3.3.4–3.3.5: the trimmed clip is auto-downloaded using the same filename shown in the playlist (WAV or MP4/WebM), immediately after creation.
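The studio performs this trim in browser JavaScript; as a rough illustration of the same audio pipeline (slice the loop interval, apply short fade-in/fade-out ramps, encode 16-bit PCM WAV), here is a minimal Python sketch using only the standard library. The function name and the 10 ms fade length are illustrative choices, not taken from the app.

```python
import math
import struct
import wave

def trim_to_wav(samples, sr, t_a, t_b, path, fade_s=0.01):
    """Slice mono samples in [t_a, t_b), apply short linear fades, write 16-bit PCM WAV."""
    a, b = int(t_a * sr), int(t_b * sr)
    clip = list(samples[a:b])
    n_fade = min(int(fade_s * sr), len(clip) // 2)
    for i in range(n_fade):                          # fade-in / fade-out ramps
        g = i / n_fade
        clip[i] *= g
        clip[-1 - i] *= g
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)                            # 16-bit PCM
        w.setframerate(sr)
        pcm = b"".join(struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
                       for s in clip)
        w.writeframes(pcm)
```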
Mix (matrix A) Mix combines the last two playlist entries into a stereo mixture using a fixed 2×2 mixing matrix:
A = [[1, 1], [0.5, 2]], where rows index output channels (L,R) and columns index sources (S1,S2).
Processing: each source is downmixed to mono, linearly resampled to a common sample rate, then mixed by A with automatic peak-based scaling to avoid clipping.
Output: stereo WAV blob (L = mixture#1, R = mixture#2), auto-named as MixA_S1+S2_YYYYMMDDhhmmss.wav, appended to the playlist, and auto-selected for playback. Tempo metadata and overview are computed for the mix and stored under its filename.
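A minimal sketch of this mixing step, assuming the matrix A, zero-padding of the shorter source, and peak-based scaling described above (the function name and pure-Python style are ours; the app itself does this in browser JS):

```python
def mix_with_matrix(s1, s2, A=((1.0, 1.0), (0.5, 2.0))):
    """Mix two mono sources into stereo via a 2x2 matrix, then peak-normalize."""
    n = max(len(s1), len(s2))
    s1 = list(s1) + [0.0] * (n - len(s1))            # zero-pad the shorter source
    s2 = list(s2) + [0.0] * (n - len(s2))
    left  = [A[0][0] * a + A[0][1] * b for a, b in zip(s1, s2)]
    right = [A[1][0] * a + A[1][1] * b for a, b in zip(s1, s2)]
    peak = max(max(abs(v) for v in left), max(abs(v) for v in right), 1.0)
    scale = 1.0 / peak                               # peak-based scaling avoids clipping
    return [v * scale for v in left], [v * scale for v in right]
```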
ICA separation ICA operates on the currently selected stereo item (e.g., a Mix result).
Internals: 2×N mixtures are centered, whitened via a 2×2 symmetric eigendecomposition, then separated with a 2-component FastICA (tanh nonlinearity, symmetric decorrelation between components, Frobenius-norm convergence).
Output: two mono WAV signals (ICA_A_of_* and ICA_B_of_*), normalized with modest headroom and short fades, appended to the playlist as independent entries with their own tempo and overview metadata.
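The centering-and-whitening stage can be illustrated compactly. The sketch below is an assumption-laden pure-Python rendition (PCA whitening, z = D^(-1/2) E^T x, via the closed-form eigendecomposition of the 2×2 sample covariance); the FastICA tanh iteration itself is omitted for brevity.

```python
import math

def whiten_2ch(x1, x2):
    """Center two mixture channels, then whiten via the eigendecomposition
    of the 2x2 covariance matrix (PCA whitening: z = D^{-1/2} E^T x)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    x1 = [v - m1 for v in x1]                        # centering
    x2 = [v - m2 for v in x2]
    a = sum(v * v for v in x1) / n                   # covariance entries
    c = sum(v * v for v in x2) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    half_tr, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    l1, l2 = half_tr + rad, half_tr - rad            # eigenvalues
    if abs(b) > 1e-12:
        e1, e2 = (b, l1 - a), (b, l2 - a)            # eigenvectors (b, lambda - a)
    elif a >= c:
        e1, e2 = (1.0, 0.0), (0.0, 1.0)
    else:
        e1, e2 = (0.0, 1.0), (1.0, 0.0)
    e1 = [v / math.hypot(*e1) for v in e1]
    e2 = [v / math.hypot(*e2) for v in e2]
    s1, s2 = 1 / math.sqrt(max(l1, 1e-12)), 1 / math.sqrt(max(l2, 1e-12))
    z1 = [s1 * (e1[0] * u + e1[1] * v) for u, v in zip(x1, x2)]
    z2 = [s2 * (e2[0] * u + e2[1] * v) for u, v in zip(x1, x2)]
    return z1, z2
```

After whitening, the two channels are uncorrelated with unit variance, which is the precondition FastICA relies on.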
Decoding & fallback Primary decoding path: decodeAudioData on fetched/uploaded bytes. For playlist URLs, fetch is attempted first.
Fallback: full-length or range-limited capture via MediaElementSource → AudioWorklet (preferred) or ScriptProcessor, routed through a zero-gain node to keep the capture path inaudible. The muted property is never used in logic.
Tempo metadata If present, /media/tempo_meta.json (keyed by filename) provides BPM and auxiliary fields (confidence, beat period, half/double suggestions, textual tempo class), which are reflected both in the playlist badge and the Tempo details panel.
Otherwise, an internal estimator runs on decoded buffers or short capture segments, yielding approximate BPM and beat-period values sufficient for exploratory work.
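The guide does not specify the estimator's algorithm; one common approach consistent with the description (approximate BPM and beat period from a decoded buffer) is autocorrelation of a rectified envelope. The sketch below, with an illustrative 60–180 BPM search range, is an assumption rather than the app's actual code.

```python
def estimate_bpm(samples, sr, bpm_min=60, bpm_max=180):
    """Rough BPM estimate: autocorrelate the rectified, mean-removed envelope
    and pick the strongest lag inside the plausible beat-period range."""
    env = [abs(s) for s in samples]                  # crude energy envelope
    n = len(env)
    mean = sum(env) / n
    env = [e - mean for e in env]
    lag_lo = int(sr * 60 / bpm_max)                  # shortest allowed beat period
    lag_hi = int(sr * 60 / bpm_min)                  # longest allowed beat period
    best_lag, best_r = lag_lo, float("-inf")
    for lag in range(lag_lo, min(lag_hi, n // 2) + 1):
        r = sum(env[i] * env[i + lag] for i in range(n - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return 60.0 * sr / best_lag
```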
Uploads Accessible uploader with ➕ Upload button and drag-&-drop support; the uploader itself is keyboard-focusable.
Typical formats: MP3, M4A, FLAC, WAV, OGG, AAC, and common video containers such as MP4, MOV, WebM, MKV, and AVI.
First-30-second cue Uploader border and hint text gently pulse every 2 s for the first 30 s after load, encouraging an initial user gesture that reliably resumes the AudioContext on modern browsers.
A–B Looping Seek bar shows cerulean A (“[”) and B (“]”) handles plus a thin ultramarine loop fill, always constrained within the gray full-track bar.
Clear restores full-length playback. During playback, when the playhead reaches B, it wraps to A (with a small tolerance) as long as the loop is active.
Playhead Current time is indicated by a vertical “I”; the center of that stroke corresponds to the true position. The playhead is draggable and is clamped within the current loop range.
Click-to-toggle video Single-click on the video element toggles play/pause; double-click toggles fullscreen. The central ⏸︎/▶︎ transport button remains synchronized with element state.
Autoplay The first playlist item may start automatically depending on browser autoplay policy. The AudioContext resumes on the first user interaction (click, drag, drop, or keyboard action) to ensure consistent audio routing.
Repeat Mode Repeat cycles between One (🔁 with “1”), All (🔁), and Off (⛔).
With an A–B loop active, playback wraps within the loop regardless of repeat mode. When the loop is cleared, Repeat = All advances across playlist items; Repeat = One replays the same item.
Controls ⏮︎ Prev • ⏸︎/▶︎ Toggle • ⏭︎ Next • 🔁/⛔ Repeat • ✖ Loop-Clear • ⛶ Fullscreen (video).
Seek & Time Smooth range input with live “elapsed / total” time label, draggable A–B handles, thin loop fill, and a precise “I”-shaped cursor.
Loop bounds constrain both seeking and continuous playback; a small, duration-dependent epsilon avoids stickiness at the upper boundary during wrap.
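As an illustration of the wrap logic only: the app's actual epsilon formula is not documented, so the constants below are placeholders.

```python
def maybe_wrap(t, a, b, duration):
    """A-B loop wrap with a small, duration-dependent epsilon near B
    (placeholder constants; the real formula is internal to the app)."""
    eps = max(0.01, duration * 1e-4)
    return a if t >= b - eps else t
```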
Volume 0–500 % via WebAudio GainNode (primary route, single audible path).
If WebAudio is unavailable, a graceful fallback uses native element volume (0–100 %). The design avoids double-routing and unintended parallel audio paths.
Speed 0.05× – 2.00× with − / + step buttons (0.01 increments) and a 1× reset button. The same playback rate is applied to both audio and video media elements.
Tempo details Tempo panel presents BPM (with confidence), beat period (ms), half/double candidates, tempo class (Slow/Moderate/Fast), and effective BPM at the current playback speed (BPM × rate).
The panel is visible whenever either file-based metadata or the internal estimator provides data for the selected item.
Overview (playlist.json-aware) Overview is a whole-file SVG representation built from min/max envelopes over fixed buckets. In v 3.3.5, an internal helper ensures that an Overview is generated for the currently selected item even when it comes from playlist.json loaded at startup (audio or video).
Once constructed, the same Overview supports both the main Overview view and the centered Micro view around the playhead.
Signal views Overview (entire file, absolute timebase, interactive loop brackets and cursor), Mid (live trailing window, default 8 s), and Micro (centered ±3 s around the playhead; falls back to trailing when no Overview is available).
Band rows (Low ≤~200 Hz, Mid ~200–2000 Hz, High ≥~2 kHz) use a simple one-pole filter bank per band and share the same trailing length as the Mid window, with distinct color-coded strokes for quick visual discrimination.
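A one-pole filter bank of this kind can be sketched as follows. The exact coefficients in the app are not documented; this Python version simply realizes Low = LP(200 Hz), Mid = LP(2 kHz) - LP(200 Hz), High = x - LP(2 kHz), which guarantees the three bands sum back to the input.

```python
import math

def one_pole_bands(samples, sr, f_lo=200.0, f_hi=2000.0):
    """Split a signal into Low/Mid/High bands using one-pole low-pass filters."""
    def lowpass(x, fc):
        a = math.exp(-2.0 * math.pi * fc / sr)       # one-pole coefficient
        y, out = 0.0, []
        for s in x:
            y = (1.0 - a) * s + a * y
            out.append(y)
        return out
    lp_lo = lowpass(samples, f_lo)
    lp_hi = lowpass(samples, f_hi)
    low = lp_lo
    mid = [h - l for h, l in zip(lp_hi, lp_lo)]
    high = [s - h for s, h in zip(samples, lp_hi)]
    return low, mid, high
```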
Live tap AudioWorklet-based collector (preferred) or ScriptProcessor fallback receives data from the shared MediaElementSource nodes via an inaudible zero-gain branch.
Envelope rings are filled at an effective rate of ~2 kHz and decimated to maintain responsiveness while limiting CPU load. Tap operations do not alter the audible signal.
Resizable wrapper Outer .wrapper uses resize:both; the default width is governed by --w (980 px), suitable for dual-column layouts on desktop screens.
The playlist panel is vertically resizable, allowing adaptation to longer track lists or small windows.
Accent colour Changing --accent (default #1e90ff) rebrands key UI elements, including buttons, sliders, pulse highlights, and active playlist rows, while preserving structural CSS.
Fullscreen The ⛶ button and the F key toggle fullscreen for video items only; audio items retain the compact layout. The output route is re-applied on fullscreen changes to maintain consistent gain behaviour.
Source-code reveal Embedded “Full Source Code” accordion shows the entire page’s HTML/JS/CSS, syntax-highlighted via Highlight.js, allowing inspection, copy-paste, and regression testing from a single file.
Namespace All logic resides inside a single IIFE; public surface is limited to instantiation of the WaveformStudio class against the #box container. CSS is scoped by class names to minimize interaction with surrounding pages or frameworks.
Notes & caveats Decoding and cross-origin fetching depend on server CORS configuration; when direct decoding fails, the capture-based fallback is used instead. Some exotic codecs or DRM-protected streams may remain unsupported.
Mixed, trimmed, and ICA-derived outputs are held as in-memory blobs and appear as playlist entries; only Trim explicitly triggers a download by default in v 3.3.5.

Guide to nGene Waveform Studio v 3.1.0

Topic Details
Purpose Two-column HTML5 studio for audio/video playback, live signal visualization, and lightweight tempo analysis. Pure vanilla JS; SVG-only waveforms; no frameworks or Canvas.
New in v 3.1.0: Mix button (right column) that combines the last two playlist items into a headroom-safe WAV and appends it to the playlist for immediate playback.
Layout Left column: Player (seek/loop, volume, speed, transport, playlist, uploads).
Right column: Mix toolbar, Tempo details panel, and Signal views (Overview, Mid, Micro, and band rows).
File locations Place nws.html anywhere.
Optional sibling folder /media/ for playlist.json and tempo_meta.json. Ensure readable permissions (e.g., chmod 644 *).
Playlist Optional /media/playlist.json — array of media paths (order preserved).
Absent JSON → starts empty and awaits uploads (drag-&-drop or picker). Uploaded files are referenced via blob-URLs (no disk writes).
Mix (new) Click Mix to combine the last two playlist entries (audio or the audio track of video).
Processing: OfflineAudioContext offline render; per-track gain = 0.5 for headroom; linear sum; length = max(duration).
Output: in-memory WAV blob, auto-named as Mix - A + B.wav, appended to the playlist, and auto-played. Status text reports progress or errors (e.g., CORS/decoding).
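In pseudocode terms, the v 3.1.0 mix reduces to a padded, gain-scaled linear sum; a Python sketch under those stated rules (per-track gain 0.5, length = max of the two durations; names ours):

```python
def mix_headroom(a, b, gain=0.5):
    """v3.1.0-style mix: per-track gain for headroom, linear sum, length = max."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))               # pad shorter track with silence
    b = list(b) + [0.0] * (n - len(b))
    return [gain * x + gain * y for x, y in zip(a, b)]
```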
Decoding & fallback Primary: decodeAudioData on fetched/uploaded bytes.
Fallback: full-length capture via MediaElementSource → Worklet/ScriptProcessor (kept inaudible through a zero-gain node; no muted property used).
Tempo metadata If available, /media/tempo_meta.json (keyed by filename) populates BPM and related fields in the list and Tempo panel. When absent, a quick internal estimator computes approximate BPM/beat period from short decoded segments or short captures.
Uploads ➕ Upload button and drag-&-drop; keyboard focusable uploader. Uploaded audio/video formats commonly supported: MP3/M4A/FLAC/WAV and MP4/MOV/WEBM/MKV/AVI.
First-30-second cue Uploader border and hint gently pulse every 2 s for the first 30 s after load to encourage interaction (resumes AudioContext reliably).
A–B Looping Seek bar shows cerulean A (“[”) and B (“]”) handles and a thin ultramarine loop fill, always inside the gray full-track bar.
Clear restores full-length playback instantly.
Playhead Current time indicated by a vertical “I”; the center of the line is the true position. Draggable, clamped within the loop.
Click-to-toggle video Single-click on video toggles play/pause; double-click toggles fullscreen. The ⏸︎/▶︎ control remains synchronized.
Autoplay First item may start automatically (per browser policy). AudioContext resumes on first user gesture (click, drag, drop) for consistent sound.
Repeat Mode Cycles: One (🔁 with “1”) → All (🔁) → Off (⛔).
With a loop active, playback wraps to loop start. After clearing loop and with Repeat = All, playback advances to the next track.
Controls ⏮︎ Prev • ⏸︎/▶︎ Toggle • ⏭︎ Next • 🔁/⛔ Repeat • ✖ Loop-Clear • ⛶ Fullscreen (video).
Seek & Time Smooth range input with live “elapsed / total”, draggable A–B handles, thin loop fill, and precise “I” cursor. Loop bounds clamp seeking and playback, with edge-aware wrap to loop start.
Volume 0–500 % via WebAudio GainNode (primary route). Graceful fallback uses element volume (0–100 %) if WebAudio is unavailable. Single audible route is always maintained.
Speed 0.05× – 2.00× with − / + step buttons (0.01) and 1× reset. Applies to audio and video uniformly.
Tempo details BPM (with confidence), beat period (ms), half/double suggestions, tempo class (Slow/Moderate/Fast), and effective BPM at current speed. Panel appears when data are available (from metadata file or internal estimator).
Signal views Overview (whole file; absolute “now” marker), Mid (live trailing window, default 8 s), Micro (centered ±3 s around playhead; falls back to trailing if overview not ready), and Band rows (Low ≤~200 Hz, Mid ~200–2000 Hz, High ≥~2 kHz) with color-coded strokes. Window lengths selectable; ✖ clears live buffers.
Live tap AudioWorklet collector (preferred) or ScriptProcessor fallback feeds envelope rings at ~2 kHz sampling for responsive SVG paths. Capture remains inaudible through a zero-gain branch; no reliance on muted.
Resizable wrapper Outer .wrapper uses resize:both; default width from --w (980 px for two columns). Track list is vertically resizable.
Accent colour Adjust --accent (default #1e90ff) to rebrand buttons, sliders, uploader, and active highlights.
Fullscreen Dedicated ⛶ button and keyboard F toggle fullscreen for video items.
Source-code reveal Built-in “Full Source Code” accordion displays the whole page, syntax-highlighted via Highlight.js, for sharing and tests.
Namespace All logic is encapsulated in an IIFE; CSS classes are locally scoped. Safe to embed alongside other pages and scripts.
Notes & caveats Decoding and cross-origin fetching depend on server CORS policies; when decoding fails, the inaudible capture fallback is attempted. Mixed output is stored as an in-memory blob (download prompt is not issued automatically).

Guide to nGene Media Player v 2.6

Topic Details
Purpose Self-contained, resizable HTML5 media player for audio (MP3/M4A/FLAC/WAV) and video (MP4/MOV/WEBM/MKV/AVI). Pure vanilla JS—no frameworks.
New since v 2.6: vertical “I” playhead (center = true position), refined A–B loop visuals, hardened uploads/drag-&-drop, reliable play/pause with AudioContext resume.
File locations Place nmp.html anywhere.
Media files live in sibling /media/.
Ensure readable permissions, e.g., chmod 644 *.
Playlist Optional /media/playlist.json — array of media paths (order preserved). If absent, player starts empty and waits for user uploads.
Tempo metadata Player reads tempo_meta.json (keyed by filename) to show integer-rounded BPM beside each track and in the title line (e.g., “128 BPM”).
Uploads Upload button and drag-&-drop. Files are played via blob-URLs (no disk writes). The dashed uploader box is clickable and keyboard-focusable.
First-30-second attention cue Uploader border and hint softly pulse/glow every 2 s for the first 30 s after load.
A–B Looping Seek bar shows two cerulean brackets:
A handle “[” — loop start.
B handle “]” — loop end.
Ultramarine blue loop bar (thinner) fills the loop region and is always fully inside the gray full-length bar (entire track).
Clear resets loop to full-length instantly.
Playhead Current position is a vertical “I” line; its center is the true time point. It can be dragged, and is always clamped inside the blue loop bar.
Click-to-toggle video Click anywhere on the visible video to play/pause; ⏸︎/▶︎ stays in sync. Double-click toggles fullscreen.
Autoplay First item starts automatically (subject to browser policy). AudioContext is resumed on first user gesture (e.g., button, drag, drop) for reliable playback.
Repeat Mode Cycles: 🔂 One → 🔁 All → ⛔ Off.
With a loop active, playback wraps to the loop start. After you press ✖ to clear loop and Repeat = All, the player advances to the next track at end (not the same track).
Controls ⏮︎ Prev • ⏸︎/▶︎ Toggle • ⏭︎ Next • 🔂/🔁/⛔ Repeat • ✖ Loop-Clear • ⛶ Fullscreen (video).
Seek & Time Sleek seek bar with live “elapsed / total” timer, A–B handles, thin blue loop bar, and draggable “I” playhead.
Volume 0–200 % gain via WebAudio (gain node). Default is 33 %. If WebAudio is unavailable, falls back to element volume (0–100 %).
Speed 0.05× – 2.00× slider with − / + step buttons (0.01) and 1× reset. Applies to both audio and video.
Resizable wrapper Outer .wrapper uses resize:both; default width from --w (360 px). Track-list is vertically resizable.
Accent colour Edit --accent (default #1e90ff) to rebrand buttons, slider thumbs, uploader, and active track highlight.
Fullscreen Dedicated ⛶ button and keyboard F toggle fullscreen for video items.
Source-code reveal Built-in “Full Source Code” accordion shows the entire page, syntax-highlighted via Highlight.js (for easy sharing/tests).
Namespace All logic wrapped in an IIFE; CSS uses scoped class names. Safe to embed alongside other scripts and styles.

Guide to nGene Media Player v 2.4

Topic Details
Purpose Self-contained, resizable HTML5 player for audio (MP3/M4A) and video (MP4/MOV/WEBM). Pure vanilla JS—no frameworks required.
New since v 1.8: tempo-aware track-list showing BPM (integer-rounded), auto-loading from tempo_meta.json; initial volume defaults to 17 % at page-load.
File locations Place nmp.html anywhere.
Media files live in a sibling /media/ folder.
Ensure readable permissions with chmod 644 *.
Playlist Optional /media/playlist.json—an array of paths (order preserved). If absent, the player simply waits for user uploads.
Tempo metadata Run extract_meta_from_media.py v 2.4 to generate tempo_meta.json (single integer-rounded bpm). Player displays it beside each track and in the title-bar as “### BPM”.
Uploads Upload button and drag-&-drop. Files become blob-URLs, so nothing is written to disk.
First-30-second attention cue Uploader border, hint-text and container gently pulse, glow and scale every 2 s for the first 30 s after page-load.
A-B Looping Seek-bar sports two cerulean “brackets”:
A handle “[” — left edge marks loop-start.
B handle “]” — right edge marks loop-end.
Drag to set; ultramarine bar fills the loop range. ✖ Clear button instantly resets the loop.
Click-to-toggle video Click anywhere on the visible video to play/pause; the ⏸︎/▶︎ button stays synchronised.
Autoplay The first track auto-starts; subsequent behaviour follows Repeat Mode.
Repeat Mode Begins at 🔂 One (loop current). Button cycles: 🔂 One → 🔁 All → 🔁 Off.
Controls ⏮︎ Prev • ⏸︎/▶︎ Toggle • ⏭︎ Next • Repeat — plus ✖ Loop-Clear beside the seek-bar.
Seek & Time Sleek seek-bar with live “elapsed / total” timer, integrated A-B loop handles and ultramarine fill.
Volume Smooth 0–100 % slider with live percentage label; initial default 17 % (0.17).
Speed 0.70× – 2.00× slider with − / + step buttons and 1× reset. Applies to audio & video.
Resizable wrapper Outer .wrapper uses resize:both; default width governed by --w (360 px). Track-list is vertically resizable.
Accent colour Edit --accent (default #1e90ff) to rebrand buttons, slider thumbs, active-track row and uploader pulse.
Source-code reveal Built-in “Full Source Code” accordion shows the entire page, syntax-highlighted via Highlight.js.
Namespace All logic wrapped in an IIFE; CSS uses local class names—safe to embed anywhere.

Guide to nGene Media Player v 1.8 (c)

Topic Details
Purpose Self‑contained, resizable HTML5 player for audio (MP3/M4A) and video (MP4/MOV/WEBM). Pure vanilla JS—no frameworks.
New since v 1.6 (c): draggable cerulean‑blue “bracket” handles for precise A‑B looping, ultramarine loop‑fill, and click‑to‑toggle playback directly on the video surface.
File locations Place nmp.html anywhere.
Media files live in a sibling /media/ folder.
Ensure readable permissions with chmod 644 *.
Playlist Optional /media/playlist.json—an array of paths (order preserved). If absent, the player simply waits for user uploads.
Uploads ➕ Upload button and drag‑&‑drop. Files become blob‑URLs, so nothing is written to disk.
First‑30‑second attention cue Uploader border, hint‑text and container gently pulse, glow and scale every 2 s for the first 30 s after page‑load.
A‑B Looping (1.8 series) Seek‑bar sports two cerulean “brackets”:
A handle “[” — left edge marks loop‑start.
B handle “]” — right edge marks loop‑end.
Drag to set; ultramarine bar fills the loop range. ✖ Clear button instantly resets the loop.
Click‑to‑toggle video Click anywhere on the visible video to play/pause; the ⏸︎/▶︎ button stays synchronised.
Autoplay The first track auto‑starts; subsequent behaviour follows Repeat Mode.
Repeat Mode (default) Begins at 🔂 One (loop current). Button cycles: 🔂 One → 🔁 All → 🔁 Off.
Controls ⏮︎ Prev • ⏸︎/▶︎ Toggle • ⏭︎ Next • Repeat — plus ✖ Loop‑Clear beside the seek‑bar.
Seek & Time Sleek seek‑bar with live “elapsed / total” timer. Integrates A‑B loop handles and ultramarine fill described above.
Volume Smooth 0–100 % slider with live percentage label.
Resizable wrapper Outer .wrapper uses resize:both; default width governed by --w (360 px). Track‑list is vertically resizable.
Accent colour Edit --accent (default #1e90ff) to rebrand buttons, slider thumbs, active‑track row and uploader pulse.
Source‑code reveal Built‑in “Full Source Code” accordion shows the entire page, syntax‑highlighted via Highlight.js.
Namespace All logic wrapped in an IIFE; CSS uses local class names—safe to embed anywhere.

Media Format and Codec Overview

Modern media players should support a variety of audio and video file formats. Below is an overview of commonly used formats, including their typical use cases, compatibility considerations, licensing issues, technical notes, and recommendations for use. Emphasis is placed on desktop and HTML5/JavaScript environments.

Common Audio Formats

MP3 (MPEG Audio Layer III)

AAC / M4A (Advanced Audio Coding)

Ogg Vorbis (and Opus)

FLAC (Free Lossless Audio Codec)

WAV (Waveform Audio File Format / PCM)

Common Video Formats

MP4 (H.264 Video in MP4 Container)

WebM (VP8/VP9 Video in WebM Container)

AV1 (Next-Generation Open Video Codec)

MKV (Matroska Video Container)

AVI (Audio Video Interleave)

MOV (QuickTime File Format)

Recommended Default Formats: Considering the above, for broadest compatibility and ease of use in a web-based desktop player, the recommended default formats are MP3 for audio and MP4 (H.264/AAC) for video. These two cover nearly all browsers and platforms with no special setup. In practice, this means the player should primarily handle MP3 for music and MP4 for video. However, to make nGene Media Player more robust and appealing, it should also support the common alternatives: including AAC (M4A) ensures high-quality audio support, Ogg Vorbis/Opus provides open-format options, and FLAC allows for lossless audio playback. On the video side, adding support for WebM (VP8/VP9) is advisable for modern browsers, and being mindful of AV1 will keep the player up-to-date with emerging standards. Less common or legacy formats like MKV, AVI, and MOV can be acknowledged, but the strategy should be to handle them via conversion or not at all, rather than as primary supported formats. By focusing on MP3 and MP4 as the core, and supplementing with the next tier of formats, the player will cater to most use cases while maintaining reliability.

Written on March 9, 2025


Meta Information Extraction (Audio and Video)

A media player like nGene Media Player not only plays audio and video but often also presents information about the media to the user. This includes basic details (duration, title) and possibly more advanced metadata (like album name, video resolution, etc.). Below, we outline what metadata can be obtained from media files and discuss methods to extract this information using web technologies (JavaScript in the browser) and Python (which could be used server-side or via PyScript in-browser). We also provide guidance on when to use client-side vs. server-side (or local) analysis based on the depth of metadata required.

Types of Media Metadata

Most of the above metadata can be accessed or computed with the right tools. The next sections describe how to retrieve these details using JavaScript in the browser and using Python, respectively.

Client-Side JavaScript Methods

In a purely browser-based environment (vanilla JavaScript), one can extract a subset of the above information. The HTML5 media elements and additional libraries are the primary means to do so:

Using the above methods, a web-based media player can gather a wealth of information without leaving the browser. For instance, on loading a file, the player could immediately display the duration via the duration property, show the title/artist by parsing tags with music-metadata, show the resolution via videoWidth/videoHeight, and perhaps generate a waveform preview using Web Audio – all done client-side. The main constraints are performance (very large files or very detailed analysis can be slow) and the necessity to include libraries or WASM modules (increasing app size). When extremely detailed info or heavy computation is needed, one might then consider Python or server-side tools, as described next.

Python and PyScript Approaches

Python has a rich ecosystem for media processing, and it can be used in two ways: on a backend server (or a local machine, outside the browser) to preprocess or analyze media, or via PyScript/WebAssembly to run Python code in the browser. Here we outline how Python libraries can extract metadata and do deeper analysis, and how that might fit into the architecture of the media player.

Architectural Considerations

When implementing metadata extraction in nGene Media Player, it’s important to choose the right tool for the job to provide a good user experience without unnecessary overhead. Here are some guidelines on when to use client-side JS vs. Python/back-end solutions:

In conclusion, the strategy for metadata should match the needs of the user base and the resources available. For a relatively small-scale or personal project, sticking to client-side solutions keeps things simple and respects user privacy. For a larger-scale application with many users and files, investing in backend services for richer metadata could greatly enhance the user experience. nGene Media Player can start by extracting what’s easy (duration, basic tags via JS) and progressively incorporate more advanced metadata features using Python tools as needed, ensuring that the architecture remains flexible for such upgrades.

Written on March 9, 2025


Design and UX Improvements for Desktop

With the functionality in place, attention turns to improving the user interface and experience of nGene Media Player. A desktop-focused web media player should leverage the larger screen and input options (mouse, keyboard) to provide an engaging and efficient experience. Below are suggestions for design and UX enhancements, organized into layout/visual improvements, interaction improvements, and the use of modern libraries to add polish. The tone of these suggestions is to enhance usability and aesthetics in a professional, subtle way without overwhelming the user.

Enhanced Layout and Visualizations

Improved User Interaction

Modern UI Libraries and Frameworks

By implementing these design and UX improvements, nGene Media Player will not only be functionally robust but also user-friendly and visually appealing. It will feel like a modern desktop application, with responsive controls, rich visuals like waveforms, and thoughtful details (like shortcuts and drag-drop) that desktop users appreciate. The use of web technologies and libraries means the player can achieve a high level of polish comparable to native apps, while remaining customizable and lightweight. As always, incremental enhancement is wise: features can be added step by step, gathering user feedback to refine the UX. Over time, these improvements can significantly elevate the user’s enjoyment and efficiency when using the media player, fulfilling the goal of a comprehensive and professional media playback experience.

Written on May 9, 2025


Analytical consultation for nWS v3.3.5 waveform processing (Written November 13, 2025)

Fourier Transformation for Waveform Analysis

The Fourier Transform is a fundamental tool that converts a time-domain signal into a frequency-domain representation. In essence, it decomposes a waveform into a sum of sinusoidal components of various frequencies. Mathematically, for a continuous signal \(x(t)\), the Fourier transform \(X(f)\) is defined by an integral that sums \(x(t)\) against complex exponentials \(e^{-j 2\pi f t}\) across time. This operation produces a complex function \(X(f)\) indicating the amplitude and phase of each frequency component present in the original signal. In the context of digital audio (with discrete samples), one uses the discrete Fourier transform (DFT), which similarly expresses a finite sequence as a combination of sinusoidal basis functions.
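Written out explicitly, the continuous transform and its discrete counterpart described above are:

\[
X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \qquad \text{(continuous Fourier transform)}
\]

\[
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1 \qquad \text{(DFT)}
\]

Here \(X[k]\) gives the amplitude and phase of the sinusoidal component at the \(k\)-th analysis frequency for an \(N\)-sample sequence \(x[n]\).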

By revealing the frequency content of a waveform, the Fourier transform provides insights that are difficult to obtain from raw time-domain data. In audio analysis scripts, applying a Fourier transform enables spectral visualization: for example, generating a frequency spectrum or spectrogram that shows how energy is distributed across frequencies (and over time, in the case of a spectrogram). The frequency-domain view makes it easy to identify prominent frequency components: one can readily spot the dominant pitch (fundamental frequency) of a sound and its harmonics, or recognize different sound sources by their distinct spectral patterns.

Fourier analysis also aids in segmentation and feature extraction. Different sections of an audio signal (such as phonemes in speech or notes in music) often exhibit distinct frequency profiles; thus, a script can detect transitions or segment the waveform by looking for changes in the spectrum. Moreover, many audio features and processing techniques are based on the Fourier transform. For instance, one can filter out unwanted noise by zeroing out specific frequency bands in the spectrum, or compute descriptive metrics like the spectral centroid (the “center of mass” of the spectrum) and spectral bandwidth. In summary, the Fourier transform is a cornerstone of waveform analysis, transforming complex time-domain data into a form that is more amenable to visualization, measurement, and algorithmic manipulation.

Fourier Transform vs. Fast Fourier Transform (FFT)

While the term Fourier Transform refers broadly to the mathematical conversion between time-domain and frequency-domain representations, the Fast Fourier Transform (FFT) is a specific efficient algorithm for computing the Fourier transform (particularly the DFT) in practice. The FFT leverages symmetries in the calculation to greatly speed up the transformation. The comparison below highlights key differences and roles of each:

Aspect Fourier Transform (FT) Fast Fourier Transform (FFT)
Definition A general mathematical transform mapping a signal from the time domain to the frequency domain. Can be formulated as an integral (continuous case) or a summation (DFT for discrete signals). An algorithm (family of algorithms) to compute the discrete Fourier transform rapidly. It gives the same result as the DFT but far more efficiently.
Computation Conceptually involves integrating or summing over all time samples with complex exponentials. Direct computation of an N-point DFT has complexity on the order of O(N²). Uses a divide-and-conquer approach (e.g. the Cooley-Tukey algorithm) to reduce computational workload. Achieves roughly O(N log N) complexity, which is substantially faster for large N.
Usage Provides the theoretical foundation for frequency analysis; used in analytical derivations and definitions (e.g. defining the spectrum of a signal). Used for practical computation in software and scripts. In almost all real applications (audio analysis, signal processing), one calls an FFT routine to obtain the frequency spectrum of a dataset.

Practical note: In scripting and signal processing work, the FFT is the de facto method to perform Fourier analysis on data. One rarely computes a Fourier transform “by hand” except for theoretical work; instead, built-in FFT functions efficiently yield the frequency-domain data. Both FT and FFT produce the same kind of output (frequency-domain representation), but the FFT makes it feasible to analyze long signals and even to do real-time spectral processing thanks to its speed.
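The equivalence claimed in the table can be checked directly: a naive O(N²) evaluation of the DFT definition produces the same result as numpy's FFT routine, just far more slowly. A minimal sketch:

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    N = len(x)
    n = np.arange(N)
    # W[k, n] = e^{-j 2 pi k n / N}: one row of complex exponentials per bin.
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

X_naive = naive_dft(x)
X_fft = np.fft.fft(x)          # O(N log N) Cooley-Tukey FFT

# Same transform, vastly different cost: results agree to machine precision.
print(np.allclose(X_naive, X_fft))   # → True
```

For N = 256 the difference is negligible, but the O(N²) matrix approach becomes impractical for the million-sample signals typical of audio files, which is exactly why scripts call an FFT routine instead.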

Fundamental Attributes of Audio Waveforms

Sound waves have several measurable properties that correspond to how we perceive sound. A simple sinusoidal waveform can be expressed as \(x(t) = A \sin(2\pi f t + \phi)\), where \(A\) is the amplitude, \(f\) is the frequency, and \(\phi\) is the phase. These physical parameters relate directly to key auditory attributes: amplitude corresponds to perceived loudness, frequency corresponds to perceived pitch, and phase influences the waveform’s alignment (which can affect how waves interfere or combine). Real-world sounds are usually not single pure tones, but combinations of many frequency components; this gives rise to additional characteristics like timbre (the quality of sound that distinguishes different sources or instruments) and the amplitude envelope (how a sound’s loudness changes over time). Each of these attributes (amplitude, frequency, phase, timbre, and envelope) thus links a measurable physical parameter to a perceptual quality.
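The sinusoidal model \(x(t) = A \sin(2\pi f t + \phi)\) can be generated and probed directly in numpy. A minimal sketch (the particular A, f, and φ values are illustrative) showing that amplitude sets the signal's level while phase shifts the waveform without changing its magnitude spectrum:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
A, f, phi = 0.8, 220.0, np.pi / 4   # amplitude, frequency (Hz), phase (rad)

x = A * np.sin(2 * np.pi * f * t + phi)

# Amplitude sets the peak level (and hence relates to perceived loudness)...
peak = np.max(np.abs(x))

# ...while phase shifts the waveform in time without changing its spectrum:
x0 = A * np.sin(2 * np.pi * f * t)            # same tone, zero phase
same_spectrum = np.allclose(np.abs(np.fft.rfft(x)),
                            np.abs(np.fft.rfft(x0)), atol=1e-6)

print(peak, same_spectrum)
```

This phase-invariance of the magnitude spectrum is one reason spectral analysis focuses on amplitude and frequency: phase mainly matters when waves are combined and can interfere.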

Advanced Analytical Techniques in Signal Processing

Beyond the basic Fourier transform and the attributes of waveforms, there are several advanced techniques that can further assist in analyzing and processing audio signals. These methods either provide more detailed time-frequency information or apply statistical decomposition to extract meaningful components from complex data. Key techniques include the short-time Fourier transform (STFT), wavelet transforms, principal component analysis (PCA), independent component analysis (ICA), cepstral processing, and non-negative matrix factorization (NMF).

Each of the above techniques offers unique benefits for audio processing. Time-frequency methods like STFT and wavelet transforms allow detailed examination of when certain frequencies occur, addressing limitations of a plain Fourier transform for non-stationary signals. Statistical methods like PCA and ICA enable the extraction of patterns or sources from multivariate data, which is valuable when dealing with complex mixtures or reducing data dimensionality. Other specialized analyses such as cepstral processing and NMF target specific types of structure (periodicity in spectrum, or additive parts of a mixture) that are not immediately apparent from a basic FFT. By combining these approaches – Fourier-based transforms for spectral content, wavelets for multi-scale timing, and component analysis for pattern separation – an audio analysis script can be significantly enhanced, yielding richer insights and more powerful processing capabilities.
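As a concrete illustration of the time-frequency idea, an STFT can be written directly with numpy: slide a window along the signal and FFT each frame. This minimal sketch (frame size, hop, and the two-tone test signal are arbitrary choices) shows how the STFT reveals *when* each frequency occurs, which a single whole-signal FFT cannot:

```python
import numpy as np

sr = 4000
t = np.arange(sr) / sr
# Non-stationary test signal: 300 Hz for 0.5 s, then 600 Hz for 0.5 s.
x = np.where(t < 0.5,
             np.sin(2 * np.pi * 300 * t),
             np.sin(2 * np.pi * 600 * t))

# Minimal STFT: slide a Hann window along the signal, FFT each frame.
n_fft, hop = 256, 128
window = np.hanning(n_fft)
frames = [np.abs(np.fft.rfft(window * x[i:i + n_fft]))
          for i in range(0, len(x) - n_fft, hop)]
S = np.array(frames)                       # shape: (n_frames, n_bins)
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)

# Dominant frequency per frame reveals *when* each tone occurs.
dominant = freqs[np.argmax(S, axis=1)]
print(dominant[2], dominant[-2])
```

Early frames peak near 300 Hz and late frames near 600 Hz (to within the ~15.6 Hz bin resolution), localizing the transition that a plain Fourier transform would smear across the whole spectrum. Production code would typically use `scipy.signal` or `librosa.stft` instead of this hand-rolled loop.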

Written on November 13, 2025


Heart Sound Analysis with Audio-Only Data and Synthetic Recordings (Written November 14, 2025)

Heart sound analysis is the study of the audible sounds produced by the heart, recorded as a phonocardiogram (PCG), to detect health conditions or even identify individuals. Traditionally, doctors use a stethoscope to listen to heart sounds for diagnosing murmurs, valve problems, or other cardiac issues. With modern technology, these sounds can be recorded as digital audio, enabling computerized analysis using signal processing and deep learning. Focusing on audio-only data (without additional signals like ECG or imaging) is a practical approach, especially since heart sounds alone carry rich information about cardiac function. Below, we discuss the sources of heart sound recordings, challenges in using them, and how data augmentation and synthetic recordings (including simulator-based audio) are improving heart sound analysis.

I. Heart Sound Datasets and Audio-Only Recordings

Collecting real heart sound recordings is the first step for any audio-based analysis. Heart sounds are typically recorded using electronic stethoscopes or microphones placed on the chest. Over the years, several datasets of these audio-only heart recordings have been compiled for research and education:

  1. Educational Libraries:

    For example, the Heart Sound and Murmur Library (University of Michigan, 2015) is an open collection of stethoscope recordings. It contains examples of normal heartbeats and various murmurs. Such libraries are relatively small (a few dozen recordings) and meant for teaching, but they provide clear samples of different heart sound types.

  2. PhysioNet/CinC Challenge Dataset (2016):

    A large public dataset assembled for a heart sound classification challenge. It comprises thousands of PCG recordings collected from multiple sources and countries. The recordings include both normal and abnormal heart sounds (murmurs, etc.), captured with different devices in varied environments. This diversity makes it valuable for training models, though it also introduces noise and heterogeneity.

  3. CirCor DigiScope Phonocardiogram Dataset (2022):

    One of the largest heart sound datasets to date, with over 5,000 recordings, focused on pediatric patients. It was created for a recent PhysioNet challenge on murmur detection. Importantly, this dataset provides multiple recording spots per patient (various chest locations) and includes labels for murmurs. Being a big audio-only collection, it supports deep learning models that require lots of data.

  4. Other datasets:

    Researchers have also used smaller collections from hospitals or labs. Some include specific conditions (e.g., only certain valve diseases) or specific populations. The general trend is that purely audio heart datasets are much smaller than, say, image datasets in other domains, due to the effort needed to record and label each patient's heart sounds.

All these recordings are pure sound (PCG) data. They capture the lub-dub of heartbeats and any extra sounds (murmurs, clicks) but no additional signals. Working with audio-only data is appealing because recording audio is non-invasive and simple compared to imaging or other tests. However, relying on sound alone means the analysis must overcome some challenges inherent to audio data, as discussed next.

II. Challenges with Real Heart Sound Data

Using only real heart sound recordings for automated analysis comes with several challenges:

  1. Limited Data Volume:

    Compared to fields like image or speech recognition, heart sound datasets are quite limited in size. Collecting heart audio requires clinical access and expertise (for labeling what is normal vs abnormal). Privacy and consent issues also limit sharing patient data. As a result, researchers often have only a few thousand recordings or less, which can be insufficient for training complex deep learning models.

  2. Class Imbalance:

    In many heart sound datasets, normal recordings far outnumber abnormal ones. For example, there are many recordings of healthy heartbeats, but relatively fewer examples of rare murmurs or conditions. This imbalance makes it hard for a model to learn the subtleties of abnormalities – it might simply learn to always predict "normal". The model’s performance on detecting actual pathological cases can suffer as a result.

  3. Noise and Variability:

    Heart audio recorded in real-life settings often contains noise. There can be background sounds (hospital room noise, stethoscope friction, patient movement) and other body sounds (lung sounds overlapping the heart sounds). Additionally, different stethoscope devices and placement sites produce variations in sound quality and frequency content. This high variability means a model trained on one dataset might not perform well on another if the noise profiles differ. It’s a challenge to make models robust to these differences using limited real data.

  4. Annotation Difficulty:

    Determining the ground truth (what exactly the heart sound signifies) often requires expert listening. Labeling a murmur or diagnosing a condition from sound is sometimes subjective and error-prone. So, real datasets may have label noise or inconsistencies. For tasks like biometric identification using heart sounds, labeling who the sound belongs to is easier, but such use-cases are less common and still experimental.

Because of these challenges, researchers seek ways to enhance and expand the available audio data without having to gather countless new patient recordings. This is where data augmentation and synthetic data generation become crucial.

III. Augmentation of Heart Sound Recordings

Data augmentation refers to taking existing real recordings and modifying them in various ways to create "new" training examples. The key idea is to expand the dataset artificially and introduce variations that improve a model’s generalization. For heart sound (audio) data, common augmentation techniques include:

  1. Adding Noise:

    Overlaying recordings with additional noise can help a model learn to focus on the relevant heart sound patterns and become noise-tolerant. For instance, one can add white noise, ambient hospital sounds, or respiratory noises at various levels to a clean heartbeat recording. This teaches the model to handle different signal-to-noise scenarios.

  2. Time Stretching/Compressing:

    Slightly changing the speed of the audio without altering pitch can simulate different heart rates. A recording can be time-stretched to sound a bit slower or faster (within realistic limits) which is like having the patient’s heart beating at a different rate. This augmentation helps the model cope with heart rate variability.

  3. Pitch Shifting (Frequency Scaling):

    Although heart sounds don’t exactly have a “pitch” like music, one can alter the frequency content a bit – for example, simulating the effect of different stethoscope frequency responses or chest anatomy. A mild pitch shift can make the sound a bit higher or lower in frequency, which may help the model to not be overly tuned to one particular frequency profile.

  4. Splitting and Combining:

    Long heart sound recordings can be split into shorter segments (which provides more training samples). Conversely, one might concatenate beats from different recordings to create a new sequence. This can be tricky for preserving realism, but sometimes mixing segments helps ensure the model sees a variety of beat patterns.

  5. Random Volume and Filtering:

    Changing the volume (amplitude) simulates varying auscultation pressure or device gain. Applying filters (like bass boost or treble cut) can mimic using different stethoscope hardware. These augmentations ensure the model doesn’t get thrown off by recordings that are louder, quieter, or slightly filtered relative to the training data.

By augmenting the available heart sound recordings in these ways, researchers can greatly increase the number of training examples and the diversity of conditions. For example, a dataset of a few hundred real recordings can be expanded to thousands of augmented samples by applying combinations of these techniques. This has been shown to improve performance; the model learns to recognize the underlying heart sound patterns (normal or abnormal) under various noise and distortion conditions, rather than overfitting to the exact original recordings.
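A few of the augmentations above can be sketched with plain numpy. This is a toy illustration, not a production pipeline: the SNR levels, gain range, and the decaying-burst "heartbeat" are invented for the example, and the speed change uses naive interpolation (real pipelines would typically use a pitch-preserving stretch such as `librosa.effects.time_stretch`):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(x, snr_db):
    """Mix in white noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def random_gain(x, low_db=-6.0, high_db=6.0):
    """Scale amplitude by a random gain, mimicking varying device levels."""
    gain_db = rng.uniform(low_db, high_db)
    return x * 10 ** (gain_db / 20)

def speed_change(x, rate):
    """Resample by linear interpolation to simulate a different heart rate.
    (A naive stand-in for proper time stretching: it also scales the
    frequency content slightly, which is often tolerable for PCG.)"""
    n_out = int(len(x) / rate)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

# Toy "heartbeat": a decaying 50 Hz burst repeated at ~1 Hz for 2 seconds.
sr = 2000
t = np.arange(2 * sr) / sr
beat = np.sin(2 * np.pi * 50 * t) * np.exp(-8 * (t % 1.0))

# Chain the augmentations to mint one "new" training example.
augmented = random_gain(add_noise(speed_change(beat, rate=1.2), snr_db=15))
print(augmented.shape)
```

Applying such chains with randomized parameters to each real recording is how a dataset of hundreds of files is expanded into thousands of varied training samples.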

However, augmentation can only produce variations of what already exists in the data. It doesn’t create entirely new heart sound events that were never recorded. For generating completely new heart sound samples (especially of rare conditions), researchers turn to synthetic data generation.

IV. Synthetic Heart Sound Generation

Synthetic generation involves creating artificial heart sound signals that imitate real ones. Unlike simple augmentation (which modifies real recordings), synthetic data can provide brand-new examples, potentially including pathological patterns that are under-represented in real data. Several approaches have emerged for synthesizing heart sounds:

  1. Physiological Signal Models:

    Earlier attempts used mathematical models of the heart’s mechanics and blood flow to synthesize phonocardiograms. For instance, one can model the heart valves opening/closing and generate corresponding sound waves. These models could produce basic normal heartbeat sounds and some murmur-like effects by altering parameters (like simulating a leaky valve). While insightful, purely mathematical models often struggle to capture the full complexity and natural variability of real heart sounds.

  2. Generative Adversarial Networks (GANs):

    In recent years, GANs have been applied to heart sound data. A GAN is a deep learning model with two parts (generator and discriminator) that can learn to create realistic fake samples. Researchers have trained GANs on collections of real heart sounds so that the generator can output new audio waveforms that sound like heartbeats. One notable use-case is generating abnormal heart sounds (e.g., murmurs indicative of disease) because these are less common in datasets. By creating synthetic abnormal samples, the training set can be balanced. Studies have shown that using GAN-generated heart sounds as additional training data improves a model’s ability to detect cardiac abnormalities. The synthetic sounds, if high-quality, can introduce subtle variations of murmurs that the model might not see in the limited real dataset. Progressive GAN architectures have been reported to produce fairly realistic heart cycles, and when classifiers are trained on a mix of real and GAN-generated data, their accuracy on detecting conditions improved compared to training on real data alone.

  3. Diffusion Models and Other Deep Generators:

    Beyond GANs, new generative frameworks like diffusion probabilistic models have been explored for heart sound synthesis. Diffusion models gradually add and remove noise to/from data in a learning process, and they have achieved excellent fidelity in audio generation (they are used in some speech synthesis tasks). Researchers have begun applying these to heart sounds, sometimes in creative ways – for example, generating a heart sound conditioned on an ECG signal. In one recent approach, a diffusion model was used to create artificial heart sound waves (PCG) from corresponding ECG recordings. This effectively augments existing ECG datasets with synthetic heart sound data. Even without conditioning on ECG, diffusion models can be trained to generate heart sound clips that are hard to distinguish from real stethoscope recordings. The key advantage of these advanced generative models is the quality of synthetic output: they can capture the timing and timbre of real heartbeats, including subtle murmurs or extra sounds, more convincingly than older methods.

  4. Variational Autoencoders (VAEs) and Others:

    VAEs and similar generative networks have also been tried for creating heart sound spectrograms or waveforms. These tend to produce slightly blurrier outputs compared to GANs or diffusion, but can still add variety to the dataset.

Synthetic heart sounds generated by these methods can significantly increase the training data, especially for rare conditions. For example, if the real dataset has only a handful of recordings of a particular murmur type, a GAN or diffusion model trained on them might produce dozens of plausible new examples of that murmur. These can then be added to training. It is crucial, however, that synthetic sounds are realistic. Poor-quality synthetic data might contain artifacts or unrealistic patterns that could confuse the model. Therefore, researchers usually validate synthetic samples (e.g., have experts or algorithms check that they resemble real heartbeats) before trusting them for model training.
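In the spirit of the physiological signal models described above, even a very crude synthetic PCG can be built from Gaussian-windowed tone bursts standing in for S1 ("lub") and S2 ("dub"). This is a toy sketch, not a validated model; the frequencies, timings, widths, and jitter are illustrative assumptions:

```python
import numpy as np

def heart_sound_cycle(sr=2000, hr_bpm=70, s1_freq=50.0, s2_freq=80.0):
    """One synthetic PCG cycle: S1 and S2 modeled as Gaussian-windowed
    tone bursts. All parameter values here are illustrative guesses."""
    period = 60.0 / hr_bpm
    t = np.arange(int(sr * period)) / sr

    def burst(center, freq, width, amp):
        return (amp * np.exp(-((t - center) ** 2) / (2 * width ** 2))
                * np.sin(2 * np.pi * freq * t))

    s1 = burst(0.05, s1_freq, 0.02, 1.0)                  # "lub"
    s2 = burst(0.05 + 0.3 * period, s2_freq, 0.015, 0.7)  # quieter "dub"
    return t, s1 + s2

t, pcg = heart_sound_cycle()

# Repeat cycles with small amplitude jitter to build a synthetic recording.
rng = np.random.default_rng(7)
recording = np.concatenate([pcg * rng.uniform(0.9, 1.1) for _ in range(5)])
print(len(recording))
```

Models like this capture only the gross lub-dub structure; the appeal of GANs and diffusion models is precisely that they learn the residual complexity (murmur textures, beat-to-beat variability) that hand-built formulas miss.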

V. Simulator-Based Heart Sound Recordings

Another source of augmented audio-only data is using clinical simulators or manikins. Medical training manikins often have built-in speakers and software that can emulate heart and lung sounds for different conditions. These simulator-based recordings occupy a middle ground between real and fully synthetic data:

  1. Manikin Recordings:

    A digital stethoscope can be placed on a training manikin (or a specialized simulator device) which is programmed to play a specific heart sound scenario (such as a murmur of a certain type, or a normal heart with a particular rate). The resulting recording is an audio file that is technically "real" in the sense that it was recorded through a stethoscope, but the source of the sound is an artificial simulation. One publicly available dataset, for instance, includes over 500 recordings from a clinical manikin, covering various normal and abnormal heart and lung sounds. These are useful because the exact diagnosis or condition for each recording is known (since the scenario was programmed). They also allow repetition – researchers can generate as many recordings as needed of a certain condition by replaying it or adjusting the simulator.

  2. Consistency and Variation:

    Simulator-based sounds are consistent (which is good for focused training data on a specific condition) but can lack some variation present in real patients. For example, a manikin’s “aortic stenosis murmur” might always have the same character, whereas real patients with the same condition could have slight differences in their murmur sounds due to anatomy or comorbidities. Therefore, while manikin recordings enhance data volume and provide ground-truth labels, they may not capture the full diversity of real heart sound presentations.

  3. Augmenting Simulated Sounds:

    Interestingly, one can also apply the earlier augmentation techniques to simulator recordings. For instance, taking a clear manikin-generated murmur sound and adding noise or slight filtering could make it more realistic. In this way, simulator data can serve as a base which is then diversified through augmentation.

Simulator-based recordings are especially valuable for training and initial algorithm development. They ensure that at least the algorithm has heard examples of the condition it’s supposed to detect. Later on, fine-tuning with real patient recordings can adjust the model to real-world idiosyncrasies. Overall, simulators provide a safe, repeatable, and cost-effective way to get more heart sound data without needing to find numerous patients with each condition.

VI. Benefits of Augmented and Synthetic Data in Heart Sound Analysis

Incorporating augmented and synthetic heart sound recordings has shown clear benefits for machine learning models:

  1. Improved Accuracy:

    By training on a larger and more diverse dataset (real + augmented + synthetic), models generalize better. Studies have reported that classifiers for detecting abnormal heart sounds achieved higher accuracy when rare abnormal examples were bolstered with synthetic instances. Even modest gains in accuracy can be significant in a clinical context – for example, catching a few more cases of disease that might have been missed.

  2. Better Generalization and Robustness:

    Perhaps the biggest advantage is improved robustness. A model trained on varied data (different noises, different simulated conditions) is less likely to be thrown off by a slightly different recording. In fact, experiments have shown that when a model is tested on an entirely new dataset (from a different hospital or recorded with a different device), those trained with extensive augmentation/synthesis maintain performance much better. One report noted dramatic improvements in cross-dataset evaluation: a classifier trained with synthetic augmented data saw its performance on an external test set jump considerably (indicating it wasn’t overfit to the quirks of the original training set). This robustness is crucial for real-world deployment, where a heart sound AI might encounter sounds from many environments.

  3. Addressing Imbalance:

    Synthetic generation specifically helps address the class imbalance problem. By generating more samples of under-represented classes (e.g. various murmur types, heart defect sounds), the training data becomes more balanced. A model trained on a balanced set is less biased and more sensitive to detecting those abnormal cases. In practical terms, this means fewer false negatives (missing a pathology) because the model had plenty of examples to learn what that pathology sounds like.

  4. Enabling New Applications:

    With more data available through augmentation, researchers have begun exploring ambitious applications like heart sound biometric identification (using a person’s unique heart sound as an ID). This is a challenging task because each recording can vary with conditions, but having lots of audio data (including simulated variations of an individual’s heart sound) could help algorithms discern person-specific patterns. Augmented data also supports training deep neural networks for tasks like segmentation (finding exact timing of heartbeats) and multi-condition classification (distinguishing between different murmur types), where large datasets are needed for the model to learn fine-grained differences.

  5. Rapid Experimentation:

    Another benefit is the ability to try out scenarios that are rare in reality. For instance, if one wants to test an algorithm’s ability to detect an extremely rare heart defect, creating a synthetic version of that defect sound and inserting it into various backgrounds can allow preliminary testing of the model’s sensitivity. This way, researchers aren't entirely constrained by what they can collect in clinics.

It’s worth noting that while augmented and synthetic data improve models, they must be used carefully. If the synthetic data is too artificial or if augmentation is overdone (creating sounds that no longer resemble real physiological signals), models might learn wrong patterns. The best practice is to combine real and synthetic data and validate the model extensively on real-world recordings to ensure it performs as intended.

VII. Conclusion

In summary, audio-only heart sound recordings are a powerful resource for non-invasive cardiac diagnosis and potentially for biometric identification. Numerous datasets of heart sounds have been gathered, but they are often limited in size and scope. By focusing on sound alone, one avoids the complexity of additional sensors, but this places more importance on having rich and sufficient audio data. Data augmentation techniques have become a standard tool to enrich heart sound datasets, introducing variability in noise, timing, and frequency that help machine learning models learn robust features. Beyond that, synthetic heart sound generation – through advanced AI models or simulator-based recordings – has opened new avenues to significantly expand the training data with realistic examples of normal and pathological heart sounds. These approaches help overcome the challenges of data scarcity and imbalance, leading to models with higher accuracy and better generalization to real-world conditions.

The combination of real heart recordings with augmented and synthetic data is enabling more reliable heart sound analysis systems. Researchers have demonstrated that this approach can improve detection of abnormalities (like murmurs) and make the algorithms more resilient to variations between different hospitals or recording devices. Looking forward, as generative models continue to improve, we can expect even more lifelike synthetic heart sounds to augment datasets. This will further reduce the dependency on large-scale clinical data collection and allow rapid development of heart sound AI tools. In essence, using sound-only data, enhanced with creative augmentation and synthetic generation, is a promising strategy to advance digital stethoscope applications – helping screen for heart conditions accurately and possibly verifying identity through the subtle acoustics of the heart. This audio-focused approach maintains the simplicity and non-invasiveness of the stethoscope while leveraging modern computational techniques to extract as much information as possible from the heartbeat sound.

Written on November 14, 2025


Script


Meta Information


Python Script for BPM & Tempo Extraction from Multiple M4A Files (Written May 18, 2025)

This document describes extract_meta_from_media.py (v1.1), an enhanced Python script that computes the global BPM of every .m4a file in ~/Desktop/m4a and—new in this release—extracts tempo metadata and an instantaneous tempo curve for deeper musical analysis.

1. Objective

The script will:

  1. Locate all .m4a files in the m4a folder on your Desktop.
  2. For each file:
    • Estimate its global BPM with librosa.
    • Read any embedded BPM tag (iTunes “tmpo” atom).
    • Generate a frame-level tempo curve to reveal fluctuations over time.
  3. Print a clean report to the console for every track.

2. Prerequisites

  1. Python 3.8+ (macOS ships with an older Python—install a recent one via Homebrew if needed).
  2. Virtual-environment setup (recommended)
    Execute these commands from ~/Desktop:
    python3 -m venv venv
    source venv/bin/activate
    pip install --upgrade pip
  3. Libraries
    Install the three required packages inside the venv:
    pip install librosa mutagen numpy
    Optional but wise: librosa benefits from FFmpeg for broad codec support:
    brew install ffmpeg
  4. Folder structure
    Ensure your Desktop looks like:
    Desktop/
    ├── extract_meta_from_media.py
    └── m4a/
        ├── song1.m4a
        ├── song2.m4a
        └── …

3. Implementation

The complete v1.1 source code is reproduced below.

#!/usr/bin/env python3
"""
Filename  : extract_meta_from_media.py
Version   : 1.1
Author    : Hyunsuk Frank Roh

Description
-----------
Walk through ~/Desktop/m4a, estimate the *global* BPM of every .m4a file,
**and** (new in v1.1) extract extra tempo information:

•  Embedded tempo/BPM tag from the file’s metadata (iTunes ‘tmpo’ atom).  
•  An instantaneous tempo curve so you can see how BPM fluctuates over time.

Dependencies
------------
    pip install librosa mutagen numpy

Usage
-----
    python extract_meta_from_media.py
"""
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import os
from typing import List, Tuple, Optional

import numpy as np
import librosa
from mutagen.mp4 import MP4


# --------------------------------------------------------------------------- #
#                               Core routines                                 #
# --------------------------------------------------------------------------- #
def compute_tempo(
    audio_file_path: str,
    sr_target: Optional[int] = None
) -> Tuple[float, List[float]]:
    """
    Return (global_bpm, tempo_curve).

    Parameters
    ----------
    audio_file_path : str
        Path to an audio file (.m4a).
    sr_target : Optional[int]
        Target sample-rate for librosa.load (None = original file rate).

    Returns
    -------
    global_bpm : float
        Single BPM estimate from librosa’s beat tracker.
    tempo_curve : list[float]
        Frame-level BPMs returned by librosa.beat.tempo(..., aggregate=None).
    """
    y, sr = librosa.load(audio_file_path, sr=sr_target)

    # Global BPM via beat tracking
    global_bpm, _ = librosa.beat.beat_track(y=y, sr=sr)

    # Instantaneous tempo curve
    # (on librosa >= 0.10 this function also lives at librosa.feature.tempo)
    tempo_curve = librosa.beat.tempo(y=y, sr=sr, aggregate=None)

    return float(global_bpm), tempo_curve.tolist()


def read_tagged_tempo(audio_file_path: str) -> Optional[float]:
    """
    Fetch embedded tempo/BPM tag (iTunes ‘tmpo’ atom) if present.
    Returns None when no tag is found or the file type is unsupported.
    """
    try:
        audio = MP4(audio_file_path)
        if "tmpo" in audio.tags:          # ‘tmpo’ is usually a single int
            return float(audio.tags["tmpo"][0])
    except Exception:
        pass                              # Unsupported container or no tag
    return None


# --------------------------------------------------------------------------- #
#                                Main driver                                  #
# --------------------------------------------------------------------------- #
def main() -> None:
    desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
    m4a_folder   = os.path.join(desktop_path, "m4a")

    if not os.path.isdir(m4a_folder):
        print(f"Folder not found: {m4a_folder}")
        return

    m4a_files = sorted(
        f for f in os.listdir(m4a_folder) if f.lower().endswith(".m4a")
    )
    if not m4a_files:
        print(f"No .m4a files found in {m4a_folder}")
        return

    for filename in m4a_files:
        file_path = os.path.join(m4a_folder, filename)
        print(f"\nProcessing {filename} …")
        try:
            global_bpm, tempo_curve = compute_tempo(file_path)
            tagged_tempo = read_tagged_tempo(file_path)

            print(f"Estimated global BPM    : {global_bpm:.2f}")
            if tagged_tempo is not None:
                print(f"Embedded tempo tag      : {tagged_tempo:.2f} BPM")
            else:
                print("Embedded tempo tag      : – (none)")

            if tempo_curve:
                arr = np.array(tempo_curve)
                print(
                    "Instantaneous tempo stats:"
                    f" min {arr.min():.2f}"
                    f" | mean {arr.mean():.2f}"
                    f" | max {arr.max():.2f} BPM"
                )
                # Uncomment if you want to peek at the first few entries
                # print('Tempo curve (first 10):', ', '.join(f'{v:.2f}' for v in arr[:10]))

        except Exception as exc:
            print(f"Error processing {filename}: {exc}")


if __name__ == "__main__":
    main()  

4. Explanation of Key Enhancements

Component v1.0 Behaviour v1.1 Upgrade
read_tagged_tempo() – (did not exist) Uses mutagen to pull the iTunes BPM tag (tmpo) if it exists.
compute_tempo() Returned a single BPM value. Also returns a frame-level tempo curve via librosa.beat.tempo(..., aggregate=None).
Console output Only global BPM printed. Adds embedded tag (if present) plus min/mean/max of the tempo curve for quick insight.
Dependencies librosa, soundfile Now librosa, mutagen, numpy (soundfile is still auto-pulled by librosa).

5. Program Flow Diagram (Updated)

┌────────────────────────────┐
│   Start Script             │
└────────────────────────────┘
            │
            ▼
┌────────────────────────────┐
│ 1. Verify ~/Desktop/m4a    │
└────────────────────────────┘
            │
            ▼
┌────────────────────────────┐
│ 2. List all .m4a files     │
└────────────────────────────┘
            │
   ┌────────┴─────────┐
   │ Any files found? │
   └────────┬─────────┘
      Yes   │   No
            │
            ▼
┌────────────────────────────────────┐
│ 3. For each file:                  │
│    • Estimate global BPM           │
│    • Read embedded BPM tag         │
│    • Compute tempo curve           │
│    • Print results                 │
└────────────────────────────────────┘
            │
            ▼
┌────────────────────────────┐
│          End               │
└────────────────────────────┘

6. Usage Instructions

  1. Activate your venv each session (from ~/Desktop):
    source venv/bin/activate
  2. Run the script:
    python extract_meta_from_media.py
  3. Inspect output—for each track you’ll see:
    Processing song1.m4a …
    Estimated global BPM    : 128.12
    Embedded tempo tag      : 128.00 BPM
    Instantaneous tempo stats: min 127.50 | mean 128.05 | max 128.60 BPM
  4. When finished, deactivate:
    deactivate

Written on May 18, 2025


Python Script for BPM & Tempo Extraction from Multiple Media Files (Written June 21, 2025)

This document presents extract_meta_from_media.py (v1.2), an upgraded Python script that scans ~/Desktop/media for audio-capable files (.m4a, .mp3, .mp4), computes each track’s global BPM, and extracts embedded tempo tags plus an instantaneous tempo curve for detailed musical analysis.

1. Objective

The script will:

  1. Locate all supported files (.m4a, .mp3, .mp4) in the media folder on your Desktop.
  2. For each file:
    • Estimate its global BPM using librosa.
    • Read any embedded BPM tag:
      – iTunes tmpo atom for .m4a/.mp4
      – ID3 TBPM frame (or EasyID3 “bpm”) for .mp3
    • Generate a frame-level tempo curve to reveal BPM fluctuations over time.
  3. Print a concise report to the console for every track.

2. Prerequisites

  1. Python 3.8+
  2. Virtual environment (recommended)
    From ~/Desktop:
    python3 -m venv venv
    source venv/bin/activate
    pip install --upgrade pip
  3. Libraries
    pip install librosa mutagen numpy
    Tip: Install FFmpeg for wider codec support:
    # macOS (Homebrew)
    brew install ffmpeg
  4. Folder structure
    Desktop/
    ├── extract_meta_from_media.py
    └── media/
        ├── song1.m4a
        ├── track2.mp3
        ├── clip3.mp4
        └── …

3. Implementation

The complete v1.2 source code is reproduced below.

#!/usr/bin/env python3
"""
Filename  : extract_meta_from_media.py
Version   : 1.2
Author    : Hyunsuk Frank Roh

Description
-----------
Walk through ~/Desktop/media, estimate the *global* BPM of every audio-capable
file (.m4a, .mp3, .mp4), **and** extract extra tempo information:

•  Embedded tempo/BPM tag from the file’s metadata  
   – iTunes 'tmpo' atom for .m4a / .mp4  
   – ID3 'TBPM' (or EasyID3 "bpm") for .mp3  
•  An instantaneous tempo curve so you can see how BPM fluctuates over time.

Dependencies
------------
    pip install librosa mutagen numpy

Usage
-----
    python extract_meta_from_media.py
"""
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import os
from typing import List, Tuple, Optional

import numpy as np
import librosa
from mutagen.mp4 import MP4
from mutagen import File as MutagenFile


# --------------------------------------------------------------------------- #
#                               Core routines                                 #
# --------------------------------------------------------------------------- #
def compute_tempo(
    audio_file_path: str,
    sr_target: int | None = None
) -> Tuple[float, List[float]]:
    """
    Return (global_bpm, tempo_curve).
    """
    y, sr = librosa.load(audio_file_path, sr=sr_target, mono=True)

    # Global BPM via beat tracking
    global_bpm, _ = librosa.beat.beat_track(y=y, sr=sr)

    # Instantaneous tempo curve
    tempo_curve = librosa.beat.tempo(y=y, sr=sr, aggregate=None)

    return float(global_bpm), tempo_curve.tolist()


def read_tagged_tempo(audio_file_path: str) -> Optional[float]:
    """
    Return embedded BPM tag (if any) or None.
    """
    ext = os.path.splitext(audio_file_path)[1].lower()
    try:
        if ext in {".m4a", ".mp4"}:
            audio = MP4(audio_file_path)
            if "tmpo" in audio.tags:
                return float(audio.tags["tmpo"][0])
        elif ext == ".mp3":
            audio = MutagenFile(audio_file_path)
            if audio and audio.tags:
                if "bpm" in audio.tags:
                    return float(audio.tags["bpm"][0])
                if "TBPM" in audio.tags:
                    return float(audio.tags["TBPM"].text[0])
    except Exception:
        pass
    return None


# --------------------------------------------------------------------------- #
#                                Main driver                                  #
# --------------------------------------------------------------------------- #
def main() -> None:
    desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
    media_folder = os.path.join(desktop_path, "media")

    if not os.path.isdir(media_folder):
        print(f"Folder not found: {media_folder}")
        return

    audio_exts = {".m4a", ".mp3", ".mp4"}

    media_files = sorted(
        f for f in os.listdir(media_folder)
        if os.path.splitext(f)[1].lower() in audio_exts
    )
    if not media_files:
        print(f"No supported audio files found in {media_folder}")
        return

    for filename in media_files:
        file_path = os.path.join(media_folder, filename)
        print(f"\nProcessing {filename} …")
        try:
            global_bpm, tempo_curve = compute_tempo(file_path)
            tagged_tempo = read_tagged_tempo(file_path)

            print(f"Estimated global BPM    : {global_bpm:.2f}")
            if tagged_tempo is not None:
                print(f"Embedded tempo tag      : {tagged_tempo:.2f} BPM")
            else:
                print("Embedded tempo tag      : – (none)")

            if tempo_curve:
                arr = np.array(tempo_curve)
                print(
                    "Instantaneous tempo stats:"
                    f" min {arr.min():.2f}"
                    f" | mean {arr.mean():.2f}"
                    f" | max {arr.max():.2f} BPM"
                )
        except Exception as exc:
            print(f"Error processing {filename}: {exc}")


if __name__ == "__main__":
    main()

4. Key Enhancements over v1.1

Component             v1.1 Behavior             v1.2 Upgrade
Target folder         ~/Desktop/m4a             ~/Desktop/media with mixed formats
Supported extensions  .m4a                      .m4a, .mp3, .mp4
read_tagged_tempo()   iTunes tmpo only          Adds ID3 TBPM / EasyID3 “bpm” for .mp3
Error handling        Basic                     Robust across multiple formats
Console output        Per-track stats for .m4a  Same stats for all supported formats

5. Program Flow Diagram (Updated)

┌────────────────────────────┐
│        Start Script        │
└────────────────────────────┘
            │
            ▼
┌────────────────────────────┐
│ 1. Verify ~/Desktop/media  │
└────────────────────────────┘
            │
            ▼
┌────────────────────────────┐
│ 2. List .m4a/.mp3/.mp4     │
└────────────────────────────┘
            │
   ┌────────┴─────────┐
   │ Any files found? │
   └────────┬─────────┘
      Yes   │   No
            │
            ▼
┌──────────────────────────────────────────────┐
│ 3. For each file:                            │
│    • Estimate global BPM                     │
│    • Read embedded BPM tag (if any)          │
│    • Compute tempo curve                     │
│    • Print results                           │
└──────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────┐
│           End              │
└────────────────────────────┘

6. Usage Instructions

  1. Activate your venv (each session):
    source venv/bin/activate
  2. Run the script:
    python extract_meta_from_media.py
  3. Inspect output — example:
    Processing track2.mp3 …
    Estimated global BPM    : 124.37
    Embedded tempo tag      : 125.00 BPM
    Instantaneous tempo stats: min 123.90 | mean 124.25 | max 125.10 BPM
  4. When finished, deactivate:
    deactivate

Happy beat tracking!

Written on June 21, 2025


Mathematical Models


Summing Audio Tracks in Logic Pro (Written May 31, 2025)

Logic Pro carries out calculations in the linear domain (floating-point amplitudes) but shows levels in dBFS. Each track’s gain, pan law, and plug-in chain are applied linearly, the results are summed, and only then is the value converted back to dB for the master fader.

The Core Equation 🔬

\[ S_{\text{mix}}(t)=\sum_{i=1}^{N} g_i\,s_i(t) \] \[ \text{dBFS}=20\log_{10}\!\bigl(\lvert S_{\text{mix}}(t)\rvert\bigr) \]

Because decibels are logarithmic, dB values cannot be added directly; each track must first be converted to linear amplitude (or power) before summation.

Equal vs. Weighted Summation

  1. Equal Weighting (Default)

    • A fader at 0 dB means a linear gain of 1. Two identical, phase-aligned mono tracks at 0 dB rise by +3 dB at the stereo output (pan law accounted for).
    • Real-world material seldom aligns perfectly, so typical boosts are closer to +1 to +2 dB.
  2. Custom Weighting with Faders

    • Lowering a track to -6 dB multiplies its samples by 0.5. In the equation above the term becomes \(0.5\,s_i(t)\), effectively halving that track’s influence.
    • Dynamics processors, sends, and other inserts introduce further, track-specific weighting before the mix bus.

Pan Law Considerations 🌀

Logic Pro’s default pan law is -3 dB center. A mono track panned hard left or right keeps full amplitude on one side, whereas a centered mono signal is attenuated (0.707×) on each side to preserve perceived loudness.

Worked Example 📊

Track                 Fader (dB)  Linear Gain (g)  Peak (dBFS)  Contribution to Mix (dBFS)
Kick                  0           1.00             -6           -6.0
Bass                  -4.5        0.60             -9           -13.5
Pads (stereo)         -6          0.50             -12          -18.0
Summed peak (linear)                                            ≈ -4.0 dBFS
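A quick way to sanity-check these numbers is to run the dB↔linear conversions in code. The sketch below (plain Python; the fader-adjusted peak levels are taken from the example, with the bass at −9 dBFS − 4.5 dB = −13.5 dBFS) computes the coherent-sum ceiling — the loudest the mix bus could read if every peak lined up in phase. Real material rarely aligns, so the measured summed peak sits below this bound.

```python
import math

def db_to_gain(db: float) -> float:
    return 10 ** (db / 20.0)

def gain_to_db(g: float) -> float:
    return 20.0 * math.log10(g)

# Fader-adjusted peak contributions from the worked example (dBFS)
contributions_db = [-6.0, -13.5, -18.0]    # kick, bass, pads

# Worst case: all peaks coincide in phase, so linear amplitudes add
ceiling = gain_to_db(sum(db_to_gain(db) for db in contributions_db))
print(f"Coherent-sum ceiling: {ceiling:.1f} dBFS")
```

The ceiling lands around −1.5 dBFS; the gap down to the observed peak reflects partial phase cancellation between tracks.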

Practical Guidance 🎚️

  1. Maintain headroom: keep master peaks between -6 dBFS and -3 dBFS to avoid inter-sample clipping when tracks reinforce one another.
  2. If the mix bus clips, trim individual faders rather than lowering the master fader to preserve plug-in gain staging.
  3. Use VU-style meters for perceived loudness; peak meters alone cannot reveal RMS energy buildup.

Written on May 31, 2025


Digital waveform amplitude & bidirectional dynamics (Written May 31, 2025)

Acoustic events are stored as waveforms. The vertical axis shows instantaneous amplitude; the horizontal axis shows time. Greater distance from the mid-line (zero) means greater air-pressure deviation and therefore louder perceived sound.

I. Digital full-scale reference (0 dBFS)

In PCM systems every sample is a signed number between -1.0 and +1.0. Both limits equal 0 dB full scale (0 dBFS). Attempts to exceed them cause quantization overflow; data are truncated and clipping distortion occurs.

When |sample| ≥ 1.0 (0 dBFS) the waveform is clipped. Logic Pro peak meters turn red to indicate this condition.

II. Ideal sinusoid and amplitude limit

An ideal sine of frequency f and phase ϕ is \[ A(t)=A_{\max}\sin\!\bigl(2\pi f t+\phi\bigr) \]. To avoid clipping require \(A_{\max}\le 1.0\).

Chart 1 — Sine wave approaching 0 dBFS
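The clipping condition is easy to demonstrate numerically. A short NumPy sketch (the 440 Hz frequency and 1.2 amplitude are illustrative) generates a sine that exceeds full scale and shows what a fixed-point converter would store:

```python
import numpy as np

fs = 48_000                      # sample rate (Hz)
f, amp = 440.0, 1.2              # amplitude deliberately above full scale
t = np.arange(fs) / fs
x = amp * np.sin(2 * np.pi * f * t)

clipped = np.clip(x, -1.0, 1.0)  # what a PCM converter would store
n_over = int(np.sum(np.abs(x) >= 1.0))
print(f"{n_over} of {x.size} samples exceed 0 dBFS")
```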

III. Bidirectional amplitude and the mid-line

A. Physical interpretation

A loudspeaker diaphragm moves forward (compression) and backward (rarefaction). Digital audio encodes this as a signed-value stream:

Sample value  Acoustic state  Perceptual result
0.0 → +1.0    Compression     Loud phase
0.0           Equilibrium     Silence / zero crossing
0.0 → -1.0    Rarefaction     Equally loud, opposite polarity

B. Why polarity sounds identical 🙌

C. Mid-line (0) as a diagnostic reference ✨

  1. Zero crossings reveal fundamental frequency.
  2. DC offset lifts the whole waveform, wasting headroom and inviting clipping; apply high-pass or DC-removal.
  3. Digital silence = continuous zeros; any non-zero sample creates audible output.
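These diagnostics are straightforward to compute. A small sketch (NumPy assumed; the 100 Hz tone and +0.1 offset are invented for illustration) measures and removes a DC offset, then estimates the fundamental from zero crossings:

```python
import numpy as np

fs = 8_000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 100 * t) + 0.1   # 100 Hz tone with a DC offset

dc = x.mean()
x_centered = x - dc                            # simple DC removal

# Count sign changes; a sine crosses zero twice per cycle
crossings = np.sum(np.diff(np.signbit(x_centered)) != 0)
f0_est = crossings / 2 / (len(x) / fs)
print(f"DC offset {dc:+.3f}, estimated fundamental ≈ {f0_est:.0f} Hz")
```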

Chart 2 — Compression (v ≥ 0) vs rarefaction (v < 0)

IV. Practical gain-staging recommendations 🚀

  1. Record peaks at least 3 dB below 0 dBFS to preserve headroom.
  2. Insert a brick-wall limiter on the master bus if track summation risks clipping.
  3. React immediately to red peak indicators by lowering track gain.

V. Engineering takeaways

VI. Summary

Waveform height from the mid-line encodes loudness. Exceeding ±1.0 causes clipping at 0 dBFS. Because ears sense absolute pressure change, positive and negative peaks sound the same. Thoughtful gain staging—keeping ample headroom and monitoring polarity symmetry—prevents distortion and maintains audio quality.

Written on May 31, 2025


Perceptual loudness normalization for multitrack mixing (Written June 7, 2025)

Balancing track levels by perceived loudness relies on two pillars: the Equal-Loudness Contour (ISO 226) that models frequency sensitivity and the ITU-R BS.1770 algorithm that outputs integrated loudness in LUFS. A streamlined workflow:

  1. Process every stem through the BS.1770 K-weighting filter and read its integrated LUFS.
  2. Select a platform-appropriate target, for example −16 LUFS for podcasts.
  3. Apply the simple gain offset  \( \Delta G_{\text{dB}} = L_{\text{target}} - L_{\text{track}} \) via a fader or Gain plug-in.

Advanced scripts replace step 3 with a Zwicker specific-loudness or partial-loudness routine that respects critical-band masking. Logic Pro’s Loudness Meter + Gain plug-ins are sufficient, while commercial tools such as iZotope Neutron and Sonible smart:limit automate the entire process internally.

I. Frequency-dependent human hearing

II. Practical standard — ITU-R BS.1770 K-weighting / LUFS

  1. Core measurement formula

    \( L_{\text{LKFS}} = -0.691 + 10 \log_{10}\!\Bigl(\displaystyle\sum_{i} G_i \, \overline{x_{i,K}^2}\Bigr) \)

    Integrated loudness sums K-weighted mean-square energy across channels, converts the result to decibels referenced to full scale, and applies an empirically derived −0.691 dB offset so that calibrated pink noise reads 0 LU.

  2. Term-by-term breakdown

    • \( x_{i,K}(t) \): sample of channel i after the K-weighting filter (a low-frequency high-pass plus a ≈ +4 dB high-frequency shelf above roughly 2 kHz).
    • \( \overline{x_{i,K}^2} \): mean-square energy inside a 400 ms analysis block.
    • \( G_i \): channel weight that compensates for surround placement (see matrix below).
    • 10 log10: converts summed power to decibels relative to digital full scale.
    • −0.691 dB: bias aligning the objective value with subjective loudness tests.
  3. Channel weight matrix \(G_i\)

    Channel             Weight  Rationale
    L / R / C           1.00    On-axis reference
    LS / RS             1.41    Rear speakers radiate off-axis
    LFE                 —       Excluded from the loudness measurement
    Height (immersive)  1.00    Elevation is inherently prominent
  4. Dual-gate time integration

    Each 400 ms block first passes an absolute gate at −70 LKFS, then a relative gate 10 dB below the running average. This rejects silence and low-level ambience, focusing the metric on program-relevant loudness.

  5. LU, LKFS, and LUFS

    One Loudness Unit (LU) equals 1 dB when measured with BS.1770. LUFS (loudness units relative to full scale) is therefore numerically identical to LKFS; for example, YouTube targets about −14 LUFS.

  6. Origin of the −0.691 dB offset

    Listening tests with full-band pink noise revealed a systematic 0.691 dB gap between perceived loudness and calculated energy, prompting inclusion of the constant for perceptual alignment.

  7. Worked example

    A stereo mix measures −18.2 LUFS (L) and −18.0 LUFS (R). Power-averaging the two channel readings (each already includes the −0.691 dB offset) gives
    \( \displaystyle L_{\text{mix}} = 10 \log_{10}\!\Bigl(\tfrac{1}{2}\bigl(10^{-1.82} + 10^{-1.80}\bigr)\Bigr) \approx -18.1 \text{ LUFS} \)
    To hit a podcast target of −16 LUFS:
    \( \Delta G = -16 - (-18.1) = +2.1 \text{ dB} \) of gain is required.
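The same arithmetic is a few lines of code. This is a sketch: `combine_lufs` is a hypothetical helper that power-averages per-channel readings, not the full BS.1770 gated pipeline.

```python
import math

def combine_lufs(levels_db):
    """Power-average per-channel LUFS readings into one figure (illustrative helper)."""
    mean_power = sum(10 ** (L / 10) for L in levels_db) / len(levels_db)
    return 10 * math.log10(mean_power)

l_mix = combine_lufs([-18.2, -18.0])
delta_g = -16.0 - l_mix          # gain offset toward a -16 LUFS podcast target
print(f"L_mix ≈ {l_mix:.1f} LUFS, apply {delta_g:+.1f} dB")
```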

III. Per-track automatic gain equation

Step  Operation                                              Purpose
1     K-weighting                                            Mimic human frequency response
2     Short-term LUFS (400 ms)                               Estimate perceived level
3     \( \Delta G = L_{\text{target}} - L_{\text{track}} \)  Compute gain offset
4     Apply Gain / write fader automation                    Normalize track loudness

Typical targets: −23 LUFS (broadcast), −16 LUFS (streaming & podcasts), −14 LUFS (mainstream music video).
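In script form, the per-track offset of the table above is one subtraction per stem. The track names and LUFS readings below are invented for illustration:

```python
TARGET_LUFS = -16.0                                            # streaming / podcast target

measured = {"vocals": -19.3, "drums": -14.8, "bass": -21.0}    # illustrative readings

for name, lufs in measured.items():
    delta_g = TARGET_LUFS - lufs                               # gain offset in dB
    print(f"{name:6s}: apply {delta_g:+.1f} dB")
```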

IV. Spectral fine-tuning — Zwicker & partial loudness

V. Logic Pro practical workflow

  1. Insert Loudness Meter on each stem, solo, and read the integrated LUFS.
  2. Match the target by trimming Gain or the channel fader by \( \Delta G \).
  3. Use Volume Relative automation for section-specific offsets without altering the static fader position.
  4. Finish with Loudness Range checks to confirm macro-dynamics.
  5. Optional: engage an AI assistant (Neutron Mix Assistant, smart:limit) for one-click loudness alignment and masking analysis.

VI. Limitations & best practice

Key equation recap ✏️

\( \boxed{\; \Delta G_{\text{dB}} = L_{\text{target (LUFS)}} - L_{\text{track (LUFS)}} \;} \)

Running this subtraction in a loop or script updates every fader so the mix starts from a scientifically grounded loudness foundation, ready for creative processing.

Written on June 7, 2025


Bit depth and sample rate in digital audio (Written June 7, 2025)

I. Core definitions

Bit depth determines how finely amplitude is described; sample rate determines how often it is recorded. Together, the two define both the numerical fidelity a machine can store and the perceptual fidelity a human can hear.

II. Mathematical consequences

III. Practical meaning for devices 🖥️

IV. Perceptual meaning for listeners 👂

V. Comparison table

Configuration  Sample rate  Bit depth  Theoretical dynamic range  Primary use case
CD Audio       44.1 kHz     16-bit     ≈ 98 dB                    Consumer music distribution
Broadcast WAV  48 kHz       24-bit     ≈ 146 dB                   Film / streaming production
Hi-Res         96 kHz       24-bit     ≈ 146 dB                   Archival & audio restoration
DXD            352.8 kHz    24-bit     ≈ 146 dB                   Hybrid PCM/DSD workflows

VI. Best-practice guidelines ✅

Key formulas recap ✏️

\( f_s \ge 2 f_{\max} \)  — Nyquist criterion

\( \text{SQNR} \approx 6.02 N + 1.76 \;\text{dB} \)  — dynamic range per bit depth
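Both recap formulas are one-liners; the SQNR rule reproduces the dynamic-range column of the comparison table:

```python
def sqnr_db(bits: int) -> float:
    # theoretical dynamic range of an N-bit quantizer driven by a full-scale sine
    return 6.02 * bits + 1.76

def min_sample_rate(f_max_hz: float) -> float:
    # Nyquist criterion: sample at least twice the highest frequency of interest
    return 2.0 * f_max_hz

print(f"16-bit: {sqnr_db(16):.1f} dB, 24-bit: {sqnr_db(24):.1f} dB")
print(f"20 kHz audio needs fs >= {min_sample_rate(20_000):.0f} Hz")
```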

Bit depth determines how finely amplitude is described; sample rate determines how often those descriptions occur. Together they define both the numerical fidelity a machine can store and the perceptual fidelity a human can hear.

Written on June 7, 2025


Logarithmic perception of pitch and loudness in human hearing (Written June 7, 2025)

I. Frequency and perceived pitch

A. Octave equivalence

The auditory system interprets pitch on a base-2 logarithmic axis. An octave step is defined by (\(P = \log_{2}\! \bigl(f / f_{0}\bigr)\)), so doubling frequency raises pitch by exactly one octave. For example, 27.5 Hz (A0) → 55 Hz (A1) → 110 Hz (A2).

B. Psychoacoustic refinements

The mel scale offers finer resolution: (\(\text{mel} \approx 2595 \log_{10} (1 + f/700)\)). Low-frequency bins appear densely packed, while spacing widens toward the treble, mirroring subjective pitch growth.
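Both mappings translate directly to code; the helper names below are illustrative:

```python
import math

def octave_index(f_hz: float, f0_hz: float = 27.5) -> float:
    # pitch in octaves above the reference (A0 = 27.5 Hz)
    return math.log2(f_hz / f0_hz)

def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(octave_index(440.0))       # A4 sits exactly 4 octaves above A0
print(hz_to_mel(1000.0))         # ≈ 1000 mel, by construction of the scale
```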

II. Sound-pressure level and perceived loudness

A. Decibel definition

Sound-pressure level (SPL) employs a base-10 logarithm: (\(L_{\text{dB}} = 20 \log_{10} (p / p_{0})\)), with \(p_{0} = 20\;\mu\text{Pa}\) as the threshold-of-hearing reference. A 6 dB increase doubles pressure amplitude yet is judged only “slightly louder,” honoring the Weber–Fechner law (\(S = k \log (I / I_{0})\)).
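A two-line check confirms the pressure-doubling rule, with \(p_0 = 20\;\mu\text{Pa}\) as above:

```python
import math

P0 = 20e-6                                   # hearing-threshold reference, pascals

def spl_db(p_pa: float) -> float:
    return 20.0 * math.log10(p_pa / P0)

print(spl_db(2 * P0))                        # doubling pressure adds ~6.02 dB
print(spl_db(0.02))                          # 0.02 Pa corresponds to 60 dB SPL
```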

III. Piano keyboard versus auditory limits 🎹

Key position Frequency (Hz) Perceptual notes
A027.5Lowest practical musical pitch; borderline tactile
A4440Concert-pitch reference
C8≈ 4186Highest piano key; clearly audible to most listeners
+1 octave≈ 8 kHzAudible but devoid of distinct melodic identity
+2 octaves≈ 16 kHzPerceived by youth; sensitivity declines with age

Frequencies below 20 Hz (e.g., 13.75 Hz, one octave beneath A0) exceed the cochlea’s temporal-resolution limit; vibrations are sensed as rhythmic flutter rather than tonal pitch.

IV. Rationale for sub-20 Hz filtration 🛠️

V. Age-related high-frequency decline 👂

Key formulas recap ✏️

\(P = \log_{2} (f / f_{0})\) — octave-based pitch index

\(L_{\text{dB}} = 20 \log_{10} (p / p_{0})\) — sound-pressure level

Pitch and loudness are transduced through logarithmic mappings, enabling the auditory system to condense an enormous dynamic and spectral span into a manageable perceptual range. Musical instrument design, audio metering, and mix-engineering practices therefore align with base-2 and base-10 log scales to remain compatible with human hearing.

Written on June 7, 2025


The mathematical foundations of musical harmony (Written June 8, 2025)

Musical harmony rests upon deep mathematical principles. The present overview respectfully examines the key equations and structures that underlie tonal organization, tuning, and chordal relationships, offering a concise yet comprehensive synthesis for scholarly publication.

Frequency, pitch, and the harmonic series

When a resonant body vibrates at a fundamental frequency \(f_{0}\), overtones arise at integer multiples \(n\,f_{0}\). This integer progression, termed the harmonic series, shapes consonance perception and tonal color.

Chart — Harmonic series frequencies for the first sixteen partials \((f_{0}=100\text{ Hz})\).
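The chart's data reduce to a one-line comprehension over the integer multiples:

```python
f0 = 100.0                                    # fundamental, as in the chart
partials = [n * f0 for n in range(1, 17)]     # first sixteen harmonics
print(partials[:4], "...", partials[-1])
```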

Tuning systems and frequency equations

  1. Just intonation

    Just intonation defines every interval by a simple rational ratio \(p:q\). For example, the perfect fifth employs \(3:2\). Given a fundamental \(f_{0}\), any pitch in a just system is \(f = \tfrac{p}{q}\,f_{0}\).

  2. Equal temperament

    In twelve-tone equal temperament (12-TET) the octave is divided logarithmically. The frequency of a note \(n\) semitones above the reference is \(f(n) = f_{0}\,2^{\,n/12}\). This exponential equation ensures transpositional symmetry but introduces minute deviations from just ratios.

    • Octave invariance: doubling frequency every twelve steps.
    • Modular arithmetic: pitch classes operate in \( \mathbb{Z}_{12} \).
    • Circle of fifths: successive seven-semitone moves generate the cyclic group \( \mathbb{Z}_{12} \), visiting all twelve pitch classes.
  3. Cents and logarithmic measurement

    Pitch distance is often expressed in cents, where one cent equals \(1/100\) of a semitone: \(c = 1200 \log_{2}\!\bigl(\tfrac{f_{2}}{f_{1}}\bigr).\)

    Interval        Just intonation ratio  Equal temperament ratio  Cent difference (JI − ET)
    Unison          1/1                    1.000000                 +0.00
    Minor second    16/15                  1.059463                 +11.73
    Major second    9/8                    1.122462                 +3.91
    Minor third     6/5                    1.189207                 +15.64
    Major third     5/4                    1.259921                 −13.69
    Perfect fourth  4/3                    1.334840                 −1.96
    Tritone         45/32                  1.414214                 −9.78
    Perfect fifth   3/2                    1.498307                 +1.96
    Minor sixth     8/5                    1.587401                 +13.69
    Major sixth     5/3                    1.681793                 −15.64
    Minor seventh   9/5                    1.781797                 +17.60
    Major seventh   15/8                   1.887749                 −11.73
    Octave          2/1                    2.000000                 +0.00
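The cent differences in the table follow directly from the cents formula; this sketch spot-checks two rows (each ET semitone is exactly 100 cents):

```python
import math
from fractions import Fraction

def cents(ratio: float) -> float:
    return 1200.0 * math.log2(ratio)

# (interval name, just-intonation ratio, semitones in 12-TET)
rows = [("Perfect fifth", Fraction(3, 2), 7),
        ("Major third", Fraction(5, 4), 4)]

for name, ji, n in rows:
    diff = cents(float(ji)) - 100.0 * n
    print(f"{name}: JI - ET = {diff:+.2f} cents")
```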

Chord structures and vector spaces

  1. Pitch-class set theory

    Chordal identity may be encoded as ordered or unordered pitch-class sets within \(\mathbb{Z}_{12}\). Operations of transposition \(T_{n}\) and inversion \(I_{n}\) correspond to affine transformations preserving set equivalence classes.

  2. Fourier representations

    The discrete Fourier transform (DFT) of pitch-class occurrences yields phase-angle spectra, illuminating interval content and aiding similarity measures between chords or scales.

Transformational theory and group operations

  1. Neo-Riemannian PLR group

    Transformations Parallel (P), Leittonwechsel (L), and Relative (R) act on triads, generating a dihedral group of order 24. Matrix encoding facilitates algebraic navigation through triadic space, modeling smooth harmonic progressions.

Mathematical models of voice leading

  1. Geometric chord space

    Recent studies embed voice leading as geodesic motion within high-dimensional orbifolds, where distance metrics correspond to total voice displacement. This geometric framework explicates common-tone retention and parsimonious motion.

Written on June 8, 2025


Waveform Analysis of Sound Mikio Tohyama

[Chapter 2] Discrete sequences and their Fourier transform (Written January 25, 2026)

A modest overview is presented on discrete sequences, generating functions, convolution, feedback stability in the \(z\)-domain, and the Fourier transform on the unit circle. The discussion is intentionally introductory, yet attempts to preserve the structural relationships that make these tools effective in signal analysis.

I. From continuous-time functions to discrete sequences

  1. Sequence notation and sampling

    Discrete-time analysis often replaces a continuous function \(s(t)\) with a sequence \(x(n)\) indexed by an integer \(n\). A common sampling model selects values every \(T_s\) seconds and forms a sequence such as

    \[ x(n) = T_s\, s(t)\bigr\rvert_{t=nT_s}. \]

    Here \(T_s\) is the sampling period, and the sampling frequency is \(F_s = 1/T_s\) (Hz). The scaling factor \(T_s\) is sometimes included to maintain consistency with integral–sum relationships; the essential point is that the signal becomes a list of values indexed over integers.

  2. Core symbols used throughout

    • \(n\): integer sample index
    • \(t\): continuous-time variable (used only to define sampling)
    • \(T_s\) and \(F_s\): sampling period and sampling frequency
    • \(z\): complex variable in the \(z\)-domain; stability is tied to locations inside the unit disc
    • \(\Omega\): normalized angular frequency, typically \(\Omega = \omega T_s\)

II. Generating functions and convolution

  1. Generating function as a formal power series

    A discrete sequence \(a(n)\) can be associated with a generating function (formal power series) in a variable \(X\):

    \[ A(X) = \sum_{m} a(m) X^{m}, \qquad B(X) = \sum_{n} b(n) X^{n}. \]

    Although \(X\) may be treated as an indeterminate (formal variable), the algebraic structure already reveals how sequences combine through multiplication.

  2. Convolution derived from polynomial multiplication

    Multiplying generating functions produces a new series \(C(X)=A(X)B(X)\):

    \[ \begin{aligned} C(X) &= \left(\sum_{m} a(m)X^{m}\right)\left(\sum_{n} b(n)X^{n}\right) \\ &= \sum_{p} c(p)X^{p}, \end{aligned} \qquad c(p) = \sum_{m} a(m)\,b(p-m). \]

    The coefficients \(c(p)\) define the convolution of \(a(n)\) and \(b(n)\). This operation is commutative because the product \(A(X)B(X)\) is commutative, yielding \(a*b=b*a\).

  3. A small worked example

    Consider the finite sequences \(a=\{1,1\}\) and \(b=\{1,-1\}\). Their convolution forms \(c=a*b\) with coefficients:

    Index \(n\) \(c(n)\) Computation
    0 1 \(c(0)=a(0)b(0)=1\cdot 1\)
    1 0 \(c(1)=a(0)b(1)+a(1)b(0)=1\cdot(-1)+1\cdot 1\)
    2 -1 \(c(2)=a(1)b(1)=1\cdot(-1)\)

    Therefore \(\{1,0,-1\} = \{1,1\} * \{1,-1\}\). This illustrates a practical interpretation: convolution computes the coefficients of a product series.
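The same product-of-series computation is available as `np.convolve` (NumPy assumed):

```python
import numpy as np

a = [1, 1]
b = [1, -1]
c = np.convolve(a, b)        # coefficients of A(X) * B(X): 1, 0, -1
print(c)

# Commutativity of a*b follows from commutativity of the product A(X)B(X)
assert np.array_equal(np.convolve(a, b), np.convolve(b, a))
```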

III. z-domain feedback, poles, and stability

  1. A closed-loop model and its transfer function

    A feedback loop may be modeled by two transfer functions in the \(z^{-1}\) domain: an open-loop block \(G(z^{-1})\) and a feedback path \(H(z^{-1})\). The closed-loop transfer function can be written as

    \[ L(z^{-1}) = \frac{H(z^{-1})}{1 - G(z^{-1})H(z^{-1})} = H(z^{-1})\,\frac{1}{E(z^{-1})}, \qquad E(z^{-1}) = 1 - G(z^{-1})H(z^{-1}). \]

    The denominator \(E(z^{-1})\) governs the pole locations of the closed loop. When poles drift outside the unit disc, the loop may exhibit runaway amplification, which in acoustics can manifest as sustained howling or “singing.”

  2. Stability criteria and the unit disc

    A commonly used stability requirement is that the impulse response \(f(n)\) of the loop be square-summable:

    \[ \sum_{n=0}^{\infty} \lvert f(n)\rvert^{2} < \infty. \]

    This condition is satisfied when all poles of the closed-loop transfer function lie strictly inside the unit disc. Equivalently, the zeros of \(E(z^{-1})\) must lie inside the unit disc. On the unit circle \(z=e^{i\Omega}\), a related engineering check compares the magnitude of the open-loop product \(G(z^{-1})H(z^{-1})\) against unity.

  3. Single-zero illustration

    Consider a simplified case with

    \[ H(z^{-1}) = 1 - a z^{-1}, \qquad G(z^{-1}) = b, \quad 0<b<1. \]

    The closed-loop transfer becomes

    \[ L(z^{-1}) = \frac{1-a z^{-1}}{1-b(1-a z^{-1})} = \frac{1-a z^{-1}}{1-b}\cdot\frac{1}{1-\alpha z^{-1}}, \qquad \alpha = -\frac{ab}{1-b}. \]

    The associated impulse response takes the form

    \[ f(n)=\frac{\alpha^{n}}{1-b}, \qquad n\ge 0. \]

    Stability follows when \(|\alpha|<1\), which is precisely the requirement that the pole \(z=\alpha\) remain inside the unit disc.
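The stability claim can be checked by simulating the difference equation implied by \(L(z^{-1})\), namely \((1-b)f(n) + ab\,f(n-1) = x(n) - a\,x(n-1)\); the values a = 0.9, b = 0.4 below are illustrative:

```python
# Simulate (1-b)f(n) + a*b*f(n-1) = x(n) - a*x(n-1), the difference
# equation behind L(z^{-1}); a unit impulse probes the impulse response.
a, b = 0.9, 0.4                   # illustrative, with 0 < b < 1
alpha = -a * b / (1 - b)          # closed-loop pole, here -0.6

N = 50
x = [1.0] + [0.0] * (N - 1)       # unit impulse
f = [0.0] * N
for n in range(N):
    x_prev = x[n - 1] if n >= 1 else 0.0
    f_prev = f[n - 1] if n >= 1 else 0.0
    f[n] = (x[n] - a * x_prev - a * b * f_prev) / (1 - b)

# Geometric decay at rate |alpha| < 1 confirms stability
print(alpha, f[3] / f[2])
```

After the first couple of samples the response shrinks by a factor \(\alpha\) each step, exactly as the pole location predicts.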

  4. Ideal inverse feedback and a practical caution

    An idealized way to suppress positive feedback inserts an inverse block

    \[ G_i(z^{-1}) = -\frac{1}{H(z^{-1})} = -H^{-1}(z^{-1}). \]

    With a constant gain \(G(z^{-1})=b>0\), the resulting closed-loop response simplifies to

    \[ L(z^{-1}) = \frac{H(z^{-1})}{1+b}. \]

    This form contains no closed-loop poles introduced by feedback, so instability is avoided in the algebraic model. However, inverse systems are not always physically realizable or stable. A stable inverse generally requires all zeros of \(H(z^{-1})\) to lie inside the unit disc.

IV. Fourier transform on the unit circle

  1. Fourier transform as a unit-circle evaluation

    The \(z\)-transform of a sequence provides a complex function \(X(z^{-1})\). Evaluating it on the unit circle \(z=e^{i\Omega}\) yields the Fourier transform:

    \[ X(e^{-i\Omega}) = \sum_{n=-\infty}^{\infty} x(n)e^{-i\Omega n}. \]

    The angle \(\Omega\) is a normalized angular frequency, commonly \(\Omega=\omega T_s\). Since \(e^{-i(\Omega+2\pi)n}=e^{-i\Omega n}\), the spectrum is periodic in \(\Omega\) with period \(2\pi\).
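Evaluating the sum directly makes the \(2\pi\) periodicity concrete (NumPy assumed; the three-point sequence is arbitrary):

```python
import numpy as np

x = np.array([1.0, 1.0, 0.5])                 # arbitrary short sequence
n = np.arange(len(x))

def ft(omega):
    # direct evaluation of X(e^{-i*Omega}) = sum_n x(n) e^{-i*Omega*n}
    return np.sum(x * np.exp(-1j * omega * n))

omegas = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)
X = np.array([ft(w) for w in omegas])
X_shifted = np.array([ft(w + 2 * np.pi) for w in omegas])
print(np.allclose(X, X_shifted))              # spectrum repeats with period 2*pi
```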

  2. Frequency response interpretation

    When \(x(n)\) is an impulse response of a linear time-invariant system, \(X(e^{-i\Omega})\) is the system’s frequency response. Magnitude and phase describe, respectively, gain and delay characteristics as functions of frequency.

V. Real and imaginary parts, even and odd symmetry

  1. Separating real and imaginary parts

    For a real, finite-length sequence supported on \(0\le n\le N-1\), the Fourier transform can be decomposed into cosine and sine sums:

    \[ \Re\{X(e^{-i\Omega})\} = \sum_{n=0}^{N-1} x(n)\cos(\Omega n), \qquad \Im\{X(e^{-i\Omega})\} = -\sum_{n=0}^{N-1} x(n)\sin(\Omega n). \]

    The real part is an even function of \(\Omega\), while the imaginary part is an odd function of \(\Omega\).

  2. Even and odd sequences

    An even sequence satisfies \(x_e(n)=x_e(-n)\), and an odd sequence satisfies \(x_o(n)=-x_o(-n)\) with \(x_o(0)=0\). These symmetries yield simplified Fourier forms:

    \[ X_e(e^{-i\Omega}) = \sum_{n=0}^{N-1} x_e(n)\cos(\Omega n), \qquad X_o(e^{-i\Omega}) = -i\sum_{n=0}^{N-1} x_o(n)\sin(\Omega n). \]

    Accordingly, the transform of a real even sequence is purely real, while the transform of a real odd sequence is purely imaginary.

  3. Decomposing a causal sequence

    A causal sequence is supported on nonnegative indices:

    \[ x_c(n)= \begin{cases} x(n), & n\ge 0,\\ 0, & n<0. \end{cases} \]

    Such a sequence may be expressed as the sum of its even and odd parts:

    \[ x_c(n)=x_e(n)+x_o(n). \]

    This decomposition provides a structured way to relate cosine-based and sine-based contributions to the real and imaginary parts of the spectrum.

VI. Analytic representation, envelope, and instantaneous phase

  1. Complex exponentials behind real sinusoids

    Real sinusoids can be expressed as sums of complex exponentials at positive and negative frequencies:

    \[ \cos(\Omega_0 n)=\frac{1}{2}\left(e^{i\Omega_0 n}+e^{-i\Omega_0 n}\right), \qquad \sin(\Omega_0 n)=\frac{1}{2i}\left(e^{i\Omega_0 n}-e^{-i\Omega_0 n}\right). \]

    This representation clarifies why idealized sinusoidal spectra consist of two symmetric frequency components. Retaining only one side (positive or negative frequencies) reconstructs a corresponding complex sinusoid.

  2. Constructing an analytic spectrum

    The analytic representation of a real sequence is commonly defined by keeping only the nonnegative-frequency portion of the spectrum (doubling it except at \(\Omega=0\) and \(\Omega=\pi\)):

    \[ Z(e^{-i\Omega})= \begin{cases} 2X(e^{-i\Omega}), & 0<\Omega<\pi,\\ X(e^{-i\Omega}), & \Omega=0,\ \pi,\\ 0, & \pi<\Omega<2\pi. \end{cases} \]

    The inverse Fourier transform of \(Z(e^{-i\Omega})\) yields a complex sequence \(z(n)\) whose real part equals the original real sequence. A common notation is

    \[ z(n)=x(n)+iy(n), \]

    with a quadrature component \(y(n)\) that can be expressed (in terms of the real and imaginary parts of \(X(e^{-i\Omega})\)) as

    \[ y(n)=\frac{1}{\pi}\int_{0}^{\pi} \Bigl( X_r(e^{-i\Omega})\,\sin(n\Omega) +X_i(e^{-i\Omega})\,\cos(n\Omega) \Bigr)\,d\Omega. \]
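    The piecewise spectrum construction above can be carried out directly with the FFT: keep the DC and Nyquist bins, double the positive-frequency bins, and zero the negative-frequency bins. This is the same construction SciPy's scipy.signal.hilbert uses, so the two should agree (the test sequence below is an arbitrary choice):

```python
import numpy as np
from scipy.signal import hilbert

N = 256
n = np.arange(N)
x = np.cos(2 * np.pi * 5 * n / N)            # real test sequence

# Build the analytic spectrum Z by hand: keep DC and Nyquist,
# double positive frequencies, zero negative frequencies.
X = np.fft.fft(x)
Z = np.zeros_like(X)
Z[0] = X[0]
Z[1:N // 2] = 2 * X[1:N // 2]
Z[N // 2] = X[N // 2]                        # Nyquist bin (N even)
z = np.fft.ifft(Z)

assert np.allclose(z.real, x)                # real part recovers x(n)
assert np.allclose(z, hilbert(x))            # matches SciPy's analytic signal
```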

  3. Magnitude–phase form and reconstruction

    An analytic sequence admits a polar form:

    \[ z(n)=x(n)+iy(n)=\lvert z(n)\rvert e^{i\theta(n)}. \]

    The instantaneous magnitude and phase are defined by

    \[ \lvert z(n)\rvert^{2}=x^{2}(n)+y^{2}(n), \qquad \theta(n)=\tan^{-1}\!\left(\frac{y(n)}{x(n)}\right). \]

    Consequently, the original real sequence can be written as

    \[ x(n)=\Re\{z(n)\}=\lvert z(n)\rvert\cos\bigl(\theta(n)\bigr). \]
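    For an amplitude-modulated sinusoid, the envelope \(\lvert z(n)\rvert\) recovers the modulation and \(x(n)=\lvert z(n)\rvert\cos\theta(n)\) holds exactly. A minimal sketch (carrier and modulation frequencies are arbitrary choices; np.angle computes the four-quadrant arctangent of \(y/x\)):

```python
import numpy as np
from scipy.signal import hilbert

N = 512
n = np.arange(N)
env_true = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * n / N)   # slow amplitude modulation
x = env_true * np.cos(2 * np.pi * 40 * n / N)          # AM sinusoid

z = hilbert(x)                         # analytic sequence z(n) = x(n) + i y(n)
envelope = np.abs(z)                   # |z(n)|
theta = np.angle(z)                    # instantaneous phase, atan2(y(n), x(n))

assert np.allclose(x, envelope * np.cos(theta))        # x = |z| cos(theta)
assert np.allclose(envelope, env_true)                 # envelope tracks the AM
```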

  4. Compact reference table and a conceptual map

    Object | Definition | Primary role | Typical insight
    Sequence \(x(n)\) | Samples indexed by integers | Time-domain description | Supports convolution, causality, impulse response
    Generating function \(A(X)\) | \(\sum_n a(n)X^n\) | Algebraic manipulation | Product \(\leftrightarrow\) convolution of coefficients
    \(z\)-domain transfer \(H(z^{-1})\) | Rational function in \(z^{-1}\) | Feedback analysis | Poles/zeros determine stability and resonance
    Fourier transform \(X(e^{-i\Omega})\) | \(\sum_n x(n)e^{-i\Omega n}\) | Spectral description | Periodic spectrum; magnitude and phase vs. frequency
    Analytic sequence \(z(n)\) | Positive-frequency spectrum only | Envelope and phase | \(\lvert z(n)\rvert\) as envelope, \(\theta(n)\) as phase

    A minimal conceptual chart is included to summarize the main relationships:

    Sequence x(n)
      |
      |  z-transform / transfer representation: X(z), H(z^{-1})
      v
    Complex z-plane (poles and zeros)
      |
      |  Restrict to the unit circle: z = e^{iΩ}
      v
    Fourier transform X(e^{-iΩ})  (periodic in Ω with period 2π)
      |
      |  Keep only nonnegative frequencies (analytic spectrum)
      v
    Analytic sequence z(n) = x(n) + i y(n)
      |
      |  Polar form
      v
    Envelope |z(n)|   and   instantaneous phase θ(n)

    Key takeaways. Convolution can be viewed as coefficient extraction from a product of generating functions. Feedback stability is governed by pole locations relative to the unit disc. The Fourier transform is obtained by evaluating the \(z\)-domain representation on the unit circle, and the analytic representation isolates positive-frequency content to yield envelope and instantaneous phase descriptions.

    The treatment above is necessarily selective. Nevertheless, the relationships collected here often provide a dependable scaffold for further study and applied work in discrete-time signal processing.

Written on January 25, 2026


Reference

Tohyama, M. (2015). Waveform analysis of sound (Mathematics for Industry, Vol. 3). Springer. ISBN 4431544232