beyond the veil

Singularity Awakens

Overview

Journey through the birth of consciousness, the veil of death, and the release into rebirth. ‘beyond the veil’ unfolds as one unbroken psychedelic shot. Ever present is the mysterious entity ‘singularity,’ a cosmic digital being of infinite forms who guides the viewer through a subliminal journey of impossible visuals. Each scene bleeds into the next — figures, objects, and worlds reshape in one continuous flow. Every sound has a motion; every drop, synth stab, orchestral swell, and kick drum is felt through the screen. Through these abstract visuals, the piece explores consciousness, reality, death, and the origins of the universe itself.

Process

Overview

‘beyond the veil’ was scored first and generated second. The audio drives every visual decision — which scenes begin where, which visuals pulse when, and which transitions carry the viewer from one shape to the next. A custom Python pipeline called beatlab analyzes the track through three layers of intelligence: digital signal processing, an audio-listening language model, and a creative-direction language model. The resulting plan directs a visual generation pipeline built on Google’s Nano Banana 2 (keyframes) and Veo 3.1 (transitions). 15,243 AI-generated candidates — 5,896 keyframes and 9,347 video transitions — were curated down to 3,538 finals in a custom web-based timeline editor called beatlab-synthesizer, where layers, blend modes, time-remap curves, and manually placed beat accents were composed into the final piece.

Music Composition

The source audio for ‘beyond the veil’ was itself produced through an AI-assisted workflow. Musicful was used to generate approximately two hours of raw musical material. This material was then imported into GarageBand, where samples were cut, sequenced, and composed into the final 35-minute track.

The compositional process started with listening. I listened through all two hours of generated material, taking notes at every interesting moment — timestamp, sample, and a short description of what made the moment compelling. Those notes, along with a description of what I wanted the final piece to achieve compositionally, were handed to an LLM. The LLM returned a full assembly order: clip labels with timestamps, laid out in the sequence they should appear in the final composition.

From there, the work in GarageBand was execution — following the LLM’s instructions, blending transitions between samples, and performing the manual audio engineering that gives the final track its cohesion. The resulting 35-minute composition becomes the input to every subsequent step in the pipeline, explained below.

Step 1. Track Isolation

The full-mix audio is decomposed into individual stems through a chain of three specialized source-separation models, each chosen for what it does best. MDX23C-InstVoc-HQ splits the mix into vocals and instrumental; it was chosen for roughly twice the vocal-bleed rejection of Demucs, because any leaked vocal energy elsewhere in the pipeline creates phantom triggers downstream. MDX23C-DrumSep, run on the instrumental output, separates kick, snare, toms, hi-hat, ride, and crash. Running drum separation on the full mix produces vocal artifacts in drum stems, so the order matters. Demucs htdemucs_6s, also run on the instrumental, extracts bass, guitar, piano, and other — its own vocals and drums outputs are discarded, and only the four melodic stems are kept.
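The ordering constraint described above can be sketched as a small dependency plan. This is an illustrative sketch, not beatlab's actual code; the model names come from the text, and `plan_separation` is a hypothetical helper:

```python
def plan_separation():
    """Return the three separation steps in dependency order.

    Order matters: drum and melodic separation both run on the
    *instrumental* output of the vocal split, never on the full mix,
    so vocal energy cannot leak into the drum or melodic stems.
    """
    return [
        # (model, input, kept outputs)
        ("MDX23C-InstVoc-HQ", "full_mix", ["vocals", "instrumental"]),
        ("MDX23C-DrumSep", "instrumental",
         ["kick", "snare", "toms", "hihat", "ride", "crash"]),
        # htdemucs_6s also emits vocals and drums; those are discarded
        ("Demucs htdemucs_6s", "instrumental",
         ["bass", "guitar", "piano", "other"]),
    ]
```

Encoding the chain as data rather than three ad-hoc calls makes the order-matters constraint checkable: everything downstream of the vocal split consumes `"instrumental"`.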

Step 2. Programmatic Audio Analysis

Each stem is analyzed with librosa, a Python digital signal processing library. Onset detection identifies the exact moment every sound begins, extracted per stem and per frequency band (low, mid, high), with strengths normalized percentile-wise so events are comparable across stems. RMS envelopes capture the loudness and energy curve of each stem over time, downsampled to roughly 20 points per second. Sustained region detection identifies continuous stretches where a stem is holding energy — pads, held chords, vocal notes. Spectral features — centroid, rolloff, and contrast — are extracted per stem.
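The percentile-wise normalization mentioned above can be sketched in a few lines of numpy. This is an assumed implementation, not beatlab's published code:

```python
import numpy as np

def normalize_onsets(strengths):
    """Map raw onset strengths to percentile ranks within one stem,
    so a 0.9 on the kick stem and a 0.9 on the hi-hat stem mean the
    same thing: 'in the top 10% of that stem's own onsets'."""
    strengths = np.asarray(strengths, dtype=float)
    ranks = strengths.argsort().argsort()       # 0..n-1 rank of each onset
    return ranks / max(len(strengths) - 1, 1)   # 0.0 = weakest, 1.0 = strongest
```

Normalizing per stem is what makes events comparable across stems: a quiet hi-hat's strongest hit and a loud kick's strongest hit both land at 1.0.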

Step 3. Musical Context (Gemini)

The audio is sent to Gemini 2.5 Flash in chunks of roughly 30 seconds, with the model prompted to act as a professional music producer analyzing the stem for the purpose of syncing visual effects to every musical event. Gemini returns seven structured sections per chunk.

The primary output is an event log. Every audible musical event is logged with approximately one-second precision, tagged by event type: kick, snare, hi-hat, cymbal crash, tom, bass note, bass drop, bass sustain start and end, synth stab, synth pad start and end, synth lead, arpeggio, riser start and peak, drop, breakdown start, buildup start, vocal start and end, vocal chop, FX sweep, FX impact, silence start and end. Repeating patterns can be described as intervals rather than every onset. Sustained sounds receive both start and end timestamps. Gemini’s timestamps are approximate — they are cross-referenced against the DSP onset data from Step 2, which is millisecond-accurate, to get precise timing. Gemini tells us what happens; DSP tells us exactly when.
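The cross-referencing step (Gemini's roughly one-second labels against millisecond-accurate DSP onsets) amounts to snapping each event to the nearest onset within a tolerance window. A minimal sketch, with the tolerance value assumed:

```python
import bisect

def snap_events(llm_events, dsp_onsets, tolerance=0.75):
    """Snap each approximately-timed LLM event to the nearest
    millisecond-accurate DSP onset, if one is close enough.

    llm_events: list of (time_sec, label) from the Gemini event log
    dsp_onsets: sorted list of precise onset times from librosa
    """
    snapped = []
    for t, label in llm_events:
        i = bisect.bisect_left(dsp_onsets, t)
        candidates = dsp_onsets[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda o: abs(o - t), default=None)
        if best is not None and abs(best - t) <= tolerance:
            snapped.append((best, label))   # DSP says exactly when
        else:
            snapped.append((t, label))      # keep Gemini's estimate
    return snapped
```

Events with no nearby onset (a described drop with no sharp transient, say) keep Gemini's timestamp rather than being forced onto the wrong hit.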

Alongside the event log, Gemini returns a rhythm analysis (BPM estimate, time signature, and per-instrument pattern description), an energy profile (intensity rated one to ten at the 0%, 25%, 50%, 75%, and 100% checkpoints of the chunk, plus any sudden energy changes), a catalog of sustained sounds (every pad, drone, held chord, reverb tail, riser, and sustained bass with character and duration), a list of key moments (the three to five most visually impactful moments in the chunk, with reasoning), and an inventory of every instrument heard.

The seventh section is mood and texture — a qualitative description of the section’s emotional character and production feel: mood (for example, “serene, introspective, and slightly melancholic, yet imbued with warmth and intimacy”), emotional sensation, and production texture (for example, “soft, spacious, and atmospheric, driven by sustained, rich harmonies and the airy, reverberated quality of the lead vocal”).

Together these seven sections give the pipeline both a ground-truth inventory of what happens in the music and a qualitative read of how it feels. By the time the creative-direction step sees this data, the pipeline doesn’t just know that there’s “a drop at 2:30” — it knows there’s a sustained E bass entering near 2:30 with roughly two seconds of sustain, a crash cymbal right after, and a sweeping riser building in from earlier, and it knows the section is meant to feel triumphant, weightless, or oppressive depending on what the track is doing.

Step 4. LLM Creative Direction (Claude)

The DSP onsets and envelopes from Step 2 and Gemini’s full seven-section analysis from Step 3 are combined and handed to Claude Sonnet with three additional qualitative inputs.

The first is an effect catalog with written guidance. Each effect comes with prescriptive direction about when to use it. zoom_pulse is the workhorse — gentle zoom in/out for melodic hits, bass notes, and rhythmic elements. zoom_bounce is reserved for bass drops and heavy kicks. shake_x and shake_y are for percussive impacts — horizontal for snares, vertical for kicks and sub-bass. contrast_pop is for synth stabs and melodic accents. glow_swell is for sustained pads, ambient textures, and vocal sections. The catalog teaches Claude not just what each effect looks like, but when it should or shouldn’t be used.

The second is a set of per-effect sensitivity settings from 0.0 to 1.0: creative direction delivered as a dial. High sensitivity on zoom_pulse means “trigger on nearly every relevant onset at high intensity”; low sensitivity on zoom_bounce means “only on the most dramatic moments.” Each level comes with human-readable guidance (for example, 0.95 and above reads as “MAXIMUM — overwhelming, relentless, nauseating visual intensity”).

The third is an optional creative prompt describing the vision for the track. For this piece, the prompt was: “journey through death to another dimension.”

Gemini’s mood and texture outputs do particular work here. They shape how Claude interprets the catalog: a melancholic, introspective section dials aggressive effects back even when the sensitivity settings are high, while a triumphant, weightless section gets layered stacks on every bass hit.

Crucially, Claude does not return a list of effect events. It returns a compact set of rules (24 rules were generated for this piece), each specifying: when a DSP onset matches this stem, this frequency band, and this strength range, apply this effect with these parameters. Rules can layer (stack multiple effects above a configurable strength threshold) and extend sustain (stretch effect duration to match detected sustained regions).

The rules-based design is deliberate for two reasons. The first is practical: a 35-minute track contains tens of thousands of onsets, and Claude’s output token limit makes returning per-event JSON infeasible — a couple dozen compact rules fit comfortably in a single response, while the equivalent event list would overflow many times over. The second is compositional: if Claude returned individual events, the 100th repetition of a kick pattern might get forgotten, or effects might cluster unevenly. Rules are applied programmatically to every matching onset — the 1st kick and the 500th are treated identically. This is how the visuals stay locked to the music across the full 35 minutes without drift. For ‘beyond the veil,’ the 24 rules expanded into 26,144 individual effect events on the final timeline.
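The rule format described above might look like the following sketch. The field names and schema are assumptions for illustration, not beatlab's actual data model:

```python
from dataclasses import dataclass

@dataclass
class EffectRule:
    stem: str            # e.g. "kick"
    band: str            # "low", "mid", or "high"
    min_strength: float  # normalized onset-strength threshold
    effect: str          # e.g. "zoom_bounce"
    intensity: float

def expand_rules(rules, onsets):
    """Apply every rule to every matching onset.

    onsets: dicts like {"time": 62.31, "stem": "kick",
            "band": "low", "strength": 0.92}.
    The 1st kick and the 500th are treated identically, which is
    what keeps the visuals locked to the music without drift.
    """
    events = []
    for o in onsets:
        for r in rules:
            if (o["stem"] == r.stem and o["band"] == r.band
                    and o["strength"] >= r.min_strength):
                events.append({"time": o["time"], "effect": r.effect,
                               "intensity": r.intensity * o["strength"]})
    return events
```

A couple dozen rules of this shape fit easily in one LLM response, yet expand programmatically into tens of thousands of events.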

Step 5. Effect Application

Claude’s rules are materialized into per-onset effect events, filtered through two more layers, and rendered to frames.

The first filter is automatic bleed suppression. Every non-vocal onset is checked against the vocal stem’s RMS envelope at that moment. If the stem’s RMS is less than 25% of the concurrent vocal RMS, the onset is suppressed as leakage. Without this filter, every vocal consonant would produce phantom kicks, snare cracks, and glow swells on instruments that aren’t actually playing.
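The bleed check can be sketched as follows. The 25% ratio is from the text; the function shape and envelope layout are assumptions:

```python
import numpy as np

def suppress_bleed(onsets, stem_rms, vocal_rms, rms_times, ratio=0.25):
    """Drop non-vocal onsets that are probably vocal leakage.

    An onset survives only if its stem's RMS at that moment is at
    least `ratio` (25%) of the concurrent vocal RMS.
    """
    kept = []
    for o in onsets:
        i = int(np.searchsorted(rms_times, o["time"]))
        i = min(i, len(rms_times) - 1)
        if stem_rms[i] >= ratio * vocal_rms[i]:
            kept.append(o)
    return kept
```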

The second layer is human curation. A custom timeline editor lets me place manual hit markers directly on the track at moments the automatic onset detector missed or underweighted. Across this piece, 152 hit markers were placed, and each becomes a guaranteed visual accent in the final render. The same editor is used to draw suppression zones to mute particular effects during particular moments.

The resulting curated effect map is applied to video frames by OpenCV — an open-source image and video processing library — in a single pass. Every pulse, zoom, shake, contrast shift, and glow is frame-accurate to the audio. Custom strobe and hue-shift effects were built specifically for this piece.
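Rendering one effect event to frames amounts to sampling an intensity envelope at exact frame timestamps, which is what makes the result frame-accurate. A minimal sketch with an assumed instant-attack, linear-decay shape (the piece's actual curves are not specified):

```python
def effect_envelope(event_time, duration, fps, n_frames):
    """Per-frame intensity for one effect event: full intensity at
    the onset, linear decay over `duration` seconds, zero elsewhere.
    Sampled at exact frame times, so the peak lands on the onset's
    frame regardless of frame rate."""
    env = [0.0] * n_frames
    for f in range(n_frames):
        dt = f / fps - event_time
        if 0.0 <= dt <= duration:
            env[f] = 1.0 - dt / duration
    return env
```

In the real pipeline, per-frame values like these would drive the OpenCV transforms (zoom, shake, contrast, glow) in a single pass over the video.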

Visual Generation

Keyframes were generated exclusively with Nano Banana 2 (Google’s image model). Transitions between keyframes were generated exclusively with Veo 3.1 (Google’s video model). Across the 35-minute piece, 5,896 keyframe candidates were generated and curated down to 1,836 finals; 9,347 transition candidates were generated and curated down to 1,702 finals.

The video transition prompts themselves were mostly generated by an LLM, though a smaller number were hand-written by me for specific moments where I had a particular intent. For the generated ones, a bespoke prompt was synthesized for each transition by reading the two keyframes it was bridging, the musical description for the window (mood, events, instruments, and energy from Gemini’s analysis), and the visual content of both images. The LLM then produced a transition description telling Veo how the scene should unfold — which aspects of the outgoing image should morph into which aspects of the incoming one, what motion should accompany the musical events in that window, and what emotional register the transformation should land in. This is what allowed the piece to have more than 1,500 dynamic, contextually-aware transitions — each attuned to the specific music and imagery it bridged — without having to mentally model every transition by hand, and without falling back on a generic transition prompt that would have produced uniform, uninteresting motion.

Timeline Assembly & Compositing

Individual clips were assembled on a custom-built, web-based timeline editor (beatlab-synthesizer). For each clip, a time-remap curve was authored to sync significant moments of motion to specific sounds in the track — an explosion landing on a kick, a wisp of energy spiraling on a synth stab, or a ray of light piercing through on a bass drop.
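A time-remap curve is a monotonic mapping from timeline time to source-clip time; sampled per frame, it can be sketched with `np.interp`. The control-point representation is an assumption; the editor's actual curve model may differ:

```python
import numpy as np

def remap_time(timeline_t, keys):
    """keys: (timeline_time, source_time) control points, sorted by
    timeline_time. Between points the clip speeds up or slows down,
    so a chosen source moment (an explosion, a flash of light) lands
    exactly on a chosen beat."""
    tl, src = zip(*keys)
    return float(np.interp(timeline_t, tl, src))
```

For example, keys `[(0, 0), (1, 2), (2, 2.5)]` play two seconds of source in the first timeline second (fast), then half a second in the next (slow), pinning whatever happens at source time 2.0 to timeline beat 1.0.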

A custom layer compositor was built for this project: blend modes (multiply, screen, overlay, difference, add, normal), opacity curves, and chroma keying, all implemented in Python with numpy and OpenCV to mirror the editor’s WebGL compositor exactly. Frame interpolation between clips is also custom: crossfades, time-remapped motion, and transition blending are computed frame-by-frame during compositing in OpenCV.
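The listed blend modes have standard formulas, which is what lets a numpy implementation mirror a WebGL one exactly. A sketch of three of them on float images in [0, 1] (overlay, difference, and chroma keying omitted):

```python
import numpy as np

def blend(base, layer, mode, opacity=1.0):
    """base, layer: float arrays in [0, 1]. Standard blend-mode math."""
    if mode == "multiply":
        out = base * layer
    elif mode == "screen":
        out = 1.0 - (1.0 - base) * (1.0 - layer)
    elif mode == "add":
        out = np.clip(base + layer, 0.0, 1.0)
    else:  # "normal"
        out = layer
    return base + (out - base) * opacity   # opacity mixes toward the result
```

Because each mode is pure per-pixel arithmetic, the same formulas can run as a WebGL fragment shader in the browser editor and as vectorized numpy in the final render, producing identical pixels.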

Some sections lean on AI-generated transitions to carry the scene change; others are built from stacked layers of independently generated clips, with black channels chroma-keyed out so the layers composite cleanly. In these stacked-layer sections, each composited layer corresponds to a distinct audio pattern — the visual becomes a spatial translation of the music.

The Final Composition

‘beyond the veil’ could not have been hand-animated. The final composition contains 26,144 audio-synced visual effect events across 1,702 unique scene transitions and 1,836 hand-curated keyframes. Any single transition in the piece — a figure dissolving into a landscape, a landscape reshaping into a cosmos — could easily represent a week of traditional animation and compositing work on its own. Multiplied across the scope of the piece, a team of 100 traditional animators working for a year still likely would not complete it.

It also could not have been made by AI alone — the film is shaped, frame by frame, by a human in the loop: curating candidates out of 15,243 generated options, guiding and hand-authoring prompts for specific moments, creating time-remap curves to sync visual moments to musical moments, and composing layers of chroma-keyed video into coherent imagery. Every layer of the pipeline above exists in service of a single division of labor: the machines generate a vast space of possibility, and I choose the final film from inside it.

The result is imagery that is not decorated onto music but grown out of it — a 35-minute moving painting in which every change on screen is something the track actually did.