I've wanted to automate this since 2019, after first hearing about the idea on Accidental Tech Podcast. I figured I'd write it in Kotlin (my language of choice), but JVM audio processing wasn't there yet (or, more fairly, it would have taken way more work than I realized).
With AI, of course, I took another shot at it recently and finally built it in Rust.
"PodSync" takes a master track and individual participant tracks, finds the time offset for each using VAD (voice activity detection), MFCC fingerprinting, and cross-correlation, then outputs aligned WAV files. Drop them into your DAW at 0:00 and they line up!
There's an accompanying blog post with a visual on the mechanics: https://kau.sh/blog/podsync/
Would love to hear feedback!