The latest Seedance 2 model is incredibly powerful: you can feed it a reference image along with detailed descriptions of beat timings and dance moves, and it generates high-quality shots with a director’s sense of framing. I hardly had to do any rerolls, which is impressive given the length of the song.
Each generated segment can be up to 15 seconds long, but I made a silly mistake! It turns out the "full reference" feature supports all media formats: I could have input the music along with the visuals and generated the lip-sync in one go… Instead, I overcomplicated things and had to sync the lip movements manually afterward. Still, I’m pretty happy with how it turned out.
To clarify, I didn’t use any real human dance footage as reference for this video; everything was generated and then edited together. Each segment of my video is based on a prompt that generally includes the following elements (a sketch of one such prompt follows the list):

1. Overall atmosphere description
2. Key actions
3. Scene description: starting pose, mid-sequence body/hand movements over time, and ending pose
4. Dialogue/lyrics/sound effects at specific timestamps
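Purely for illustration (the scene, moves, and timestamps below are invented, not taken from the actual project), a prompt following that structure might look like:

```
Atmosphere: neon-lit rooftop at night, playful and energetic, shallow depth of field.
Key actions: a sharp four-count arm wave landing on the chorus, ending in a freeze.
Scene: starts facing away from camera in a relaxed stance; around the 2 s mark,
spins to face the camera and begins the arm wave, hands tracing a figure-eight;
ends frozen in a low pose with one hand raised.
Audio: at 0:03.5 the chorus lyric lands; a finger-snap sound effect on the final freeze.
```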
Seedance 2 automatically designs camera angles based on the content, though you can also specify camera movements precisely. In the raw clip below, I didn’t describe any camera angles at all. After generating the clips, I edited them: adding lip-sync, syncing them with the music, and adjusting the speed of some segments to match the beat.
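The speed adjustment is just arithmetic. Here’s a minimal sketch in Python, assuming you know the track’s BPM and how many beats a clip should span (the numbers are made up; the resulting factor works with an editor’s speed control or ffmpeg’s setpts filter):

```python
# Minimal sketch of the beat-matching arithmetic. Assumes a known BPM
# and a target beat count per clip; all numbers below are illustrative.

def speed_factor(clip_seconds: float, bpm: float, beats_to_fill: int) -> float:
    """Factor to play a clip faster (>1) or slower (<1) so it spans
    an exact number of beats."""
    beat_interval = 60.0 / bpm                  # seconds per beat
    target_seconds = beats_to_fill * beat_interval
    return clip_seconds / target_seconds

# e.g. an 8.2 s clip that should cover exactly 16 beats of a 120 BPM track:
factor = speed_factor(8.2, 120.0, 16)   # 8.2 / 8.0 = 1.025 → speed up ~2.5%
print(f"setpts=PTS/{factor:.3f}")       # usable as an ffmpeg video filter
```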
This was a force-of-habit mistake I made while working on this video. Initially, I followed the traditional video-model workflow: first generate reference images, then describe the actions, and so on. However, Seedance supports up to 9 images, 3 video clips, and 3 audio clips as reference material simultaneously for each generated segment.
This multimodal reference capability is quite rare among current AI video tools. In theory, I could have directly provided the model with edited music or voice clips along with reference images for generation. But for this project, I generated the clips first and then re-generated them to add lip-sync.
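To visualize that reference budget (purely a hypothetical sketch; these are not Seedance 2’s real API or parameter names):

```python
# Hypothetical sketch only: NOT Seedance 2's real API or field names,
# just a way to picture the per-segment reference budget described above
# (up to 9 images, 3 video clips, and 3 audio clips alongside the prompt).
from dataclasses import dataclass, field

@dataclass
class SegmentRequest:
    prompt: str
    images: list[str] = field(default_factory=list)  # up to 9 reference images
    videos: list[str] = field(default_factory=list)  # up to 3 reference video clips
    audio: list[str] = field(default_factory=list)   # up to 3 reference audio clips

    def validate(self) -> None:
        assert len(self.images) <= 9, "at most 9 reference images"
        assert len(self.videos) <= 3, "at most 3 reference video clips"
        assert len(self.audio) <= 3, "at most 3 reference audio clips"

# The one-shot approach described above: visuals plus the edited music cut,
# so lip-sync could come out of the same generation.
req = SegmentRequest(
    prompt="chorus segment: beat-timed arm wave, ending freeze",
    images=["character_ref.png", "outfit_ref.png"],
    audio=["chorus_cut.mp3"],
)
req.validate()
```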