The tennis video, as others have commented, is good, but there is a noticeable delay between the action and the sound. And as for the "loving couple holding AI hands and then dancing" one, well, the input is already cringe enough.
For all these diffusion models, it looks like we're 90% of the way there; now we just need the final 90%.
- The woman playing what I think was an Erhu[1] seemed to be imitating traditional music played by that instrument, but really badly (it sounded much more like a human voice than the actual instrument does). Also, I'm not even sure if it was able to tell which instrument it was, or if it was picking up on other cues from the video (which could be problematic, e.g. if it profiles people based on their race and attire)
- Most of the sound was noticeably delayed relative to the visual cues. Not sure why
- The nature sounds were pretty muddy
- (I realize this is from video to music, but) the video with pumping upbeat music set to the text "Maddox White witnessed his father getting butchered by the Capo of the Italian mob" was almost comically out of touch with the source
Nevertheless, it's an interesting demo and highlights more applications for AI which I'm expecting we'll see massive improvements in over the next few years! So despite the shortcomings I agree it's still quite impressive.
It's usually "generate a few, one of them is not terrible, none are exactly what I wanted" then modify the prompt, wait an hour or so ...
The workflow reminds me of programming 30 years ago - you did something, waited for the compile, saw if it worked, tried something else...
All you've got are a few crude tools and a bit of grit and patience.
On the i2v tools I've found that if I modify the input to make the contrast sharper, the shapes more discrete, and the objects easier to segment, then I get better results. I wonder if there are hacks like that here.
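Something like this is what I mean by pre-conditioning the input. A rough sketch assuming a PIL-based pipeline; the enhancement factors are arbitrary and would need tuning per model:

    # Rough sketch of the input "pre-conditioning" hack described above,
    # assuming a PIL-based pipeline; enhancement factors are arbitrary examples.
    from PIL import Image, ImageEnhance

    def precondition(in_path: str, out_path: str) -> None:
        img = Image.open(in_path).convert("RGB")
        img = ImageEnhance.Contrast(img).enhance(1.5)   # push contrast up
        img = ImageEnhance.Sharpness(img).enhance(2.0)  # make shapes/edges more discrete
        img = ImageEnhance.Color(img).enhance(1.2)      # stronger color separation between objects
        img.save(out_path)

    precondition("input_frame.png", "input_frame_preconditioned.png")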
Well sure... if your compiler was the equivalent of the Infinite Improbability Drive.
I assume you're referring to the classic positive/negative prompts that you had to attach to older SD 1.5 workflows. From the examples in the repo as well as the paper, it seems like AudioX was trained to accept relatively natural English using Qwen2.
These were both released this month.
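For the curious, a rough sketch of the difference in prompting style, assuming the diffusers library for the SD 1.5 side; the model id and prompts are illustrative, and the AudioX prompt is only shown as text since its API isn't reproduced here:

    # Contrast between classic SD 1.5 positive/negative prompting and the kind of
    # natural-language prompt the AudioX examples use.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Old-style keyword soup plus a negative prompt:
    image = pipe(
        prompt="masterpiece, best quality, tennis match, dramatic lighting, 8k",
        negative_prompt="blurry, lowres, bad anatomy, extra limbs, watermark",
    ).images[0]

    # Natural-language prompt of the sort the AudioX examples show:
    audiox_prompt = "The sharp pop of a tennis ball being hit, with light crowd murmur in the background."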
What I'd like to see is some kind of i2i with multiple image inputs and guidance.
So I could roughly sketch something and then give it some kind of destination. I don't mean ControlNet or anything where I'm dealing with complex 3D characters, and I don't mean the crude stuff that inpainting gives... none of these things are what I'm talking about.
I'm familiar with the ComfyUI workflows and stay pretty on top of things. I've used the Krita and Photoshop plugins and have even built a Civitai MCP server for bringing in models (rough sketch of the idea below); AFAIK nobody else has done this yet.
None of these are hands on in the right way.
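For reference, a minimal sketch of what a Civitai MCP tool like the one mentioned above could look like, assuming the official Python MCP SDK's FastMCP helper and Civitai's public /api/v1/models endpoint; the names and returned fields are illustrative, not the actual server:

    # Minimal sketch of a Civitai MCP server: one tool that searches Civitai's
    # public model catalog and returns basic metadata. Illustrative only.
    import requests
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("civitai-models")

    @mcp.tool()
    def search_models(query: str, limit: int = 5) -> list[dict]:
        """Search Civitai for models matching a query."""
        resp = requests.get(
            "https://civitai.com/api/v1/models",
            params={"query": query, "limit": limit},
            timeout=30,
        )
        resp.raise_for_status()
        return [
            {"id": m["id"], "name": m["name"], "type": m["type"]}
            for m in resp.json().get("items", [])
        ]

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default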
If I really think about it, it just feels weird to me that fabricated metadata is supposed to be enough to yield art. Metadata by definition doesn't contain the data itself; it's management data deliberately made to be as disconnected from the data as possible. The connections left between the two are basically unsafe side effects.
I wish OpenAI and its followers would quit setting bridges ablaze left and right, though I know that's a tall order.
The song is https://www.youtube.com/watch?v=6K2U6SuVk5s
Like this: (Created by noted thespian Gymnos Henson)
https://specularrealms.com/wp-content/uploads/2024/11/Gorgon...
It's like the markings on the backs of tigers' heads that simulate eyes to keep predators from attacking. I'm sure there used to be something tigers benefited from having this defense against, enough for it to survive being encoded into their DNA, right?
So, what was it that encoded this fear response into us?
If enough predictive models are broken, people feel like they've gone crazy; various drugs and experiments demonstrate a lot of these effects.
The interesting thing about the uncanny valley is that the stimuli sit right on a threshold, and humans are really good at picking up tiny violations of those expectations, which translates to unease or fear.
- Tennis clip => the ball sound is badly out of sync with the hit
- Dark-mood beach video with no one on screen => very upbeat audio mood, lots of laughter, as if it were a busy summer beach
- Music inpainting completely switches the style of the audio (e.g. on the siren)
- "Electronic music with some buildup": the generation just turns the volume up?
I guess we still have some road to cover, but it feels like early image generation with its out-of-touch hands and visual features. At least the generations aren't nonsensical at all.