Improving performance of rav1d video decoder
305 points | 13 days ago | 17 comments | ohadravid.github.io
mmastrac
13 days ago
[-]
The associated issue for comparing two u16s is interesting.

https://github.com/rust-lang/rust/issues/140167
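For context, the pattern in question looks roughly like this (a hypothetical minimal sketch, not the actual rav1d code; safe bit-packing stands in for the transmute workaround discussed in the issue):

```rust
// A pair of u16s; deriving PartialEq compares field by field,
// which can lower to two 16-bit loads and two branches.
#[derive(PartialEq, Clone, Copy)]
struct Pair {
    a: u16,
    b: u16,
}

// Packing both fields into one u32 lets the whole comparison become a
// single 32-bit compare.
fn eq_packed(x: Pair, y: Pair) -> bool {
    let xw = (x.a as u32) << 16 | x.b as u32;
    let yw = (y.a as u32) << 16 | y.b as u32;
    xw == yw
}

fn main() {
    let p = Pair { a: 1, b: 2 };
    let q = Pair { a: 1, b: 2 };
    let r = Pair { a: 1, b: 3 };
    assert_eq!(p == q, eq_packed(p, q));
    assert_eq!(p == r, eq_packed(p, r));
    println!("ok");
}
```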

reply
ack_complete
13 days ago
[-]
I'm surprised there's no mention of store forwarding in that discussion. The -O3 codegen is bonkers, but the -O2 output is reasonable. In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure that would negate the benefit of merging the loads. In a non-inlined, non-PGO scenario the compiler doesn't have enough information to tell whether the optimization is suitable.
reply
mshockwave
13 days ago
[-]
> In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure

It actually depends on the uArch, Apple silicon doesn't seem to have this restriction: https://news.ycombinator.com/item?id=43888005

> In a non-inlined, non-PGO scenario the compiler doesn't have enough information to tell whether the optimization is suitable.

I guess you're talking about stores and load across function boundaries?

Trivia: X86 LLVM creates a whole Pass just to prevent this partial-store-to-load issue on Intel CPUs: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

reply
Dylan16807
13 days ago
[-]
> In the case where one of the structs has just been computed, attempting to load it as a single 32-bit load can result in a store forwarding failure that would negate the benefit of merging the loads

Would that failure be significantly worse than separate loading?

Just negating the optimization wouldn't be much reason against doing it. A single load is simpler and in the general case faster.

reply
ack_complete
11 days ago
[-]
Usually, yeah, it's noticeably worse than using individual loads and stores as it adds around a dozen cycles of latency. This is usually enough for the load to light up hot in a sampling profile. It's possible for that extra latency to be hidden, but then in that case the extra loads/stores wouldn't be an issue either.
reply
heybales
13 days ago
[-]
The thing I like most about this is that the discussion isn't just 14 pages of "I'm having this issue as well" and "Any updates on when this will be fixed?" As a web dev, GitHub issues kinda suck.
reply
eterm
13 days ago
[-]
It was worse before emoji reactions were added and 90% of messages were literally just "+1"
reply
heybales
13 days ago
[-]
+1
reply
NoMoreNicksLeft
12 days ago
[-]
Wonder if it's a poor interface issue... if people could just click a button that says "me too" that didn't add a full comment, but rather some minimal notation at the bottom of the comment indicating their username: 1) would people use it, and 2) would that be unobtrusive enough to not be annoying? It could even mute notifications for the me-toos.
reply
IshKebab
5 days ago
[-]
This seems like an area where LLMs would actually be extremely useful. You can manually mark comments as irrelevant. Why can't GitHub use AI to do it automatically? Or to highlight the "resolution" comment automatically? On very big issues it can take a non-trivial amount of time just to find out what the outcome was.
reply
rhdjsjebshjffn
13 days ago
[-]
This just seems to illustrate the complexity of compiler authorship. I'm not sure C compilers are able to address this issue any better in the general case.
reply
runevault
13 days ago
[-]
Keep in mind Rust is using the same backend as one of the main C compilers, LLVM. So if it is handling it any better that means the Clang developers handle it before it even reaches the shared LLVM backend. Well, or there is something about the way Clang structures the code that catches a pattern in the backend the Rust developers do not know about.
reply
rhdjsjebshjffn
13 days ago
[-]
I mean yeah, I just view Rust as the quality-oriented spearhead of western development.

Rust is absolutely an improvement over C in every way.

reply
vlovich123
13 days ago
[-]
The Rust issue has people trying this with C code, and the compiler runs into the same problem. This will get fixed, and it'll help both C and Rust code.
reply
runevault
13 days ago
[-]
Out of curiosity just clang or gcc as well?
reply
josephg
12 days ago
[-]
I just tried it, and the problem is even worse in gcc.

Given this C code:

    #include <stdint.h>

    typedef struct { uint16_t a, b; } pair;

    int eq_copy(pair a, pair b) {
        return a.a == b.a && a.b == b.b;
    }
    int eq_ref(pair *a, pair *b) {
        return a->a == b->a && a->b == b->b;
    }
Clang generates clean code for the eq_copy variant, but complex code for the eq_ref variant. GCC emits pretty complex code for both variants.

For example, here's eq_ref from gcc -O2:

    eq_ref:
        movzx   edx, WORD PTR [rsi]
        xor     eax, eax
        cmp     WORD PTR [rdi], dx
        je      .L9
        ret
    .L9:
        movzx   eax, WORD PTR [rsi+2]
        cmp     WORD PTR [rdi+2], ax
        sete    al
        movzx   eax, al
        ret
Have a play around: https://c.godbolt.org/z/79Eaa3jYf
reply
renewiltord
13 days ago
[-]
Oh this stuff is what’s prompting the ffmpeg Twitter account to make a stand against Rust https://x.com/ffmpeg/status/1924137645988356437?s=46
reply
ZeroGravitas
12 days ago
[-]
I generally trust rbultje to benchmark correctly, but the rav1d tracking ticket has multithreaded numbers across multiple platforms that don't show that big a difference.

https://github.com/memorysafety/rav1d/issues/1294

Is that explained in replies? I only see the original tweet as I'm not logged in.

reply
ycomb_anon
11 days ago
[-]
A dav1d contributor reported that rav1d was lagging:

https://code.videolan.org/videolan/dav1d/-/merge_requests/17...

reply
renewiltord
12 days ago
[-]
No. The replies are just language war stuff.
reply
mmastrac
13 days ago
[-]
Reading the ffmpeg twitter account is enough to turn me off using ffmpeg. It's a shame there's no real alternative -- the devs seem very toxic.

I mean sure, max performance is great if you control every part of your pipeline, but if you're accepting untrusted data from users-at-large ffmpeg has at least a half-dozen remotely exploitable CVEs a year. Better make sure your sandbox is tight.

https://ffmpeg.org/security.html

I feel like there's a middle ground where everyone works towards a secure and fast solution, rather than whatever position they've staked out here.

reply
saagarjha
12 days ago
[-]
Yeah, it used to be funny the first few times, then they fell into the trap of having a Twitter "personality" and now it's just annoying
reply
renewiltord
12 days ago
[-]
This is so true. They got a following and like many who suddenly get some sort of niche fame, they reoriented to serve the audience and it hasn't improved anything. The greatest damage that popularity does to many is that they lose themselves in the desire to hold on to it.
reply
izacus
13 days ago
[-]
I've worked with ffmpeg for literally a decade and I've never found them particularly toxic.

What I have found is that they (as many others who do great work) have very little tolerance for random junior language fanboys criticizing their decades of work without even understanding what they're talking about, and constantly throwing out silly rewrite ideas.

reply
mmastrac
13 days ago
[-]
I'm not saying that they don't do great work, but that twitter thread (https://x.com/ffmpeg/status/1924137645988356437) is pretty obnoxious and reads like they are upset they didn't get funding. It's entirely possible that they are just difficult to work with and funders _don't_ want to fund them.

"Because substantial amounts of human and financial resources go into these rust ports that are inferior to the originals. Orders of magnitude more resources than the originals which remain extremely understaffed/underfunded." -- https://x.com/FFmpeg/status/1924149949949775980

"... And we get this instead: <xz backdoor subtweet>" -- https://x.com/FFmpeg/status/1924153020352225790

"They [rust ports] are superior in the same way Esperanto is also superior to English." -- https://x.com/FFmpeg/status/1924154854051557494

It's kind of sad to see that snarky attitude. Clearly the corporate sponsors _want_ a more secure decoder. Maybe they should try and work _with_ the system instead of wasting energy on sarcasm on Twitter?

reply
hitekker
12 days ago
[-]
You’re right; this happens a lot.

The SQLite folks, half of Linux, and other maintainers have encountered the same kind of zealotry. Dealing with language supremacism is annoying, and I don't blame ffmpeg for venting.

In fact, I'd even say that twitter thread is informative, because it demonstrates how big tech funds its own pet projects over the actual maintainers.

reply
oguz-ismail
13 days ago
[-]
>Reading the ffmpeg twitter account is enough to turn me off using ffmpeg.

What's the alternative?

reply
mmastrac
12 days ago
[-]
There is not much, unless you're working with AV1. rav1d is the alternative there but you've got to trade off some performance for security gains.

ffmpeg is a monopoly in the space which means that you either take the exact set of tradeoffs they offer, or... well, you have no alternatives, so take it.

Of course the alternatives are never going to be as good as the originals until they've had more effort put into them. It took _years_ until the Rust gzip/zip libraries surpassed the C ones while being more secure overall.

reply
throwaway94487
13 days ago
[-]
How many of those "remotely exploitable CVEs" have actually been exploited in the wild? Quite a few are denial-of-service and memory leak CVEs too, which Rust doesn't consider to be unsafe.
reply
saagarjha
12 days ago
[-]
More than enough are exploitable for this to be a problem.
reply
tialaramex
13 days ago
[-]
The healthier response might have been to work on speeding up dav1d. If you refine the Olympic record metrics and force them to retrospectively update previous records, so that Bolt's 100m sprint record is revised to 9.64s rather than 9.63s, nobody cares (man, get a life); but if you can run an actual nine-second 100 metre sprint, people care about that.†

† If you're a human. If you're an ostrich this is not impressive, but on the whole ostriches aren't competing in the Olympic 100 metre sprint.

reply
nemothekid
13 days ago
[-]
Interesting to see this article on the performance advantage of not having to zero buffers, after this article from 2 days ago: https://news.ycombinator.com/item?id=44032680
reply
brookst
13 days ago
[-]
Title undersells post; it’s actually 2.3% faster with two good optimizations.
reply
ohr
13 days ago
[-]
I think that since the 1.5% one is only for aarch64, it's a bit unfair to claim the full number; more like half, if you consider arm/x86 to be the majority of the (future) deployments.
reply
brookst
13 days ago
[-]
I suppose that’s fair, but I’d give credit for a 2.3% improvement in the test environment. For all we know it may be a net loss in other environments due to quirks (probably not, admittedly).
reply
robertknight
13 days ago
[-]
Good post! The inefficient code for comparing pairs of 16-bit integers was an interesting find.
reply
ohr
13 days ago
[-]
Thanks! Would be interesting to see if the Rust/LLVM folks can get the compiler to apply this optimization whenever possible, as Rust can be much more precise w.r.t. memory initialization.
reply
adgjlsfhk1
13 days ago
[-]
I think Rust may be able to get it by adding a `freeze` intrinsic to the codegen here. That would force LLVM to pick a deterministic value if there was poison, and should thus unblock the optimization (which is fine here because we know the value isn't poison).
reply
kukkamario
13 days ago
[-]
I think in this case the Rust and C code aren't equivalent, which may have caused this slowdown. The union trick also affects alignment: the C-side struct is 32-bit aligned, but the Rust struct only has 16-bit alignment, because it only contains fields with 16-bit alignment. In practice the fields are likely correctly aligned to 32 bits anyway, but compiler optimizations may have a hard time verifying that.

Have you tried manually defining alignment of Rust struct?
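For reference, raising the alignment on the Rust side is a one-attribute change (a minimal sketch with hypothetical names, not the rav1d definitions):

```rust
use std::mem::{align_of, size_of};

// Default repr: the struct's alignment is that of its widest field, i.e. 2.
#[allow(dead_code)]
struct PairDefault {
    a: u16,
    b: u16,
}

// Raising the alignment to 4 matches a C struct that shares a union
// with a 32-bit member; size stays 4, so the layout is otherwise identical.
#[allow(dead_code)]
#[repr(C, align(4))]
struct PairAligned {
    a: u16,
    b: u16,
}

fn main() {
    assert_eq!(align_of::<PairDefault>(), 2);
    assert_eq!(size_of::<PairDefault>(), 4);
    assert_eq!(align_of::<PairAligned>(), 4);
    assert_eq!(size_of::<PairAligned>(), 4);
    println!("ok");
}
```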

reply
Ygg2
13 days ago
[-]
Would be great, but I wouldn't hold my breath for it. LLVM and rustc can both be kinda slow to stabilize things.
reply
pornel
13 days ago
[-]
It varies. New public APIs or language features may take a long time, but changes to internals and missed optimizations can be fixed in days or weeks, in both LLVM and Rust.
reply
tialaramex
13 days ago
[-]
All being equal codecs ought to be in WUFFS† rather than Rust, but I can well imagine that it's a much bigger lift to take something as complicated as dav1d and write the analogous WUFFS than to clean up the c2rust translation, if you said a thousand times harder I'd have no trouble believing that. I just think it's worth it for us as a civilisation.

† Or an equivalent special purpose language, but WUFFS is right there

reply
IgorPartola
13 days ago
[-]
WUFFS would be great for parsing container files (Matroska, webm, mp4) but it does not seem at all suitable for a video decoder. Without dynamic memory allocation it would be challenging to deal with dynamic data. Video codecs are not simply parsing a file to get the data, they require quite a bit of very dynamic state to be managed.
reply
lubesGordi
13 days ago
[-]
Requiring dynamic state seems not obvious to me. At the end of the day you have a fixed number of pixels on the screen. If every single pixel changes from frame to frame that should constitute the most work your codec has to do, no? I'm not a codec writer but that's my intuition based on the assumption that codecs are basically designed to minimize the amount of 'work' being done from frame to frame.
reply
IgorPartola
13 days ago
[-]
If you are doing something like a GIF or an MJPEG, sure. If you are doing forwards and backwards keyframes with a variable amount of deltas in between, with motion estimation, with grain generation, you start having a very dynamic amount of state. Granted, encoders are more complex than decoders in some of this. But still you might need to decode between 1 and N frames to get the frame you want, and you don't know how much memory it will consume once it is decoded unless you decode it into bitmaps (at 4k that would be over 8MB per frame which very quickly runs out of memory for you if you want any sort of frame buffer present).

I suspect the future of video compression will also include frame generation, like what is currently being done for video games. Essentially you have let's say 12 fps video but your video card can fill in the intermediate frames via what is basically generative AI so you get 120 fps output with smooth motion. I imagine that will never be something that WUFFS is best suited for.

reply
derf_
13 days ago
[-]
> But still you might need to decode between 1 and N frames to get the frame you want, and you don't know how much memory it will consume...

All of these things are bounded for actual codecs. AV1 allows storing at most 8 reference frames. The sequence header will specify a maximum allowable resolution for any frame. The number of motion vectors is fixed once you know the resolution. Film grain requires only a single additional buffer. There are "levels" specified which ensure interoperability at common operating points (e.g., 4k) without even relying on the sequence header (you just reject sequences that fall outside the limits). Those are mostly intended for hardware, but there is no reason a software decoder could not take advantage of them. As long as codecs are designed to be implemented in hardware, this will be possible.
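As a toy illustration of how those bounds let a decoder pre-size everything (only the 8-reference-frame limit comes from AV1; the 8-bit 4:2:0 layout and the "+1 frame in flight" are simplifying assumptions):

```rust
// AV1 keeps at most 8 reference frames at once.
const MAX_REF_FRAMES: usize = 8;

// Bytes for one 8-bit 4:2:0 frame: a full-size luma plane plus
// two chroma planes at quarter resolution each (together luma / 2).
fn frame_bytes(width: usize, height: usize) -> usize {
    let luma = width * height;
    let chroma = luma / 2;
    luma + chroma
}

// Worst-case buffer pool: all references plus the frame being decoded.
fn worst_case_pool_bytes(width: usize, height: usize) -> usize {
    (MAX_REF_FRAMES + 1) * frame_bytes(width, height)
}

fn main() {
    // 4K: ~12.4 MB per frame, ~112 MB for the whole pool — a fixed
    // number known as soon as the sequence header (or level) is known.
    assert_eq!(frame_bytes(3840, 2160), 12_441_600);
    assert_eq!(worst_case_pool_bytes(3840, 2160), 111_974_400);
    println!("ok");
}
```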

reply
GuB-42
13 days ago
[-]
> I suspect the future of video compression will also include frame generation

That's how most video codecs work already. They try to "guess" what the next frame will be, based on past (for P-frames) and future (for B-frames) frames. The difference is that the codec encodes some metadata to help with the process and also the difference between the predicted frame and the real frame.

As for using AI techniques to improve prediction, it is not a new thing at all. Many algorithms optimized for compression ratio use neural nets, but these tend to be too computationally expensive for general use. In fact the Hutter prize considers text compression as an AI/AGI problem.
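A toy version of that predict-plus-residual idea (nothing codec-specific, just the arithmetic on one row of pixels):

```rust
// Encoder side: store only the difference between the predicted
// frame and the actual frame.
fn residual(pred: &[u8], actual: &[u8]) -> Vec<i16> {
    pred.iter()
        .zip(actual)
        .map(|(&p, &a)| a as i16 - p as i16)
        .collect()
}

// Decoder side: prediction + residual reconstructs the frame exactly.
fn reconstruct(pred: &[u8], res: &[i16]) -> Vec<u8> {
    pred.iter()
        .zip(res)
        .map(|(&p, &r)| (p as i16 + r) as u8)
        .collect()
}

fn main() {
    let predicted = vec![10u8, 20, 30, 40];
    let actual = vec![12u8, 20, 28, 41];
    let res = residual(&predicted, &actual);
    // A good predictor makes the residual mostly near-zero,
    // which is what entropy coding then squeezes down.
    assert_eq!(res, vec![2, 0, -2, 1]);
    assert_eq!(reconstruct(&predicted, &res), actual);
    println!("ok");
}
```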

reply
lubesGordi
13 days ago
[-]
See this is interesting to me. I understand the desire to dynamically allocate buffers at runtime to capture variable size deltas. That's cool, but also still maybe technically unnecessary? Because like you say, at 4k and over 8MB per frame; you still can't allocate over a limit. So likely a codec would have some boundary set on that anyway. Why not just pre-allocate at compile time? For sure this results in a complex data structure. Functionally it could be the same and we would elide the cost of dynamic memory allocations. What I'm suggesting is probably complex, I'm sure.

In any case I get what you're saying and I understand why codecs are going to be dynamically allocating memory, so thanks for that.

reply
zimpenfish
13 days ago
[-]
> codecs are basically designed to minimize the amount of 'work' being done from frame to frame

But to do that they have to keep state and do computations on that state. If you've got frame 47 being a P frame, that means you need frame 46 to decode it correctly. Or frame 47 might be a B frame in which case you need frame 46 and possibly also frame 48 - which means you're having to unpack frames "ahead" of yourself and then keep them around for the next decode.

I think that all counts as "dynamic state"?

reply
wtallis
13 days ago
[-]
Memory usage can vary, but video codecs are designed to make it practical to derive bounds on those memory requirements because hardware implementations don't have the freedom to dynamically allocate more silicon.
reply
dylan604
13 days ago
[-]
Maybe you're not familiar with how long-GOP encoding works with IPB frames? If all frames were I-frames, maybe what you're thinking might work. Everything you need is in the one frame to be able to describe every single pixel in that frame. Once you start using P-frames, you have to hold on to data from the I-frame to decode the P-frame. With B-frames, you might need data from frames not yet decoded, as they are bi-directional references.
reply
lubesGordi
13 days ago
[-]
Still you don't necessarily need to have dynamic memory allocations if the number of deltas you have is bounded. In some codecs I could definitely see those having a varying size depending on the amount of change going on in the scene.

I'm not a codec developer, I'm only coming at this from an outside/intuitive perspective. Generally, performance concerned parties want to minimize heap allocations, so I'm interested in this as how it applies in codec architecture. Codecs seem so complex to me, with so much inscrutable shit going on, but then heap allocations aren't optimized out? Seems like there has to be a very good reason for this.

reply
izacus
13 days ago
[-]
You're actually right about allocation - most video codecs are written with hardware decoders in mind which have fixed memory size. This is why their profiles hard limit the memory constraints needed for decode - resolution, number of reference frames, etc.

That's not quite the case for encoding - that's where things get murky since you have way more freedom at what you can do to compress better.

reply
Sesse__
13 days ago
[-]
The very good reason is that there's simply not a lot of heap allocations going on. It's easy to check; run perf against e.g. ffmpeg decoding a big file to /dev/null, and observe the distinct lack of malloc high up in the profile.

There's a heck of a lot of distance from “not a lot” to “zero”, though.

reply
throwawaymaths
13 days ago
[-]
compression algorithms can get very clever in recursive ways
reply
lubesGordi
13 days ago
[-]
Hey maybe we can discuss why I'm being downvoted? This is a technical discussion and I'm contributing. If you disagree then say why. I'm not stating anything as fact that isn't fact. I am getting downvoted for asking a question.
reply
mbeavitt
13 days ago
[-]
Haha I was just thinking to myself "I wonder if anyone made any progress on that rav1d bounty yet?"
reply
infogulch
13 days ago
[-]
You know it's a good post when it starts with a funny meme. Seems related to the recent discussion: $20K Bounty Offered for Optimizing Rust Code in Rav1d AV1 Decoder (memorysafety.org) | 108 comments | https://news.ycombinator.com/item?id=43982238
reply
HappyPanacea
13 days ago
[-]
A clear case of Nominative determinism!
reply
lubesGordi
13 days ago
[-]
Honestly it's a little surprising the first optimization he found was something fairly obvious just by using perf. I thought they had discussed the zeroing buffers issue in the first post? The second optimization was definitely more involved/interesting but was still pointed at by perf. Don't underestimate that tool!
reply
Sesse__
13 days ago
[-]
AFAICS, it wasn't “just perf”; it was doing a differential profile between the C and Rust versions, with manual matching up. (perf diff exists, but can't match across the differing symbol names, and few people seem to use it.)
reply
sounds
13 days ago
[-]
He came from the aarch64 perspective on an Apple device. I often experience someone spotting an "obvious in hindsight" gap because they come from a different background.
reply
smallpipe
13 days ago
[-]
This is really fun. Is there anything stopping rustc from performing the transmute trick ?

Edit: If I had read the next paragraph, I'd have learned about [1] before commenting

[1] https://github.com/rust-lang/rust/issues/140167

reply
Mr_Eri_Atlov
13 days ago
[-]
AV1 continues to be the most fascinating development in media encoding.

SVT-AV1-PSY is particularly interesting to read up on as well.

reply
jebarker
13 days ago
[-]
Beautiful work and nice write-up. Profiling and optimization is absolutely my favorite part of software development.
reply
anon-3988
13 days ago
[-]
Is skipping initialization of buffers a hard problem for compilers?
reply
brigade
13 days ago
[-]
It’s especially hard to elide the compiler initialization when the intended initialization is by a function written in assembly
reply
adgjlsfhk1
13 days ago
[-]
yeah. Proving that the zero initialization is useless requires proving that the rest of the program never reads one of the zeroed values. This is really difficult because compilers generally don't track individual array indices (since you often don't even know how big the array is)
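A small sketch of the difference in Rust (illustrative only; rav1d's actual buffers are more involved, and its real fills happen in assembly):

```rust
// vec![0; n] zeroes the whole allocation up front; the compiler can only
// drop that work if it proves every element is overwritten before being
// read, which it rarely can across function (or language) boundaries.
fn make_zeroed(n: usize) -> Vec<u8> {
    vec![0u8; n]
}

// with_capacity allocates without initializing; each push initializes
// exactly one element, so no byte is ever written twice.
fn make_uninit_then_fill(n: usize) -> Vec<u8> {
    let mut buf = Vec::with_capacity(n);
    for i in 0..n {
        buf.push((i % 256) as u8);
    }
    buf
}

fn main() {
    assert_eq!(make_zeroed(4), vec![0, 0, 0, 0]);
    assert_eq!(make_uninit_then_fill(4), vec![0, 1, 2, 3]);
    println!("ok");
}
```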
reply
empath75
13 days ago
[-]
It's easy to not initialize the buffer, the hard part is guaranteeing that it's safe to read something that might not be initialized.
reply
mastax
13 days ago
[-]
In this case I assume the difficulty is the initialization happens in assembly which the compiler has no visibility into.
reply
sylware
12 days ago
[-]
I don't understand this project. dav1d is 99% assembly (x86_64/risc-v 64bits/etc) with very little simple and plain C as coordinating code.
reply
canucker2016
12 days ago
[-]
The read-only dav1d github repo says 79.8% assembly, 19.7% C language, 0.5% other.

see https://github.com/videolan/dav1d

reply
sylware
12 days ago
[-]
omg... how is it possible to miss the point that much?
reply
saagarjha
12 days ago
[-]
I am very curious what you did to embed the profiler results into your blog post. Literally copy the HTML nodes?
reply
ohr
11 days ago
[-]
Used the Save Page WE Chrome extension to capture the html (after cleaning up with inspect element + delete), and added a bit of custom JavaScript to scroll everything to the right place. Needed the extension for the styling to be captured correctly.
reply
mdf
13 days ago
[-]
There's something about real optimization stories that I find fascinating – particularly the detailed ones including step-by-step improvements and profiling to show how numbers got better. In some way, they are satisfying to read.

Nicholas Nethercote's "How to speed up the Rust compiler" writings[1] fall into this same category for me.

Any others?

[1] https://nnethercote.github.io/

reply
ohr
13 days ago
[-]
(Author here) I'm a huge fan of the "How to speed up the Rust compiler" series! I was hoping to capture the same feeling :)
reply
dirtyhippiefree
13 days ago
[-]
Having your last name be Ravid really is the icing on your cake.

Real is about the only other codec I see that could be a name, but nobody uses that anymore.

reply
aidenn0
13 days ago
[-]
Do your part: name your kids "ffmpeg" and "vp-IX"!
reply
Voultapher
13 days ago
[-]
Since you seem to enjoy this kind of writing, I'd love to get your feedback on something I wrote a while back about branchless partitioning [1]. Despite being, content-wise, the most work to create of everything I've written on the topic, it got much less attention than my other writing. So far I've wondered if it was maybe too technical? Would love an honest opinion.

[1] https://github.com/Voultapher/sort-research-rs/blob/main/wri...

reply
mdf
12 days ago
[-]
Just finished reading your linked article. I found it interesting and I experienced similar excitement from the results as mentioned up-thread. There were some new things I learned, too.

I wouldn't say your article is too technical; it does go a bit deeper into details, but new concepts are explained well and at a level I found suitable for myself. Having said that, several times I felt that the text was a bit verbose. Using more succinct phrasing needs, of course, a lot of additional effort, but… I guess it's a kind of an optimization as well. :)

reply
Voultapher
11 days ago
[-]
Thx for taking the time and glad to hear you enjoyed it. I keep being impressed by people like Cory Doctorow that can express nearly every sentence they write extremely succinctly and on the point. That's something I aspire to, so hopefully next time I'm a little better at it :)
reply
dpacmittal
13 days ago
[-]
I read an article a while ago where the goal was to process a file as fast as possible, and it talked about compressing the data chunks so they fit in L1 cache. The cache misses were slower than compressing and decompressing the data within L1 cache.

I've been trying to find that article ever since, but haven't been able to. Does anyone know the article I'm talking about?

reply
IgorPartola
13 days ago
[-]
AV1 is an amazing codec. I really hope it replaces proprietary codecs like h264 and h265. It has a similar, if not better, performance to h265 while being completely free. Currently on an Intel-based Macbook it is only supported in some browsers, however it seems that newer video cards from AMD, Nvidia, and Intel do include hardware decoders.
reply
flashblaze
13 days ago
[-]
I'm not really well versed with codecs, but is it up to the devices or the providers (where you're uploading them) to handle playback or both? A couple of days ago, I tried to upload an Instagram Reel in AV1 codec, and I was struggling to preview it on my Samsung S20 FE Snapdragon version (before uploading and during preview as well). I then resorted to H.264 and it worked w/o any issues.
reply
sparrc
13 days ago
[-]
Playback is 100% handled by the device. The primary (and essentially only) benefit of H.264 is that almost every device in the entire world has an H.264 hardware decoder builtin to the chip, even extremely cheap devices.

AV1 hardware decoders are still rare so your device was probably resorting to software decoding, which is not ideal.

reply
kevmo314
13 days ago
[-]
Instagram (the provider) will transcode for compatibility but likely the preview is before transcoding, the assumption being that the device that uploads the video is able to play it.
reply
ta1243
13 days ago
[-]
Yes that sounds spot on.

I don't know Instagram, but I would expect any provider to handle almost any container/codec/resolution combination going (they likely use ffmpeg underneath) and generate their different output formats at different bitrates for different playback devices.

Either instagram won't accept av1 (seems unlikely) or they just haven't processed it yet as you infer.

I'd love to know why your comment is greyed out.

reply
karn97
13 days ago
[-]
The 9070 XT records gameplay in AV1 by default.
reply
monster_truck
13 days ago
[-]
RDNA3 cards also have AV1 encode. RDNA 2 only has decode.

With the bitrate set to 100MB/s it happily encodes 2160p or even 3240p, the maximum resolution available when using Virtual Super Resolution (which renders at >native res and downsamples, is awesome for titles without resolution scaling when you don't want to use TAA)

reply
kennyadam
13 days ago
[-]
Isn't that expected? 4K Blurays only encode up to like 128Mbps, which is 16MB/s. 100MB/s seems like complete overkill.
reply
vlovich123
13 days ago
[-]
I think op just didn’t type Mbps properly. 100MB/s or ~800Mbps is way higher than the GPU can even encode at a HW level even I would think
reply
monster_truck
12 days ago
[-]
100,000kbps. It will more than double that for 3240p.

https://i.imgur.com/LyrhNXZ.png

reply
vlovich123
12 days ago
[-]
Right. That's 223,642 kilobits/s (kbps) in your picture, or ~200 Mbit/s, whereas you wrote (intentionally or otherwise) 100 MByte/s, nearly a 10-fold difference (100 Mbit/s ≈ 12 MByte/s). 100 MByte/s is 800 Mbit/s, or ~800,000 kbps, which is an order of magnitude more insanity than already choosing 100 Mbit/s for live streaming (and not physically possible on consumer GPUs, I believe).
reply
monster_truck
12 days ago
[-]
It isn't for the amount of motion involved. Third person views in rally simulators and continuous fast flicks in shooters require it
reply
rasz
12 days ago
[-]
Is the encoder any better than previous AMD offerings?

https://goughlui.com/2024/01/07/video-codec-round-up-2023-pa...

reply
adzm
13 days ago
[-]
Isn't VP9 more comparable to h265? AV1 seems to be a ton better than both of them.
reply
senfiaj
13 days ago
[-]
I think VP9 is more comparable to H.264. Also, if I'm not mistaken, it's not good for live streaming, only for stored content.
reply
toast0
13 days ago
[-]
VP9 works for live streaming/real time conferencing too.
reply
senfiaj
13 days ago
[-]
Yeah, but I think it has much higher CPU usage, at least when there is no native hardware decoder/encoder. Maybe this has more to do with adoption, since H264 has been an industry standard.
reply
toast0
13 days ago
[-]
Codec selection is always a complex task. You've got to weigh quality/bitrate vs availability of hardware encode/decode, licensing, and overall resource usage.

The ITU standards have had a lot better record of inclusion in devices that people actually have; and often using hardware encode/decode takes care of licensing. But hardware encode doesn't always have the same quality/bitrate as software and may not be able to do fancier things like simulcast or svc. Some of the hardware decoders are pretty picky about what kinds of streams they'll accept too.

IMHO, if you're looking at software h.264 vs software vp9, I think vp9 is likely to give you better quality at a given bitrate, but will take more cpu to do it. So, as always, it depends.

reply
Dylan16807
12 days ago
[-]
> IMHO, if you're looking at software h.264 vs software vp9, I think vp9 is likely to give you better quality at a given bitrate, but will take more cpu to do it. So, as always, it depends.

That's a pretty messy way to measure. h.264 with more CPU can also beat h.264 with less CPU.

How does the quality compare if you hold both bitrate and CPU constant?

How does the CPU compare if you hold both bitrate and quality constant?

AV1 will do significantly better than h.264 on both of those tests. How does VP9 do?

reply
dagmx
13 days ago
[-]
They’re all in the same ballpark of each other and have characteristics that don’t make one an outright winner.
reply
CharlesW
13 days ago
[-]
AV1 is the outright winner in terms of compression efficiency (until you start comparing against VVC/H.266¹), with the advantage being even starker at high resolutions. The only current notable downside of AV1 is that client hardware support isn't yet universal.

¹ https://www.mdpi.com/2079-9292/13/5/953

reply
aaron695
13 days ago
[-]
Get The Scene involved.

They shifted to h.264 successfully, but I haven't heard of any more conferences to move forward in over a decade.

Currently "The Last of US S02E06" only has one AV1 - https://thepiratebay.org/search.php?q=The+Last+of+Us+S02E06 same THMT - https://thepiratebay.org/search.php?q=The+Handmaids+Tale+S06... These are low quality at only ~600MB, not really early adopter sizes.

AV1 beats h.265 but not h.266 - https://www.preprints.org/manuscript/202402.0869/v1 - People disagree with this paper on default settings

Things like getting hardware to The Scene for encoding might help, but I'm not sure of the bottleneck, it might be bureaucratic or educational or cultural.

[edit] "Common Side Effects S01E04" AV1 is the strongest torrent, that's cool - https://thepiratebay.org/search.php?q=Common+Side+Effects+S0...

reply
aidenn0
13 days ago
[-]
At higher qualities/bitrates, the difference is much smaller, and device support is universal for AVC and quite good for HEVC. Anything over 1.5GB for a single episode would probably end up fairly similarly sized with AV1.

There is one large exception, but I don't know the current scene well enough to know if it matters: sources that are grainy. I have some DVD and blurays with high grain content and AV1 can work wonders with those thanks to the in-loop grain filter and synthesis -- we are talking half the size for a high-quality encode. If I were to encode them for AVC at any reasonable bitrate, I would probably run a grain-removal filter which is very finicky if you don't want to end up with something that is overly blurry.

reply
LtdJorge
13 days ago
[-]
This may be in part because people who have automated their media servers are using hardware acceleration for transcoding (from 4K, for example), and hardware has only recently added decoding for AV1.

In my case, I get both 4k (h265) and 1080p (h264) blurays and let the client select.

reply
phendrenad2
13 days ago
[-]
Holy shadowban Batman! All of your comments are [dead]. What did you do to anger the HN Gods?
reply
fishgoesblub
13 days ago
[-]
There are plenty of AV1 releases in other, better places than the scam bay.
reply
wbl
13 days ago
[-]
There was a conference?!
reply
kasabali
11 days ago
[-]
Sponsored by FBI probably :p
reply