Fast forward several years, and the cryptocurrency craze drove up GPU prices for many years without even touching the floating-point capabilities. Now, FP64 is out because of ML, a field that's almost unrecognizable compared to where it was during the first few years of CUDA's existence.
NVIDIA has been very lucky over the course of its history, but has also done a great job of reacting to new workloads and use cases. Still, those shifts have definitely created some awkward moments where its existing strategies and roadmaps were upended.
Some of the near misses I remember included bitcoin. Many of the other attempts didn't ever see the light of day.
Luck in English often means success by chance rather than one's own efforts or abilities. I don't think that characterizes CUDA. I think it was eventual success in the face of extreme difficulty, many failures, and sacrifices. In hindsight, I'm still surprised that Jensen kept funding it as long as he did. I've never met a leader since who I think would have done that.
Luck is when preparation meets opportunity.
When they couldn't deliver the console GPU they promised for the Dreamcast (the NV2), Shoichiro Irimajiri, the Sega CEO at the time, let them keep the cash in exchange for stock [0].
Without it Nvidia would have gone bankrupt months before Riva 128 changed things.
Sega's console arm went bust, not that it mattered, but they sold the stock for about $15mn (3x).
Had they held it, Jensen Huang estimated it'd be worth a trillion [1]. Obviously Sega, and especially its console arm, wasn't really into VC, but...
My wet dream has always been: what if Sega and Nvidia had stuck together and we had a Sega Tegra Shield instead of a Nintendo Switch? Or even, what if Sega licensed itself to the Steam Deck? You can tell I'm a Sega fanboy, but I can't help that the Mega Drive was the first console I owned and loved!
[0] https://www.gamespot.com/articles/a-5-million-gift-from-sega...
I remember ATI and Nvidia were neck-and-neck to launch the first GPUs around 2000. Just so much happening so fast.
I'd also say Nvidia had the benefit of AMD going after and focusing on Intel both at the server level as well as the integrated laptop processors, which was the reason they bought ATI.
Let's say X=10% of the GPU area (~75mm^2) is dedicated to FP32 SIMD units. Assume FP64 units are ~2-4x bigger. That would be 150-300mm^2, a huge amount of area that would increase the price per GPU. You may not agree with these assumptions. Feel free to change them. It is an overhead that is replicated per core. Why would gamers want to pay for any features they don't use?
Not to say there isn't market segmentation going on, but FP64 cost is higher for massively parallel processors than it was in the days of high frequency single core CPUs.
I'm pretty sure that's not a remotely fair assumption to make. We've seen architectures that can, e.g., do two FP32 operations or one FP64 operation with the same unit, with relatively low overhead compared to a pure FP32 architecture. That's pretty much how all integer math units work, and it's not hard to pull off for floating point. FP64 units don't have to be, and seldom have been, implemented as massive single-purpose blocks of otherwise-dark silicon.
When the real hardware design choice is between having a reasonable 2:1 or 4:1 FP32:FP64 ratio vs having no FP64 whatsoever and designing a completely different core layout for consumer vs pro, the small overhead of having some FP64 capability has clearly been deemed worthwhile by the GPU makers for many generations. It's only now that NVIDIA is so massive that we're seeing them do five different physical implementations of "Blackwell" architecture variants.
I'm not a hardware guy, but an explanation I've seen from someone who is says that it's not much extra hardware to add to a 2×f32 FMA unit the capability to do 1×f64. You already have all of the per-bit logic, you mostly just need to add an extra control line to make a few carries propagate. So the size overhead of adding FP64 to the SIMD units is more like 10-50%, not 100-300%.
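To make the disagreement concrete, here is a rough back-of-the-envelope sketch of the two sets of assumptions from this subthread. The 75 mm^2 FP32 area and the 2-4x and 10-50% factors come from the comments above; the 750 mm^2 die size is only what "10% of the GPU area (~75mm^2)" implies, and the function name is made up for illustration. None of these are measured figures.

```python
# Back-of-the-envelope comparison of the FP64 area estimates in this thread.
# All inputs are the commenters' assumptions, not measured die data.

DIE_AREA_MM2 = 750.0   # implied by "10% of the GPU area (~75mm^2)"
FP32_AREA_MM2 = 75.0   # area assumed to be spent on FP32 SIMD units


def extra_fp64_area_mm2(fp32_area_mm2: float, factor: float) -> float:
    """Extra area if FP64 support costs `factor` times the FP32 SIMD area."""
    return fp32_area_mm2 * factor


# Scenario label -> FP64 cost relative to the FP32 SIMD area.
scenarios = {
    "separate FP64 units, 2x FP32 size": 2.0,
    "separate FP64 units, 4x FP32 size": 4.0,
    "shared 2xf32/1xf64 FMA, +10%": 0.10,
    "shared 2xf32/1xf64 FMA, +50%": 0.50,
}

for label, factor in scenarios.items():
    extra = extra_fp64_area_mm2(FP32_AREA_MM2, factor)
    share = 100.0 * extra / DIE_AREA_MM2
    print(f"{label:36s} {extra:7.1f} mm^2  ({share:4.1f}% of die)")
```

Under the first set of assumptions FP64 costs 150-300 mm^2; under the shared-unit explanation it is closer to 7.5-37.5 mm^2, which is why the two commenters reach opposite conclusions.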
Obviously they don't want to. Now flip it around and ask why HPC people would want to force gamers to pay for something that benefits the HPC people... Suddenly the blog post makes perfect sense.
See the NVIDIA GeForce RTX 3060 LHR, which tried to hinder mining at the BIOS level.
The point wasn't to make the average person lose out by preventing them mining on their gaming GPU. But to make miners less inclined to buy gaming GPUs. They also released a series of crypto mining GPUs around the same time.
So fairly typical market segmentation.
https://videocardz.com/newz/nvidia-geforce-rtx-3060-anti-min...
https://www.eatyourbytes.com/list-of-gpus-by-processing-powe...
Past a certain threshold of FP64 throughput, your chip goes in a separate category and is subject to more regulation about who you can sell to and know-your-customer. FP32 does not matter for this threshold.
https://en.wikipedia.org/wiki/Adjusted_Peak_Performance
It is not a market segmentation tactic and has been around since 2006. It's part of the mind-numbing annual export control training I get to take.
I do think, though, that Nvidia generally didn't see much need for more FP64 in consumer GPUs, since they wrote in the Ampere (RTX 3090) white paper: "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code."
I'll try adding an additional graph where I plot the APP values for all consumer GPUs up to 2023 (when the export control regime changed) to see if the argument of Adjusted Peak Performance for FP64 has merit.
Do you happen to know, though, whether GPUs count as vector processors under these regulations, since the weighting factor changes depending on the definition?
https://www.federalregister.gov/documents/2018/10/24/2018-22... What I found so far is that under Note 7 it says: "A ‘vector processor’ is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each."
Nvidia GPUs have only 32 threads per warp, so I suppose they don't count as a vector processor (which seems a bit weird but who knows)?
Only two of these examples meet the definition of vector processor, and these are very clearly classical vector processor computers, the Cray X1E and the NEC SX-8 (as in, if you're preparing a guide on the historical development of vector processing, you're going to be explicitly including these systems or their ancestors as canonical examples of what you mean by a vector supercomputer!). And the definition is pretty clearly tailored to make sure that SIMD units in existing CPUs wouldn't qualify as vector processors.
The interesting case to point out is the last example, a "Hypothetical coprocessor-based Server" which hypothetically describes something that is actually extremely similar to the result of GPGPU-based HPC systems: "The host microprocessor is a quad-core (4 processors) chip, and the coprocessor is a specialized chip with 64 floating-point engines operating in parallel, attached to the host microprocessor through a specialized expansion bus (HyperTransport or CSI-like)." This hypothetical system is not a "vector processor," it goes on to explain.
From what I can find, it seems that neither NVIDIA nor the US government considers GPUs to count as vector processors, and thus they get the 0.3 rather than the 0.9 weight.
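For a sense of how much the weight matters, here is a minimal sketch of the APP calculation as I understand it from the linked Wikipedia page: sum the 64-bit (or larger) floating-point rate of each processor in TFLOPS, multiplied by 0.9 for vector processors or 0.3 otherwise, giving a result in Weighted TeraFLOPS (WT). The function name and the 1.0 TFLOPS example are hypothetical, not real GPU specs, and the actual regulation has more notes and corner cases than this.

```python
# Minimal sketch of Adjusted Peak Performance (APP), in Weighted TeraFLOPS.
# Weights are the 0.9 / 0.3 factors discussed above; inputs are hypothetical.

W_VECTOR = 0.9       # weight for processors that meet the "vector" definition
W_NON_VECTOR = 0.3   # weight for everything else


def adjusted_peak_performance(fp64_tflops, is_vector):
    """Sum of W * R over processors, where R is the FP64 rate in TFLOPS."""
    weight = W_VECTOR if is_vector else W_NON_VECTOR
    return sum(weight * rate for rate in fp64_tflops)


# Hypothetical accelerator with 1.0 TFLOPS of FP64 throughput.
rates = [1.0]
print("non-vector:", adjusted_peak_performance(rates, is_vector=False), "WT")
print("vector:    ", adjusted_peak_performance(rates, is_vector=True), "WT")
```

The same chip lands three times higher on the scale if it is classified as a vector processor, so the classification question is not a technicality.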
I’d say it’s better than theory: you can definitely use float2 pairs of fp32 floats to emulate higher precision. Quad precision too, using float4. Here’s the code: https://andrewthall.com/papers/df64_qf128.pdf
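For anyone who hasn't seen the float2 trick, here is a small sketch of the idea in Python. This is not Thall's code: it is a simplified double-float addition built on Knuth's error-free two-sum, with fp32 emulated by round-tripping through `struct`. The helper names (`f32`, `two_sum`, `df64_add`) are my own, and the addition is the "sloppy" variant, so treat it as an illustration rather than a faithful or IEEE-compliant implementation.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a: float, b: float):
    """Knuth's two-sum in fp32: returns (s, e) with s = fl(a + b), s + e = a + b."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def df64_add(x, y):
    """Add two (hi, lo) fp32 pairs; simplified, not fully accurate in all cases."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)                # renormalize so that |lo| stays small
    lo = f32(e - f32(hi - s))
    return hi, lo


def df64_to_float(x) -> float:
    """Collapse a (hi, lo) pair into a Python float for inspection."""
    return x[0] + x[1]


tiny = 2.0 ** -30                  # far below fp32 resolution at 1.0
plain = f32(1.0 + tiny)            # plain fp32 drops the tiny term
paired = df64_add((1.0, 0.0), (tiny, 0.0))
print(plain == 1.0, df64_to_float(paired) - 1.0 == tiny)
```

The pair keeps the small term in `lo` instead of rounding it away, which is where the extra precision comes from. Each emulated add costs roughly ten fp32 operations here, which is consistent with the slowdown described in this comment.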
Also note it’s easy to emulate fp64 using entirely integer instructions. (As a fun exercise, I attempted both doubles and quads in GLSL: https://www.shadertoy.com/view/flKSzG)
While it’s relatively easy to do, these approaches are a lot slower than fp64 hardware. My code is not optimized, not IEEE compliant, and not bug-free, but the emulated doubles are at least an order of magnitude slower than fp32, and the quads are two orders of magnitude slower. I don’t think Andrew Thall’s df64 can achieve a 1:4 float to double perf ratio either.
And not sure, but I don’t think CUDA SMs are vector processors per se, and not because of the fixed warp size, but more broadly because of the design & instruction set. I could be completely wrong though, and Tensor Cores totally might count as vector processors.
Myelin is TensorRT's internal graph compiler. Source tree structure:
```
/nvidia/gpgpu/MachineLearning/myelin/src/
├── api_wrap/
│   ├── graph_wrap.cpp
│   ├── op_wrap.cpp
│   ├── op_wrap.h
│   └── tensor_wrap.cpp
├── common/
│   ├── cask6_base.cpp  # CASK 6 = Blackwell
│   ├── cask6_common.cpp
│   ├── cask6_launched_shader.cpp
│   ├── cask_base.cpp
│   ├── cask_common.cpp
│   ├── convolution_common.cpp
│   ├── device_utils.cpp
│   ├── myelin_binary.cpp
│   ├── myelin_operation.cpp
│   ├── myelin_session.cpp
│   └── ...
├── compiler/
│   ├── analysis/
│   │   ├── alias.cpp
│   │   ├── dom.cpp
│   │   ├── graph_io_alias.cpp
│   │   ├── implicit_padding.cpp
│   │   ├── l2_cache_management.cpp
│   │   ├── loops.cpp
│   │   ├── shape.cpp
│   │   ├── ssa.cpp
│   │   ├── tensorify.cpp
│   │   ├── type.cpp
│   │   └── verify.cpp
│   ├── codegen/
│   │   ├── codegen.cpp
│   │   ├── data.cpp
│   │   └── ops.cpp
│   ├── global_allocator/
│   │   ├── block_allocator.cpp
│   │   ├── happens_before.cpp
│   │   ├── interference_graph.cpp
│   │   ├── list_allocator.cpp
│   │   ├── live_analysis.cpp
│   │   ├── live_interval.cpp
│   │   ├── memory_allocator.cpp
│   │   ├── overlap.cpp
│   │   ├── region_allocator.cpp
│   │   ├── reorder_ops_to_reduce_live_set.cpp
│   │   └── tensor_allocator.cpp
│   ├── ir/
│   │   ├── explicit_dds.cpp
│   │   ├── multidef_map_t.cpp
│   │   └── operation/pointwise_op.cpp
│   ├── ir_builder/
│   │   └── ir_sequence_builder.cpp
│   ├── kernel_gen/
│   │   ├── align_stride_order.cpp
│   │   ├── align_stride_order_heuristic.cpp
│   │   ├── convert_move_to_transpose.cpp
│   │   ├── dag.cpp
│   │   ├── kernel_gen.cpp
│   │   ├── kernel_gen_ds.cpp
│   │   ├── kernel_gen_utils.cpp
│   │   ├── kgen.cpp
│   │   ├── kgen_prelim_transforms.cpp
│   │   ├── knode.cpp
│   │   ├── myl_fusion_heuristics.cpp
│   │   ├── partition_dag.cpp
│   │   ├── cuda_codegen/
│   │   │   ├── crd_computation.cpp
│   │   │   ├── cuda_codegen.cpp  # 9000+ lines
│   │   │   ├── cuda_codegen_fc.cpp
│   │   │   ├── cuda_codegen_flash_decode.cpp
│   │   │   ├── cuda_codegen_pattern.cpp
│   │   │   ├── cuda_codegen_utils.cpp
│   │   │   ├── cuda_compile.cpp
│   │   │   ├── cuda_dynq_op.cpp
│   │   │   ├── cuda_norm_op.cpp
│   │   │   ├── cuda_reduce_op.cpp
│   │   │   ├── cuda_tma_gnorm_op.cpp
│   │   │   ├── generate_block_group_norm_kernel.cpp
│   │   │   ├── generate_block_inst_norm_kernel.cpp
│   │   │   ├── generate_cumsum_kernel.cpp
│   │   │   ├── generate_dynq_kernel.cpp
│   │   │   ├── generate_norm_kernel.cpp
│   │   │   ├── generate_reduction_kernel.cpp
│   │   │   ├── generate_tma_group_norm_kernel.cpp
│   │   │   ├── nvcc_compile.cpp
│   │   │   ├── nvrtc_compile.cpp
│   │   │   ├── op_scheduler.cpp
│   │   │   ├── permutation_utils.cpp
│   │   │   ├── source_guard.cpp
│   │   │   ├── stream_tile_size_solver.cpp
│   │   │   ├── vectorizer.cpp
│   │   │   └── xqa_codegen.cpp  # XQA = Cross-Query Attention
│   │   ├── fusion_codegen/
│   │   │   ├── backbone_fuser.cpp
│   │   │   ├── cask_codegen.cpp
│   │   │   ├── fir_builder.cpp
│   │   │   └── fusion_codegen.cpp
│   │   └── mlir_codegen/
│   │       ├── cask6_api.cpp  # Blackwell API
│   │       ├── collective_codegen.cpp
│   │       ├── mlir_b2b_gemm_emitter.cpp  # Back-to-back GEMM
│   │       ├── mlir_builder.cpp
│   │       ├── mlir_codegen.cpp
│   │       ├── mlir_codegen_utils.cpp
│   │       ├── mlir_dual_gemm_emitter.cpp  # Dual GEMM (MoE?)
│   │       ├── mlir_emitter_base.hpp
│   │       ├── mlir_epilogue_emitter.cpp
│   │       ├── mlir_fusion_codegen.cpp
│   │       ├── mlir_gemm_base_emitter.cpp
│   │       ├── mlir_params.cpp
│   │       ├── tile_aa_codegen.cpp  # TileAA codegen
│   │       └── tile_aa_epilogue.cpp
│   └── optimizer/
│       ├── autotuner.cpp
│       ├── autotuner_cache.cpp
│       ├── cache_model.cpp
│       ├── canonicalize_ops.cpp
│       ├── cask6_compile_utils.cpp
│       ├── cask_compile_utils.cpp
│       ├── cask_heuristics.cpp
│       ├── cask_impl.cpp
│       ├── cast_elimination.cpp
│       ├── common_utils.cpp
│       ├── const_ppg.cpp
│       ├── conv_act_pool_fusion.cpp
│       ├── conv_lowering.cpp
│       ├── conv_w2c_transformation.cpp
│       ├── copy_ppg.cpp
│       ├── cost-model.cpp
│       ├── create_shape_context.cpp
│       ├── custom_layer_alias_io.cpp
│       ├── custom_layer_internal_transform.cpp
│       ├── dce.cpp
│       ├── dds_output_scan_dim_to_transform.cpp
│       ├── dds_output_size_tensors_t.cpp
│       ├── decompose_composite_ops.cpp
│       ├── deconv_lowering.cpp
│       ├── dep_sep_fusion.cpp
│       ├── einsum_helper.cpp
│       ├── einsum_transformer.cpp
│       ├── enable_batch_matmul.cpp
│       ├── extract_reverse_iterator.cpp
│       ├── fc_lowering.cpp
│       ├── formats.cpp
│       ├── formats_util.cpp
│       ├── fusion_op_inliner.cpp
│       ├── fusion_op_lowering.cpp
│       ├── fusion_pipeline.cpp
│       ├── fusion_rewrite.cpp
│       ├── fusion_rewrite_utils.cpp
│       ├── graph_jit.cpp
│       ├── gvn/gvn.cpp
│       ├── gvn/value_number_adv.cpp
│       ├── gvn/value_number_op_attrs.cpp
│       ├── inline.cpp
│       ├── inout_variant.cpp
│       ├── instCombine.cpp
│       ├── iv_lowering.cpp
│       ├── kgen_cache.cpp
│       ├── kqv_gemm_split.cpp  # K/Q/V GEMM splitting
│       ├── kqv_gemm_split_utils.cpp
│       ├── kvcache_update_lowering.cpp
│       ├── l2tc_cost_model.cpp
│       ├── l2tc_opt.cpp
│       ├── l2tc_relations.cpp
│       ├── l2tc_scheduler.cpp
│       ├── l2tc_utils.cpp
│       ├── layouts.cpp
│       ├── lgtc_opt.cpp
│       ├── lgtc_scheduler.cpp
│       ├── linearize.cpp
│       ├── loop_conv.cpp
│       ├── loop_fusion.cpp
│       ├── loop_unroll.cpp
│       ├── lower_dds_scanout.cpp
│       ├── lower_hwc_group_norm_op.cpp
│       ├── lower_hwc_inst_norm_op.cpp
│       ├── lower_inst_norm.cpp
│       ├── lower_reduce.cpp
│       ├── match_dual_gemm.cpp
│       ├── match_gemv_mha.cpp  # GEMV MHA (decode phase)
│       ├── match_group_norm.cpp
│       ├── match_ragged_mha.cpp  # Ragged/variable-length MHA
│       ├── match_vertical_small_batched_gemm.cpp
│       ├── memory_prop_ppg.cpp
│       ├── merge_bbs.cpp
│       ├── mha_fusion.cpp  # MHA fusion
│       ├── mha_matching_utils.cpp
│       ├── mha_prologue_fusion.cpp  # MHA prologue fusion
│       ├── mmdit_combine_modal.cpp
│       ├── pattern_rewriter.cpp
│       ├── peephole.cpp
│       ├── peephole_epilogue_fusion.cpp
│       ├── peephole_h_fusion_util.cpp
│       ├── peephole_prologue_fusion.cpp
│       ├── quantize_lowering.cpp
│       ├── quantize_ppg.cpp
│       ├── reduce_copy_fusion.cpp
│       ├── reduce_fusion_base.cpp
│       ├── refit.cpp
│       ├── replace_empty_input_with_zeros.cpp
│       ├── replicate_ppg.cpp
│       ├── reset_elim.cpp
│       ├── reshape_mapping.cpp
│       ├── reshape_ppg_1d_conv.cpp
│       ├── reshape_ppg.cpp
│       ├── resize_lowering.cpp
│       ├── scan_fusion.cpp
│       ├── sequence_lowering.cpp
│       ├── shape_call_lowering.cpp
│       ├── shape_graph_opt.cpp
│       ├── simplify_attention_op.cpp  # Attention simplification
│       ├── slice_fill_conv_fusion.cpp
│       ├── slice_fusion.cpp
│       ├── slice_lowering.cpp
│       ├── speed_of_light.cpp
│       ├── split_slice_elim.cpp
│       ├── stream_sched.cpp
│       ├── subgraph_rewriter.cpp
│       ├── symbolic_global_padding.cpp
│       ├── symbolic_global_padding_impl.cpp
│       ├── synthesize_inst.cpp
│       ├── tactic.cpp
│       ├── tactics_tiled_layout.cpp
│       ├── transform_batch_matmul.cpp
│       ├── transform_multi_batch.cpp
│       ├── transpose_fusion.cpp
│       ├── transpose_ppg.cpp
│       ├── tunable_graph.cpp
│       └── wrap_attention_op_in_kgen.cpp  # Wrap attention in kgen
└── executor/
    ├── analysis/user-buffer.cpp
    └── instruction/cask_fusion.cpp
```
*Total: 576 source file paths leaked*
---
Weird way to frame delivering exactly what the consumer wants as a big market-segmentation, fuck-the-user conspiracy.