Operating System: Windows 11
Filename: cuda_12.4.0_551.61_windows.exe
The latest versions of software can sometimes cause issues when installed on older devices or on devices running an older version of the operating system.
Software makers usually fix these issues, but it can take them some time. In the meantime, you can download and install an older version: NVIDIA CUDA Toolkit 12.4.0 (for Windows 11).
What's new in this version:
CUDA Components:
- Starting with CUDA 11, the various components in the toolkit are versioned independently.
CUDA Driver:
Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 3. For more information on the various GPU products that are CUDA-capable, visit
- Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.
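The backward-compatibility rule amounts to a version comparison. The Python sketch below assumes the integer encoding reported by the CUDA runtime's `cudaDriverGetVersion()` and `cudaRuntimeGetVersion()` (1000 * major + 10 * minor, so 12.4 is 12040). The helper names are hypothetical, and the check deliberately ignores minor-version compatibility and forward-compatibility packages.

```python
# Hypothetical helpers; CUDA encodes a version as 1000 * major + 10 * minor.
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Split an encoded CUDA version such as 12040 into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports(driver_version: int, toolkit_version: int) -> bool:
    """Simplified backward-compatibility check: the driver must be at least as
    new as the toolkit the application was built with. Minor-version
    compatibility and forward-compatibility packages are ignored."""
    return decode_cuda_version(driver_version) >= decode_cuda_version(toolkit_version)


if __name__ == "__main__":
    print(decode_cuda_version(12040))     # (12, 4)
    print(driver_supports(12040, 11080))  # True: 12.4 driver, app built with 11.8
    print(driver_supports(11080, 12040))  # False: 11.8 driver, app built with 12.4
```

In a real application, the two integers would come from those runtime calls rather than being hard-coded.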
General CUDA:
Access-counter-based memory migration for Grace Hopper systems is now enabled by default. Because this is the first release with the capability enabled, developers may find that applications optimized for the earlier memory migration algorithms see a performance regression. Should this occur, a supported but temporary flag is provided to opt out of this behavior. You can control the enablement of this feature by unloading and reloading the NVIDIA UVM driver, as follows:
This release introduces support for the following new features in CUDA graphs:
- Graph conditional nodes (enhanced from 12.3)
- Device-side node parameter update for device graphs
- Updatable graph node priorities without recompilation
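Conditional nodes let a graph decide at run time whether (IF) or how many times (WHILE) a body graph executes, with device code able to update the condition. The host-only Python simulation below is a hypothetical model of those semantics; `ConditionHandle`, `run_if_node`, and `run_while_node` are illustrative names, not CUDA APIs.

```python
# Hypothetical host-only model of CUDA graph conditional-node semantics.
# No CUDA APIs are called; every name here is illustrative.
from typing import Callable


class ConditionHandle:
    """Stands in for the condition value that a conditional node reads."""

    def __init__(self, default: int = 0) -> None:
        self.value = default


Body = Callable[[ConditionHandle], None]


def run_if_node(handle: ConditionHandle, body: Body) -> None:
    # IF: the body graph executes once when the condition is nonzero.
    if handle.value != 0:
        body(handle)


def run_while_node(handle: ConditionHandle, body: Body) -> None:
    # WHILE: the condition is re-checked before every iteration, and the body
    # (device code, on a real GPU) may set it to zero to end the loop.
    while handle.value != 0:
        body(handle)


if __name__ == "__main__":
    iterations = []

    def body(h: ConditionHandle) -> None:
        iterations.append(len(iterations) + 1)
        if len(iterations) == 3:
            h.value = 0  # the body decides to stop looping

    run_while_node(ConditionHandle(default=1), body)
    print(iterations)  # [1, 2, 3]
```

The point of the model is that the loop decision lives inside the graph, so the host does not need to relaunch work between iterations.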
Enhanced monitoring capabilities through NVML and nvidia-smi:
- NVJPG and NVOFA utilization percentage
- PCIe class and subclass reporting
- dmon reports are now available in CSV format
- More descriptive error codes returned from NVML
- dmon now reports gpm-metrics for MIG (that is, nvidia-smi dmon --gpm-metrics runs in MIG mode)
- NVML running against older drivers will report FUNCTION_NOT_FOUND in some cases, failing gracefully if NVML is newer than the driver
- NVML APIs to query protected memory information for Hopper Confidential Computing
- This release introduces nvFatbin, a new library to create CUDA fat binary files at runtime.
Confidential Computing General Access:
- Starting in 12.4 with R550.54.14, Confidential Computing on Hopper moves to General Access for discrete GPU usage.
- All EA RIM certificates prior to this release will be revoked with status PrivilegeWithdrawn 30 days after posting.
CUDA Compilers:
For changes to PTX, refer to
- Added the __maxnreg__ kernel function qualifier to allow users to directly specify the maximum number of registers to be allocated to a single thread in a thread block in CUDA C++.
- Added a new flag, -fdevice-syntax-only, that ends device compilation after front-end syntax checking. Because it does not invoke the optimizer, this option can provide rapid feedback (warnings and errors) on source code changes. Note: this option will not generate valid object code.
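One reason to cap registers with __maxnreg__ is occupancy: fewer registers per thread can allow more threads to be resident on a streaming multiprocessor (SM), at the risk of spilling to local memory. The back-of-the-envelope Python sketch below assumes a 65,536-register file and a 2,048-thread limit per SM (typical of recent data-center GPUs, but not stated in these notes) and ignores allocation granularity, block size, and shared memory.

```python
# Rough upper bound on resident threads per SM from register usage alone.
# Assumed hardware limits (not from the release notes); register allocation
# granularity, block size, and shared-memory limits are ignored.
REGISTERS_PER_SM = 65_536
MAX_THREADS_PER_SM = 2_048


def max_threads_by_registers(regs_per_thread: int) -> int:
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(MAX_THREADS_PER_SM, REGISTERS_PER_SM // regs_per_thread)


if __name__ == "__main__":
    for regs in (32, 64, 128):
        print(regs, max_threads_by_registers(regs))
```

Halving the per-thread register budget roughly doubles this bound until the thread limit is reached, which is the trade-off a __maxnreg__ value controls.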
Added a new flag, -minimal, for NVRTC compilation. The -minimal flag omits certain language features to reduce compile time for small programs. In particular, the following are omitted:
- Texture and surface functions and associated types (for example, cudaTextureObject_t).
- CUDA Runtime functions that are provided by the cudadevrt device code library, typically named with the prefix “cuda”, for example, cudaMalloc.
- Kernel launch from device code.
- Types and macros associated with the CUDA Runtime and Driver APIs, provided by cuda/tools/cudart/driver_types.h, typically named with the prefix “cuda”, for example, cudaError_t.
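Before opting in to -minimal, it may help to check whether a program uses any of the omitted features. The Python sketch below is a hypothetical, regex-based heuristic over the source text, not part of NVRTC. It will miss texture intrinsics that lack a `cuda` prefix and may report false positives.

```python
# Hypothetical pre-check for NVRTC's -minimal flag: a plain regex scan of the
# source for features the release notes say -minimal omits. Not a parser, so
# it can miss cases (e.g. tex2D calls) and report false positives.
import re
from typing import List

OMITTED_UNDER_MINIMAL = {
    "texture/surface object types": re.compile(r"\bcuda(Texture|Surface)Object_t\b"),
    "CUDA Runtime call (cuda* prefix)": re.compile(r"\bcuda[A-Z]\w*\s*\("),
    "kernel launch from device code": re.compile(r"<<<"),
    "CUDA Runtime/Driver types (e.g. cudaError_t)": re.compile(r"\bcuda\w*_t\b"),
}


def minimal_mode_conflicts(source: str) -> List[str]:
    """Return labels of omitted features that appear to be used in `source`."""
    return [label for label, pattern in OMITTED_UNDER_MINIMAL.items()
            if pattern.search(source)]


if __name__ == "__main__":
    kernel = "__global__ void scale(float* x) { x[threadIdx.x] *= 2.0f; }"
    print(minimal_mode_conflicts(kernel))  # []: no omitted features detected
```

An empty result suggests, but does not guarantee, that the program is a candidate for -minimal; the compiler remains the final check.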
- Starting in CUDA 12.4, PTXAS enables position-independent code (-pic) by default when the compilation mode is whole-program compilation. Users can opt out by passing the -pic=false option to PTXAS. Debug compilation and separate compilation continue to have position-independent code disabled by default. In the future, position-independent code will allow the CUDA driver to share a single copy of the text section across contexts and reduce resident memory usage.
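The PTXAS default described above reduces to a small decision rule. The following Python function is a hypothetical model of it: the function and parameter names are illustrative, not a PTXAS interface, and `opt_out` stands for passing -pic=false.

```python
# Hypothetical model of the CUDA 12.4 PTXAS default for position-independent
# code: on for whole-program compilation, off for debug and separate
# compilation, and off whenever the user passes -pic=false.
def pic_enabled(whole_program: bool, debug: bool, opt_out: bool = False) -> bool:
    if opt_out:           # user specified -pic=false
        return False
    if debug:             # debug compilation keeps PIC disabled by default
        return False
    return whole_program  # separate compilation (whole_program=False) stays off


if __name__ == "__main__":
    print(pic_enabled(whole_program=True, debug=False))                # True
    print(pic_enabled(whole_program=False, debug=False))               # False
    print(pic_enabled(whole_program=True, debug=False, opt_out=True))  # False
```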
CUDA Developer Tools:
- For changes to nvprof and Visual Profiler, see the changelog.
- For new features, improvements, and bug fixes in Nsight Systems, see the changelog.
- For new features, improvements, and bug fixes in Nsight Visual Studio Edition, see the changelog.
- For new features, improvements, and bug fixes in CUPTI, see the changelog.
- For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
- For new features, improvements, and bug fixes in Compute Sanitizer, see the changelog.
- For new features, improvements, and bug fixes in CUDA-GDB, see the changelog.
Resolved Issues:
General CUDA:
- Fixed a compiler crash that could occur when inputs to MMA instructions were used before being initialized.
CUDA Compilers:
- In certain cases, dp4a or dp2a instructions would be generated in PTX and cause incorrect behavior due to integer overflow. This has been fixed in CUDA 12.4.
Deprecated or Dropped Features:
- Features deprecated in the current release of the CUDA software still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
Deprecated Architectures:
- CUDA Toolkit 12.4 deprecates NVIDIA CUDA support for the PowerPC architecture. Support for this architecture is considered deprecated and will be removed in an upcoming release.
Deprecated Operating Systems:
- CUDA Toolkit 12.4 deprecates support for Red Hat Enterprise Linux 7 and CentOS 7. Support for these operating systems will be removed in an upcoming release.
Deprecated Toolchains:
CUDA Toolkit 12.4 deprecates support for the following host compilers:
- Microsoft Visual C/C++ (MSVC) 2017
- All GCC versions prior to GCC 7.3
CUDA Libraries:
- This section covers CUDA Libraries release notes for 12.x releases.
- CUDA Math Libraries toolchain uses C++11 features, and a C++11-compatible standard library (libstdc++ >= 20150422) is required on the host.
Support for the following compute capabilities is removed for all libraries:
- sm_35 (Kepler)
- sm_37 (Kepler)
cuBLAS: Release 12.4:
New Features:
cuBLAS adds experimental APIs to support grouped batched GEMM for single precision and double precision. Single precision also supports the math mode CUBLAS_TF32_TENSOR_OP_MATH. Grouped batch mode allows you to concurrently solve GEMMs of different dimensions (m, n, k), leading dimensions (lda, ldb, ldc), transpositions (transa, transb), and scaling factors (alpha, beta). Please see cublas<t>gemmGroupedBatched for details.
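To make the grouping concrete, the pure-Python reference model below computes C = alpha * op(A) * op(B) + beta * C for every problem, with alpha, beta, and the transpose flags shared within a group and every problem in a group having the same shape. It is a hypothetical illustration of the semantics only: it uses row-major nested lists (cuBLAS is column-major), does not model leading dimensions, and `Group` and `gemm_grouped_batched` are not cuBLAS APIs.

```python
# Hypothetical pure-Python reference model of grouped batched GEMM semantics.
# Row-major nested lists are used for readability; cuBLAS itself is
# column-major, and leading dimensions are not modeled.
from dataclasses import dataclass, field
from typing import List, Tuple

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def gemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix,
         trans_a: bool = False, trans_b: bool = False) -> Matrix:
    """Return alpha * op(A) @ op(B) + beta * C without modifying C."""
    a = transpose(a) if trans_a else a
    b = transpose(b) if trans_b else b
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of op(A) and op(B) must match")
    return [[alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
             for j in range(n)] for i in range(m)]


@dataclass
class Group:
    """One group: shared alpha, beta, and transpose flags; same-shape problems."""
    alpha: float
    beta: float
    trans_a: bool = False
    trans_b: bool = False
    problems: List[Tuple[Matrix, Matrix, Matrix]] = field(default_factory=list)


def gemm_grouped_batched(groups: List[Group]) -> List[List[Matrix]]:
    # Every problem in every group is independent, which is what lets a GPU
    # implementation run them concurrently.
    return [[gemm(g.alpha, a, b, g.beta, c, g.trans_a, g.trans_b)
             for a, b, c in g.problems] for g in groups]


if __name__ == "__main__":
    g1 = Group(1.0, 0.0, problems=[([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[0, 0], [0, 0]])])
    g2 = Group(2.0, 1.0, trans_a=True, problems=[([[1], [2], [3]], [[1], [1], [1]], [[10]])])
    print(gemm_grouped_batched([g1, g2]))  # [[[[19.0, 22.0], [43.0, 50.0]]], [[[22.0]]]]
```

Compared with issuing one batched call per shape, a single grouped call carries all groups at once; the model makes that structure explicit without claiming anything about cuBLAS internals.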
cuFFT: Release 12.4:
New Features:
- Added Just-In-Time Link-Time Optimized (JIT LTO) kernels for improved performance in FFTs with 64-bit indexing
- Added per-plan properties to the cuFFT API. These new routines can be leveraged to give users more control over the behavior of cuFFT. Currently they can be used to enable JIT LTO kernels for 64-bit FFTs.
- Improved accuracy for certain single-precision (fp32) FFT cases, especially involving FFTs for larger sizes
Resolved Issues:
- Fixed an issue that could cause overwriting of user data when performing out-of-place real-to-complex (R2C) transforms with user-specified output strides (i.e. using the ostride component of the Advanced Data Layout API).
- Fixed inconsistent behavior between libcufftw and FFTW when both inembed and onembed are nullptr / NULL. From now on, as in FFTW, passing nullptr / NULL as inembed/onembed parameter is equivalent to passing n, that is, the logical size for that dimension.
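The layout rule in the second fix can be written out as an offset computation. The Python sketch below follows FFTW's advanced-interface addressing (as used by `fftw_plan_many_dft`), which libcufftw now matches when `inembed`/`onembed` is NULL. `element_offset` is a hypothetical helper, and offsets are in elements rather than bytes.

```python
# Hypothetical helper illustrating FFTW-style advanced-layout addressing.
# A missing (None / NULL) embed array is treated as n, the logical sizes.
from math import prod
from typing import Optional, Sequence


def element_offset(index: Sequence[int], n: Sequence[int],
                   embed: Optional[Sequence[int]], stride: int,
                   dist: int, batch: int) -> int:
    """Offset, in elements, of multi-index `index` in transform `batch`."""
    dims = list(n) if embed is None else list(embed)
    linear = 0
    for axis, i in enumerate(index):
        # Row-major: scale each index by the embedded sizes of later axes
        # (embed[0] therefore never affects the result).
        linear += i * prod(dims[axis + 1:])
    return batch * dist + stride * linear


if __name__ == "__main__":
    n = [4, 6]
    print(element_offset([2, 3], n, None, 1, 24, 0))    # 15: same as embed == n
    print(element_offset([2, 3], n, [4, 8], 1, 32, 0))  # 19: padded rows of 8
```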
cuSOLVER: Release 12.4:
New Features:
- cusolverDnXlarft and cusolverDnXlarft_bufferSize APIs were introduced. cusolverDnXlarft forms the triangular factor of a real block reflector, while cusolverDnXlarft_bufferSize returns its required workspace sizes in bytes.
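For reference, the triangular factor T satisfies H = H_1 H_2 ... H_k = I - V T V^T, where each H_i = I - tau_i v_i v_i^T. The Python sketch below computes T with the forward, column-wise recurrence used by LAPACK's xLARFT. It is an illustration, not the cuSOLVER interface, and for simplicity it stores each reflector vector in full rather than in LAPACK's implicit unit-lower-trapezoidal form.

```python
# Illustrative pure-Python version of the forward, column-wise xLARFT
# recurrence. Each reflector vector is stored in full (LAPACK instead keeps
# V unit lower trapezoidal with an implicit leading 1).
from typing import List

Matrix = List[List[float]]


def larft_forward(v_cols: List[List[float]], tau: List[float]) -> Matrix:
    """Return upper-triangular T with H_1 H_2 ... H_k = I - V T V^T."""
    k = len(tau)
    t = [[0.0] * k for _ in range(k)]
    for i in range(k):
        t[i][i] = tau[i]
        # w = V[:, :i]^T v_i, the overlaps of earlier reflectors with v_i.
        w = [sum(x * y for x, y in zip(v_cols[j], v_cols[i])) for j in range(i)]
        # T[:i, i] = -tau_i * T[:i, :i] @ w  (columns < i are already final).
        for r in range(i):
            t[r][i] = -tau[i] * sum(t[r][c] * w[c] for c in range(i))
    return t
```

A quick way to check the result is to multiply out H_1 H_2 ... H_k explicitly and compare it with I - V T V^T; the compact form is what lets a block of reflectors be applied with matrix-matrix products.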
cuSPARSE: Release 12.4:
New Features:
- Added a preprocessing step for sparse matrix-vector multiplication, cusparseSpMV_preprocess()
- Added support for mixed real and complex types for cusparseSpMM()
- Added a new API, cusparseSpSM_updateMatrix(), to update the sparse matrix between the analysis and solving phases of cusparseSpSM()
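As a point of reference for the new preprocessing step, the operation that cusparseSpMV() performs is y = alpha * op(A) * x + beta * y. The pure-Python CSR sketch below shows that computation for op(A) = A. It does not model cusparseSpMV_preprocess() or any cuSPARSE data structures, and `csr_spmv` is a hypothetical name.

```python
# Hypothetical reference CSR sparse matrix-vector product:
# y_out = alpha * A @ x + beta * y, with A given as (row_ptr, col_ind, values).
from typing import List


def csr_spmv(alpha: float, row_ptr: List[int], col_ind: List[int],
             values: List[float], x: List[float], beta: float,
             y: List[float]) -> List[float]:
    """Return alpha * A @ x + beta * y for A stored in CSR format."""
    out = []
    for row in range(len(row_ptr) - 1):
        # Non-zeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        acc = sum(values[p] * x[col_ind[p]]
                  for p in range(row_ptr[row], row_ptr[row + 1]))
        out.append(alpha * acc + beta * y[row])
    return out


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0]]
    print(csr_spmv(2.0, [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0], 1.0, [10.0, 20.0]))  # [16.0, 26.0]
```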
Resolved Issues:
- cusparseSpVV() provided incorrect results when the sparse vector had many non-zeros.
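For context, cusparseSpVV() computes the dot product of a sparse vector with a dense vector. A minimal pure-Python reference is shown below; the name `sparse_dot` is hypothetical, and the conjugate-transpose option is not modeled.

```python
# Hypothetical reference for a sparse-dense dot product: the sparse vector is
# given as parallel index/value lists, and only its non-zeros contribute.
from typing import List


def sparse_dot(x_ind: List[int], x_val: List[float], y: List[float]) -> float:
    return sum(v * y[i] for i, v in zip(x_ind, x_val))


if __name__ == "__main__":
    print(sparse_dot([0, 3], [2.0, 4.0], [1.0, 0.0, 0.0, 0.5]))  # 4.0
```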
CUDA Math: Release 12.4:
Resolved Issues:
- Host-specific code in the cuda_fp16/bf16 headers is now free from type punning and works correctly in the presence of optimizations based on strict-aliasing rules.
NPP: Release 12.4:
New Features:
- Enhanced large file support with size_t