Powerful and reliable programming model and computing toolkit

NVIDIA CUDA Toolkit

NVIDIA CUDA Toolkit 11.3.0 (for Windows 10)

  -  2.7 GB  -  Freeware

Sometimes the latest version of an application can cause issues when installed on older devices or on devices running an older version of the operating system.

Software makers usually fix these issues, but it can take them some time. In the meantime, you can download and install an older version such as NVIDIA CUDA Toolkit 11.3.0 (for Windows 10).


For those interested in downloading the most recent release of NVIDIA CUDA Toolkit or reading our review, simply click here.


All old versions distributed on our website are completely virus-free and available for download at no cost.


We would love to hear from you

If you have any questions or ideas that you want to share with us - head over to our Contact page and let us know. We value your feedback!

What's new in this version:

CUDA Toolkit Major Component Versions:
CUDA Components:
- Starting with CUDA 11, the various components in the toolkit are versioned independently

CUDA Driver:
- Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 2 for more information on the various GPU products that are CUDA capable
- Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of the CUDA Toolkit will continue to work on subsequent (later) driver releases.
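A minimal host-side check of this driver/runtime relationship can be sketched as follows (assumes the CUDA Toolkit is installed; compile with nvcc on a system with the NVIDIA driver present):

```cpp
// Sketch: compare the CUDA version the installed driver supports with the
// version of the CUDA runtime the application was built against.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version the driver supports
    cudaRuntimeGetVersion(&runtimeVersion);  // version of the linked CUDA runtime
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 1000) / 10,
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
    if (driverVersion < runtimeVersion)
        printf("Driver is older than the runtime; update the driver.\n");
    return 0;
}
```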

General CUDA:
- Stream ordered memory allocator enhancements

CUDA Graph Enhancements:
- Enhancements to make stream capture more flexible: Functionality to provide read-write access to the graph and the dependency information of a capturing stream, while the capture is in progress. See cudaStreamGetCaptureInfo_v2() and cudaStreamUpdateCaptureDependencies().
- User object lifetime assistance: Functionality to assist user code in lifetime management for user-allocated resources referenced in graphs. Useful when graphs and their derivatives and asynchronous executions have an unknown/unbounded lifetime not under control of the code that created the resource, such as libraries under stream capture. See cudaUserObjectCreate() and cudaGraphRetainUserObject()
- Graph Debug: New API to produce a DOT graph output from a given CUDA Graph
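The new Graph Debug API can be exercised with a trivial, manually built graph; the sketch below (error checking omitted, requires the CUDA 11.3 runtime) writes the graph's structure to a Graphviz DOT file:

```cpp
// Sketch: dump a CUDA graph to "graph.dot" with the new Graph Debug API.
#include <cuda_runtime.h>

int main() {
    // Build a trivial graph: a single empty node with no dependencies.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaGraphNode_t node;
    cudaGraphAddEmptyNode(&node, graph, nullptr, 0);

    // New in 11.3: write the graph's nodes and edges in DOT format, viewable
    // with Graphviz (e.g. `dot -Tpng graph.dot -o graph.png`).
    cudaGraphDebugDotPrint(graph, "graph.dot", cudaGraphDebugDotFlagsVerbose);

    cudaGraphDestroy(graph);
    return 0;
}
```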

New Stream Priorities:
- The CUDA Driver API cuCtxGetStreamPriorityRange() now exposes a total of 6 stream priorities, up from the 3 exposed in prior releases
Driver Symbols in Runtime API:
- New CUDA Driver API cuGetProcAddress() and CUDA Runtime API cudaGetDriverEntryPoint() to query the addresses of CUDA Driver API functions
- Support for virtual aliasing across kernel boundaries
- Added support for Ubuntu 20.04.2 on x86_64 and Arm sbsa platforms
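The driver-symbol query can be sketched as follows (illustrative only; links against the driver library with -lcuda, and the version argument 11030 assumes the 11.3 driver):

```cpp
// Sketch: look up a driver API entry point by name with cuGetProcAddress().
#include <cstdio>
#include <cuda.h>

int main() {
    cuInit(0);

    // Ask the driver for the address of cuDeviceGetCount, as exposed for
    // CUDA version 11.3 (encoded as 11030).
    void* fn = nullptr;
    CUresult res = cuGetProcAddress("cuDeviceGetCount", &fn, 11030,
                                    CU_GET_PROC_ADDRESS_DEFAULT);
    if (res == CUDA_SUCCESS && fn != nullptr) {
        int count = 0;
        reinterpret_cast<CUresult (*)(int*)>(fn)(&count);
        printf("CUDA devices: %d\n", count);
    }
    return 0;
}
```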

CUDA Tools:
CUDA Compilers:
- cu++filt demangler tool
- NVRTC versioning changes
- Preview support for alloca()

Nsight Eclipse Plugin:
- Eclipse versions 4.10 to 4.14 are currently supported in CUDA 11.3

CUDA Libraries:
cuFFT Library:
- cuFFT shared libraries are now linked statically against libstdc++ on Linux platforms
- Improved performance of certain sizes (multiples of large powers of 3, powers of 11) on SM86

cuSPARSE Library:
- Added the new routine cusparseSpSV, a sparse triangular solver with better performance. The new Generic API supports:
- CSR storage format
- Non-transpose, transpose, and transpose-conjugate operations
- Upper, lower fill mode
- Unit, non-unit diagonal type
- 32-bit and 64-bit indices
- Uniform data type computation
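The SpSV workflow (buffer sizing, analysis, solve) can be sketched as below for a lower-triangular CSR system; the helper name is illustrative, error checking is omitted, and the device pointers are assumed to already hold valid data (link with -lcusparse):

```cpp
// Sketch: solve A·x = alpha·b for a lower-triangular CSR matrix A with the
// new cusparseSpSV generic API.
#include <cuda_runtime.h>
#include <cusparse.h>

void solve_lower_triangular(cusparseHandle_t handle,
                            int n, int nnz,
                            int* d_csrRowPtr, int* d_csrColInd, float* d_csrVal,
                            float* d_b, float* d_x) {
    float alpha = 1.0f;

    // Describe the sparse matrix and mark it lower-triangular, non-unit diagonal.
    cusparseSpMatDescr_t matA;
    cusparseCreateCsr(&matA, n, n, nnz, d_csrRowPtr, d_csrColInd, d_csrVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseFillMode_t fill = CUSPARSE_FILL_MODE_LOWER;
    cusparseDiagType_t diag = CUSPARSE_DIAG_TYPE_NON_UNIT;
    cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_FILL_MODE, &fill, sizeof(fill));
    cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_DIAG_TYPE, &diag, sizeof(diag));

    cusparseDnVecDescr_t vecB, vecX;
    cusparseCreateDnVec(&vecB, n, d_b, CUDA_R_32F);  // right-hand side
    cusparseCreateDnVec(&vecX, n, d_x, CUDA_R_32F);  // solution

    // Buffer sizing, analysis, then the actual triangular solve.
    cusparseSpSVDescr_t spsv;
    cusparseSpSV_createDescr(&spsv);
    size_t bufSize = 0;
    cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                            matA, vecB, vecX, CUDA_R_32F,
                            CUSPARSE_SPSV_ALG_DEFAULT, spsv, &bufSize);
    void* dBuf = nullptr;
    cudaMalloc(&dBuf, bufSize);
    cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                          matA, vecB, vecX, CUDA_R_32F,
                          CUSPARSE_SPSV_ALG_DEFAULT, spsv, dBuf);
    cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                       matA, vecB, vecX, CUDA_R_32F,
                       CUSPARSE_SPSV_ALG_DEFAULT, spsv);

    cudaFree(dBuf);
    cusparseSpSV_destroyDescr(spsv);
    cusparseDestroyDnVec(vecB);
    cusparseDestroyDnVec(vecX);
    cusparseDestroySpMat(matA);
}
```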

NVIDIA Performance Primitives (NPP):
- Added nppiDistanceTransformPBA functions

Deprecated Features:
- The following features are deprecated in the current release of the CUDA software. The features still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.

CUDA Libraries:
- cuSPARSE: cusparseScsrsv2_analysis, cusparseScsrsv2_solve, cusparseXcsrsv2_zeroPivot, and cusparseScsrsv2_bufferSize have been deprecated in favor of cusparseSpSV

Tools:
- Nsight Eclipse Plugin: Docker support is deprecated in Eclipse 4.14 and earlier versions as of CUDA 11.3, and Docker support will be dropped for Eclipse 4.14 and earlier in a future CUDA Toolkit release.

Resolved Issues:
General CUDA:
- Historically, the CUDA driver has serialized most APIs operating on the same CUDA context between CPU threads. In CUDA 11.3, this has been relaxed for kernel launches such that the driver serialization may be reduced when multiple CPU threads are launching CUDA kernels into distinct streams within the same context.

cuRAND Library:
- Fixed inconsistency between random numbers generated by GPU and host generators when CURAND_ORDERING_PSEUDO_LEGACY ordering is selected for certain generator types

CUDA Math API:
- Previous releases of CUDA were potentially delivering incorrect results in some Linux distributions for the following host Math APIs: sinpi, cospi, sincospi, sinpif, cospif, sincospif. When passed huge inputs like 7.3748776e+15 or 8258177.5, the results were not equal to 0 or 1. These have been corrected in this release.

Known Issues:
cuBLAS Library:
- The planar complex matrix descriptor for batched matmul has inconsistent interpretation of batch offset
- Mixed-precision operations with the reduction scheme CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE (which might also be selected automatically based on problem size by cublasSgemmEx() or cublasGemmEx(), unless the CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION math mode bit is set) not only store intermediate results in the output type but also accumulate them internally in the same precision, which may result in lower than expected accuracy. Please use CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK or CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION if this results in numerical precision issues in your application.
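Setting the math-mode bit mentioned above can be sketched as follows (illustrative helper name; link with -lcublas, error checking omitted):

```cpp
// Sketch: OR CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION into the math
// mode so cublasGemmEx() and friends never accumulate intermediates in the
// (lower) output precision.
#include <cublas_v2.h>

void disallow_reduced_precision_reduction(cublasHandle_t handle) {
    cublasSetMathMode(handle,
        static_cast<cublasMath_t>(CUBLAS_DEFAULT_MATH |
                                  CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION));
}
```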

cuFFT Library:
- cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications
- Plans with strides, primes larger than 127 in the FFT size decomposition, and a total transform size (including strides) bigger than 32GB produce incorrect results

cuSOLVER Library:
- For values N<=16, cusolverDn[S|D|C|Z]syevjBatched hits an out-of-bounds access and may deliver the wrong result. The workaround is to pad the matrix A with a diagonal matrix D such that the dimension of [A 0; 0 D] is bigger than 16. Each diagonal entry D(j,j) must be bigger than the maximum eigenvalue of A, for example norm(A, 'fro'). After syevj, W(0:n-1) contains the eigenvalues and A(0:n-1,0:n-1) contains the eigenvectors.
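The padding workaround above can be sketched on the host as follows (hypothetical helper name; column-major storage as cuSOLVER expects, with the Frobenius norm used as a safe upper bound on the maximum eigenvalue of a symmetric A):

```cpp
// Sketch: embed an n×n column-major matrix A (n <= 16) into a p×p matrix
// [A 0; 0 D] with p > 16, where each D(j,j) exceeds A's largest eigenvalue.
#include <cmath>
#include <vector>

std::vector<double> padForSyevjBatched(const std::vector<double>& A, int n, int p) {
    // The Frobenius norm bounds the spectral radius of a symmetric matrix.
    double fro = 0.0;
    for (double v : A) fro += v * v;
    fro = std::sqrt(fro);

    std::vector<double> P(static_cast<size_t>(p) * p, 0.0);  // zero-initialized
    for (int j = 0; j < n; ++j)                              // copy A into top-left block
        for (int i = 0; i < n; ++i)
            P[static_cast<size_t>(j) * p + i] = A[static_cast<size_t>(j) * n + i];
    for (int j = n; j < p; ++j)                              // diagonal padding block D
        P[static_cast<size_t>(j) * p + j] = fro + 1.0;       // D(j,j) > max eigenvalue
    return P;
}
```

The padded matrix is then passed to syevjBatched in place of A, and only the first n eigenvalues/eigenvectors of the result are kept.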