Powerful and reliable programming model and computing toolkit

NVIDIA CUDA Toolkit

  -  3.2 GB  -  Freeware
  • Latest Version

    NVIDIA CUDA Toolkit 12.8.0 (for Windows 11) LATEST

  • Review by

    Daniel Leblanc

  • Operating System

    Windows 11

  • Author / Product

    NVIDIA Corporation

  • Filename

    cuda_12.8.0_571.96_windows.exe

NVIDIA CUDA Toolkit provides a development environment for creating high-performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.

GPU-accelerated CUDA libraries enable drop-in acceleration across multiple domains such as linear algebra, image and video processing, deep learning, and graph analytics. For developing custom algorithms, you can use available integrations with commonly used languages and numerical packages, as well as well-documented development APIs.

Your CUDA applications can be deployed across all NVIDIA GPU families available on-premise and on GPU instances in the cloud. Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single GPU workstations to cloud installations with thousands of GPUs.

The toolkit also includes an IDE with graphical and command-line tools for debugging, identifying performance bottlenecks on the GPU and CPU, and obtaining context-sensitive optimization guidance. You can develop applications in a programming language you already know, including C, C++, Fortran, and Python.

To get started, browse the online getting-started resources, optimization guides, and illustrative examples, and collaborate with the rapidly growing developer community. Download NVIDIA CUDA Toolkit for PC today!

Features and Highlights
  • GPU Timestamp: Start timestamp
  • Method: GPU method name. This is either "memcpy*" for memory copies or the name of a GPU kernel. Memory copies carry a suffix describing the type of memory transfer, e.g. "memcpyDtoHasync" means an asynchronous transfer from device memory to host memory
  • GPU Time: Execution time of the method on the GPU
  • CPU Time: The sum of GPU time and the CPU overhead to launch the method. At the driver-generated data level, CPU time is only the CPU overhead to launch the method for non-blocking methods; for blocking methods it is the sum of GPU time and CPU overhead. All kernel launches are non-blocking by default, but become blocking if any profiler counters are enabled. Asynchronous memory copy requests in different streams are non-blocking
  • Stream Id: Identification number for the stream
  • Columns shown only for kernel methods:
  • Occupancy: The ratio of active warps per multiprocessor to the maximum number of active warps
  • Profiler counters: Refer to the profiler counters section for the list of supported counters
  • grid size: Number of blocks in the grid along the X, Y, and Z dimensions, shown as [num_blocks_X num_blocks_Y num_blocks_Z] in a single column
  • block size: Number of threads in a block along the X, Y, and Z dimensions, shown as [num_threads_X num_threads_Y num_threads_Z] in a single column
  • dyn smem per block: Dynamic shared memory size per block in bytes
  • sta smem per block: Static shared memory size per block in bytes
  • reg per thread: Number of registers per thread
  • Columns shown only for memcpy methods:
  • mem transfer size: Memory transfer size in bytes
  • host mem transfer type: Specifies whether a memory transfer uses "Pageable" or "Page-locked" memory

What's new in this version:

New Features:
This release adds compiler support for the following NVIDIA Blackwell GPU architectures:
- SM_100
- SM_101
- SM_120

Tegra-Specific:
- Added MPS support for DRIVE OS QNX
- Added support for GCC 13.2.0
- Added support for Unified Virtual Memory (UVM) with Extended GPU Memory (EGM) arrays

Hopper Confidential Computing:
- Added multi-GPU support for protected PCIe mode
- Added key rotation capability for single GPU passthrough mode

NVML Updates:
- Fixed per-process memory usage reporting for Docker containers using Open GPU Kernel Module drivers
- Added support for DRAM encryption query and control (Blackwell)
- Added checkpoint/restore functionality for userspace applications
- Added support for Blackwell reduced bandwidth mode (RBM)

CUDA Graphs:
- Added conditional execution features for CUDA Graphs:
  - ELSE graph support for IF nodes
  - SWITCH node support
- Introduced additional performance optimizations

CUDA Usermode Driver (UMD):
- Added PCIe device ID to CUDA device properties
- Added cudaStreamGetDevice and cuStreamGetDevice APIs to retrieve the device associated with a CUDA stream
- Added CUDA support for the INT101010 texture/surface format

Userspace Checkpoint and Restore:
- Added cross-system process migration support to enable process restoration on a computer different from the one where it was checkpointed
- Added a new driver API for checkpoint/restore operations
- Added batch CUDA asynchronous memory copy APIs (cuMemcpyBatchAsync and cuMemcpyBatch3DAsync) for variable-sized transfers between multiple source and destination buffers

CUDA Compiler:
Added two new nvcc flags:
- static-global-template-stub {true|false}: Controls host side linkage for global/device/constant/managed templates in whole program mode
- device-entity-has-hidden-visibility {true|false}: Controls ELF visibility of global/device/constant/managed symbols
- The current default for both flags is false; these defaults will change to true in a future release. For detailed information about these flags and their impact on existing programs, refer to nvcc --help or the online CUDA documentation.
- libNVVM now supports compilation for the Blackwell family of architectures. Compilation of compute capabilities compute_100 and greater (Blackwell and future architectures) uses an updated NVVM IR dialect, based on LLVM 18.1.8 IR (the “modern” dialect) that differs from the older dialect used for pre-Blackwell architectures (a compute capability less than compute_100). NVVM IR bitcode using the older dialect generated for pre-Blackwell architectures can be used to target Blackwell and later architectures, with the exception of debug metadata.
- nvdisasm now supports emitting JSON-formatted SASS disassembly

CUDA Developer Tools:
- For changes to nvprof and Visual Profiler, see the changelog
- For new features, improvements, and bug fixes in Nsight Systems, see the changelog
- For new features, improvements, and bug fixes in Nsight Visual Studio Edition, see the changelog
- For new features, improvements, and bug fixes in CUPTI, see the changelog
- For new features, improvements, and bug fixes in Nsight Compute, see the changelog
- For new features, improvements, and bug fixes in Compute Sanitizer, see the changelog
- For new features, improvements, and bug fixes in CUDA-GDB, see the changelog

Fixed:
CUDA Compiler:
- Resolved compilation issues where code that successfully built with GCC would fail to compile with NVCC on Ubuntu 24.04. This improves cross-compiler compatibility and ensures consistent behavior between GCC and NVIDIA’s CUDA compiler toolchain
- Fixed incorrect handling of C++20 requires expressions, restoring proper functionality and standard compliance. This ensures that compile-time requirements on template parameters now evaluate correctly
- Fixed an issue where NVCC (NVIDIA Compiler Driver) was ignoring the global namespace prefix of a type and thus incorrectly resolving it to a local type that shares the same name
- Fixed a compilation error in NVCC that occurred when code contained three or more nested lambda expressions with variadic arguments. The compiler now properly handles deeply nested variadic lambdas
- Fixed a limitation in NVRTC that caused compilation failures when kernel functions had long identifiers. The runtime compiler now properly handles kernel functions with extended name lengths
- Resolved an issue where template alias resolution could produce incorrect template instances. Previously, when an alias template and its underlying type-id template had different default arguments, the compiler would sometimes incorrectly omit the differing default argument when substituting the alias with its underlying type. This resulted in references to incorrect template instances. The template argument resolution now properly preserves all necessary default arguments during alias substitution
- Fixed invalid error reporting when using variables as template arguments from outside their visible scope. This resolves incorrect diagnostic messages particularly affecting cases involving braced initializers. The compiler now properly validates scope accessibility for template arguments
- Added the ability to cancel ongoing NVRTC compilations through callback mechanisms. This new feature allows developers to safely interrupt and terminate compilation processes programmatically
- The semantics of the -expt-relaxed-constexpr nvcc flag are now documented in the “C++ Language Support” section of the CUDA Programming Guide