Powerful and reliable programming model and computing toolkit

NVIDIA CUDA Toolkit

  -  3.3 GB  -  Freeware
  • Latest Version

    NVIDIA CUDA Toolkit 12.9.0 (for Windows 11) LATEST

  • Review by

    Daniel Leblanc

  • Operating System

    Windows 11

  • Author / Product

    NVIDIA Corporation

  • Filename

    cuda_12.9.0_576.02_windows.exe

NVIDIA CUDA Toolkit provides a development environment for creating high-performance GPU-accelerated applications.

With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers.



The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.

GPU-accelerated CUDA libraries enable drop-in acceleration across multiple domains such as linear algebra, image and video processing, deep learning, and graph analytics. For developing custom algorithms, you can use available integrations with commonly used languages and numerical packages as well as well-published development APIs.
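As a sketch of what "drop-in" library acceleration looks like in practice, the following hypothetical example offloads a single-precision AXPY (y = a*x + y) to the GPU through cuBLAS; error checking is omitted for brevity:

```cuda
// Sketch: drop-in linear-algebra acceleration via cuBLAS (error handling omitted).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1 << 20;
    const float alpha = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 3.0f);

    // Allocate device buffers and copy the inputs over
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y, on the GPU
    cublasDestroy(handle);

    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);  // 2*1 + 3 = 5
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Built with something like `nvcc saxpy.cu -lcublas`; no hand-written kernel is needed, which is the point of the drop-in libraries.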

Your CUDA applications can be deployed across all NVIDIA GPU families available on-premise and on GPU instances in the cloud. Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single GPU workstations to cloud installations with thousands of GPUs.
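A minimal sketch of the multi-GPU pattern described above, using only the runtime API (function and variable names here are illustrative): each visible device is given its own slice of the data and its own kernel launch.

```cuda
// Sketch: distribute identical work across all visible GPUs (illustrative names).
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// device_chunks[d] is assumed to be a buffer already allocated on device d.
void scale_on_all_gpus(float **device_chunks, int chunk_len, float factor) {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);  // subsequent runtime calls target GPU d
        scale<<<(chunk_len + 255) / 256, 256>>>(device_chunks[d], chunk_len, factor);
    }
    for (int d = 0; d < device_count; ++d) {  // wait for every device to finish
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
}
```

The same loop structure scales from a single workstation GPU to a node with many; launches on different devices run concurrently because kernel launches are asynchronous with respect to the host.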

The toolkit also provides an IDE with graphical and command-line tools for debugging, identifying performance bottlenecks on the GPU and CPU, and getting context-sensitive optimization guidance. You can develop applications in a programming language you already know, including C, C++, Fortran, and Python.
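For readers new to the programming model, a minimal complete CUDA C++ program looks like this: the `__global__` function runs on the GPU, one thread per element, while `main()` is ordinary host C++ (names are illustrative):

```cuda
// Minimal CUDA C++ example: element-wise vector addition.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 256;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory: visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0f * i; }

    add<<<1, n>>>(a, b, c, n);  // launch 1 block of n threads
    cudaDeviceSynchronize();    // wait for the kernel before reading c

    printf("c[10] = %f\n", c[10]);  // 10 + 20 = 30
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compiled with `nvcc add.cu -o add`; the same source file contains both host and device code.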

To get started, browse the online getting-started resources, optimization guides, and illustrative examples, and collaborate with the rapidly growing developer community. Download NVIDIA CUDA Toolkit for PC today!

Features and Highlights
  • GPU Timestamp: Start timestamp
  • Method: GPU method name. This is either "memcpy*" for memory copies or the name of a GPU kernel. Memory copies carry a suffix that describes the type of transfer; for example, "memcpyDToHasync" denotes an asynchronous transfer from device memory to host memory
  • GPU Time: The execution time of the method on the GPU
  • CPU Time: The sum of GPU time and the CPU overhead to launch the method. At the driver-generated data level, CPU time is only the CPU launch overhead for non-blocking methods; for blocking methods it is the sum of GPU time and CPU overhead. All kernel launches are non-blocking by default, but become blocking if any profiler counters are enabled. Asynchronous memory copy requests in different streams are non-blocking
  • Stream Id: Identification number for the stream
  • Columns shown only for kernel methods:
  • Occupancy: The ratio of the number of active warps per multiprocessor to the maximum number of active warps
  • Profiler counters: Refer to the profiler counters section for the list of supported counters
  • grid size: The number of blocks in the grid along the X, Y, and Z dimensions, shown as [num_blocks_X num_blocks_Y num_blocks_Z] in a single column
  • block size: The number of threads in a block along the X, Y, and Z dimensions, shown as [num_threads_X num_threads_Y num_threads_Z] in a single column
  • dyn smem per block: Dynamic shared memory size per block in bytes
  • sta smem per block: Static shared memory size per block in bytes
  • reg per thread: Number of registers per thread
  • Columns shown only for memory copy methods:
  • mem transfer size: Memory transfer size in bytes
  • host mem transfer type: Specifies whether a memory transfer uses "Pageable" or "Page-locked" memory
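Several of the kernel columns above are simply the parameters of the launch configuration. A sketch (kernel and variable names are illustrative) showing where each value comes from:

```cuda
// Illustrative launch whose parameters map to the profiler columns above.
#include <cuda_runtime.h>

__global__ void myKernel(float *out) {
    extern __shared__ float tile[];  // dynamic shared memory (dyn smem per block)
    __shared__ float fixed[32];      // static shared memory (sta smem per block)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    fixed[threadIdx.x % 32] = 0.0f;
    tile[threadIdx.x] = fixed[threadIdx.x % 32];
    out[i] = tile[threadIdx.x];
}

void launch(float *d_out) {
    dim3 grid(64, 2, 1);    // grid size column:  [64 2 1]
    dim3 block(128, 1, 1);  // block size column: [128 1 1]
    size_t dynSmem = block.x * sizeof(float);  // dyn smem per block: 512 bytes
    myKernel<<<grid, block, dynSmem>>>(d_out);
}
```

Static shared memory and registers per thread, by contrast, are determined by the kernel's code at compile time rather than at launch.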
PROS
  • Massive Parallel Processing Power
  • Optimized for NVIDIA GPUs
  • Strong Developer Support
  • Wide AI & HPC Applications
  • Seamless Integration with Libraries
CONS
  • Limited to NVIDIA GPUs
  • Steep Learning Curve
  • High Power Consumption
  • Hardware Upgrade Costs
  • Not Ideal for All Workloads
Also Available: Download NVIDIA CUDA Toolkit for Mac


What's new in this version:

General CUDA:
- MPS client termination is now supported on Tegra platforms (for L4T users, starting with JetPack 7.0)
- Extended CUDA in Graphics (CIG) mode now supports Vulkan, expanding beyond the previous DirectX-only implementation
- CPU NUMA allocation support through cuMemCreate and cuMemAllocAsync is now available on Windows when using the driver in WDDM and MCDM modes, expanding this previously Linux-only feature
- CUDA Graphs functionality has been enhanced to support the inclusion of memory nodes in child graphs
- CUDA Toolkit 12.9 adds compiler target support for SM architecture 10.3 (sm_103, sm_103f, and sm_103a), enabling development for the latest GPU architectures with specific optimizations for each variant
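Selecting among the three SM 10.3 variants is done through the usual architecture flags; a sketch (the source file name is made up, and a CUDA 12.9 toolchain is assumed):

```shell
# Illustrative nvcc invocations for the new SM 10.3 compiler targets.
nvcc -arch=sm_103  app.cu -o app   # baseline sm_103 target
nvcc -arch=sm_103f app.cu -o app   # family-specific features variant
nvcc -arch=sm_103a app.cu -o app   # architecture-specific features variant
```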

CUDA Toolkit 12.9 introduces compiler support for a new target architecture class: family-specific architectures. See the NVIDIA blog post on family-specific architecture features to learn more.
Multiple enhancements to NVML and nvidia-smi:
Added counters (in microseconds) for the throttling time for the following reasons:
- nvmlClocksEventReasonGpuIdle
- nvmlClocksEventReasonApplicationsClocksSetting
- nvmlClocksEventReasonSwPowerCap
- nvmlClocksThrottleReasonHwSlowdown
- nvmlClocksEventReasonSyncBoost
- nvmlClocksEventReasonSwThermalSlowdown
- nvmlClocksThrottleReasonHwThermalSlowdown
- nvmlClocksThrottleReasonHwPowerBrakeSlowdown
- nvmlClocksEventReasonDisplayClockSetting
- Improved consistency for device identification between CUDA and NVML
- Added NVML chip-to-chip (C2C) telemetry APIs
- Added CTXSW metrics
- Implemented GPU average power counters
- Added PCIe bind/unbind events
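The throttling counters above surface through NVML and, at the command line, through nvidia-smi; a quick way to inspect the current throttle state (exact output fields vary by driver version):

```shell
# Show current performance state and active clock throttle reasons
nvidia-smi -q -d PERFORMANCE

# Or query just the active throttle-reason bitmask in CSV form
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
```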

CUDA Compiler:
- Added a new compiler option --Ofast-compile=<level>, supported in nvcc, nvlink, nvrtc, and ptxas. This option prioritizes faster compilation over optimizations at varying levels, helping to accelerate development cycles. Refer to the fast-compile documentation for more details.
- Added a new compiler option --frandom-seed=<seed>, supported in nvcc and nvrtc. The user-specified seed replaces the random numbers used when generating symbol and variable names, so the option can be used to produce deterministically identical PTX and object files. If the value is a valid number (decimal, octal, or hex), it is used directly as the seed; otherwise, the CRC value of the passed string is used instead. NVCC also forwards the option and its value to the host compiler when that compiler is GCC or Clang, since both support -frandom-seed as well. Users are responsible for assigning different seeds to different files.
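How the two new options might be used in practice (a sketch: file names are made up, a CUDA 12.9 toolchain is assumed, and `max` is assumed to be a supported fast-compile level per the fast-compile documentation):

```shell
# Prioritize compile speed over optimization while iterating on a kernel
nvcc --Ofast-compile=max -c kernel.cu -o kernel.o

# Reproducible builds: the same seed should yield byte-identical object files
nvcc --frandom-seed=0x1234 -c kernel.cu -o a.o
nvcc --frandom-seed=0x1234 -c kernel.cu -o b.o
cmp a.o b.o && echo "identical"
```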

CUDA Developer Tools:
- For changes to nvprof and Visual Profiler, see the changelog
- For new features, improvements, and bug fixes in Nsight Systems, see the changelog
- For new features, improvements, and bug fixes in Nsight Visual Studio Edition, see the changelog
- For new features, improvements, and bug fixes in CUPTI, see the changelog
- For new features, improvements, and bug fixes in Nsight Compute, see the changelog
- For new features, improvements, and bug fixes in Compute Sanitizer, see the changelog
- For new features, improvements, and bug fixes in CUDA-GDB, see the changelog

Fixed:
CUDA Compiler:
- Resolved a segmentation fault that occurred when a lambda expression used as a class template argument was invoked inside a function template
- Resolved NVCC internal assertion triggered when inheriting protected constructors
- Resolved issue where C++20 template parameter lists in lambdas and the new auto syntax were causing nvcc to fail
- Fixed issue with incorrect C++20 if constexpr(concept) usage in template lambda
- Resolved template compile error in CUDA 12.6.1 when using MSVC with C++20
- Fixed NVCC issue with incorrect initialization of std::vector of std::any in C++ code