Faster ptt by 36000 · Pull Request #39 · dipy/GPUStreamlines

36000 · 2026-03-17T14:59:25Z

No description provided.

Copilot

Pull request overview

This PR optimizes the PTT (Parallel Transport Tractography) direction getter by switching from global device memory to CUDA texture memory for ODF data, which enables hardware-accelerated trilinear interpolation. It also adds caching for FA and CSD ODF computations, and introduces a seed for reproducible seed generation.

Changes:

Replaced global memory reads with CUDA 3D texture memory lookups for PTT, adding texture allocation/deallocation in the GPU tracker and a new ptt_init.cu kernel for texture-based initial direction finding.
Added a DATA_T template parameter to genStreamlinesMergeProb_k and tracker_d to support both raw pointer and texture object data access paths.
Added FA and CSD ODF disk caching (--cache-dir) and a --seed-seed argument for reproducible seed generation in the run script.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`run_gpu_streamlines.py`	Added `--cache-dir` and `--seed-seed` CLI args; FA/ODF caching logic
`cuslines/cuda_python/cu_tractography.py`	Texture memory allocation/deallocation for PTT; removed unused `MEGABYTE` import
`cuslines/cuda_python/cu_direction_getters.py`	Updated kernel template names with `DATA_T`; handle texture objects in launch calls
`cuslines/cuda_c/ptt_init.cu`	New kernel for PTT initial direction finding using texture memory
`cuslines/cuda_c/ptt.cu`	Changed `pmf` parameter type to `cudaTextureObject_t`; replaced trilinear interp with `tex3D`
`cuslines/cuda_c/generate_streamlines_cuda.cu`	Added `DATA_T` template parameter to `tracker_d` and `genStreamlinesMergeProb_k`; included `ptt_init.cu`
`cuslines/cuda_python/cu_propagate_seeds.py`	Removed a TODO comment

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

cuslines/cuda_python/cu_tractography.py

+        if True:
+            extent = runtime.make_cudaExtent(self.dimt * self.dimx, self.dimy, self.dimz)
+            dataf_array = checkCudaErrors(runtime.cudaMalloc3DArray(channelDesc, extent, 0))
+
+            data_f_rearranged = np.transpose(self.dataf, (3, 0, 1, 2)).reshape(self.dimt * self.dimx, self.dimy, self.dimz)
+            data_f_rearranged = np.ascontiguousarray(data_f_rearranged, dtype=REAL_DTYPE)
+            data_f_rearranged = np.transpose(data_f_rearranged, (2, 1, 0))
+            data_f_rearranged = np.ascontiguousarray(data_f_rearranged, dtype=REAL_DTYPE)
+            copyParams = runtime.cudaMemcpy3DParms()
+            copyParams.srcPtr = runtime.make_cudaPitchedPtr(
+                data_f_rearranged.ctypes.data,
+                self.dimt * self.dimx * REAL_SIZE,
+                self.dimt * self.dimx,
+                self.dimy)
+        else:
+            extent = runtime.make_cudaExtent(self.dimx, self.dimy, self.dimz * self.dimt)
+            dataf_array = checkCudaErrors(runtime.cudaMalloc3DArray(channelDesc, extent, 0))
+
+            data_f_rearranged = np.transpose(self.dataf, (3, 2, 1, 0))
+            data_f_rearranged = np.ascontiguousarray(data_f_rearranged, dtype=REAL_DTYPE)
+            copyParams = runtime.cudaMemcpy3DParms()
+            copyParams.srcPtr = runtime.make_cudaPitchedPtr(
+                data_f_rearranged.ctypes.data,
+                self.dimx * REAL_SIZE,
+                self.dimx,
+                self.dimy
+            )


cuslines/cuda_c/ptt_init.cu

+                // REAL_T z_query = point.z + (REAL_T)(i * dimz);
+                // __pmf_data_sh[i] = tex3D<REAL_T>(*pmf, point.x, point.y, z_query);
+


cuslines/cuda_c/ptt.cu

+    // REAL_T z_query = pos.z + (REAL_T)(closest_odf_idx * dimz);
+    // return tex3D<REAL_T>(*pmf, pos.x, pos.y, z_query);


run_gpu_streamlines.py

+fa_cache_file = op.join(args.cache_dir, "fa.npy")
+if args.cache_dir != '' and op.exists(fa_cache_file):
+    print("Loading FA from cache")
+    FA = np.load(fa_cache_file)


run_gpu_streamlines.py

+  if args.cache_dir != '' and op.exists(csd_odf_cache_file):
+      print("Loading CSD ODF from cache")
+      data = np.load(csd_odf_cache_file)
  else:
-    response_gtab = gtab
-  response, _ = auto_response_ssst(
-    response_gtab,
-    data,
-    roi_radii=10,
-    fa_thr=0.7)
-  model = ConstrainedSphericalDeconvModel(response_gtab, response, sh_order=args.sh_order)
-  fit = model.fit(data, mask=(FA >= args.fa_threshold))
-  data = fit.odf(sphere).clip(min=0)
+    print("Running CSD model...")
+    unique_bvals = unique_bvals_magnitude(gtab.bvals)
+    if len(unique_bvals[unique_bvals > 0]) > 1:
+      low_shell_idx = gtab.bvals <= unique_bvals[unique_bvals > 0][0]
+      response_gtab = gradient_table( # reinit as single shell for this CSD
+        gtab.bvals[low_shell_idx],
+        gtab.bvecs[low_shell_idx])
+      data = data[..., low_shell_idx]
+    else:
+      response_gtab = gtab
+    response, _ = auto_response_ssst(
+      response_gtab,
+      data,
+      roi_radii=10,
+      fa_thr=0.7)
+    model = ConstrainedSphericalDeconvModel(response_gtab, response, sh_order=args.sh_order)
+    fit = model.fit(data, mask=(FA >= args.fa_threshold))
+    data = fit.odf(sphere).clip(min=0)
+
+    if args.cache_dir != '':
+        np.save(csd_odf_cache_file, data)


cuslines/cuda_c/ptt.cu

    REAL3_T *__probing_pos_sh = probing_pos_sh + tidy;

-    const REAL_T probe_step_size = ((step_size / PROBE_FRAC) / (PROBE_QUALITY - 1));
+    const REAL_T probe_step_size = ((step_size / PROBE_FRAC) / (PROBE_QUALITY - 1));  // TODO: is this -1 necessary?


cuslines/cuda_c/ptt.cu

+    const REAL_T probe_step_size = ((step_size / PROBE_FRAC) / (PROBE_QUALITY - 1));  // TODO: is this -1 necessary?
    const REAL_T max_curvature = 2.0 * SIN(max_angle / 2.0) / (step_size / PROBE_FRAC); // This seems to work well
-    const REAL_T absolpmf_thresh = PMF_THRESHOLD_P * max_d<BDIM_X>(dimt, pmf, REAL_MIN);
+    const REAL_T absolpmf_thresh = 0; // PMF_THRESHOLD_P * max_d<BDIM_X>(dimt, pmf, REAL_MIN); TODO: try 2.84 for max; i mean, this is completely broken


36000 added 9 commits March 17, 2026 19:19

transition to tex

5f338dc

more prog

e55bea2

in progress

9f83d5d

in progress

436854c

add rng seed for seeds

11a5465

in progress

5af7623

in progress

bf705d9

in progress

bbf107b

WIP ptt tex memory

c8299f3

Copilot AI review requested due to automatic review settings March 17, 2026 14:59

Copilot started reviewing on behalf of 36000 March 17, 2026 14:59 View session

Copilot AI reviewed Mar 17, 2026

View reviewed changes

36000 added 3 commits March 18, 2026 11:15

return to st

8e3e550

remove this

ad20879

massive improvement to PTT: ODF LUT

fc32030

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster ptt#39

Faster ptt#39
36000 wants to merge 12 commits intodipy:masterfrom
36000:faster_ptt

36000 commented Mar 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// REAL_T z_query = point.z + (REAL_T)(i * dimz);
		// __pmf_data_sh[i] = tex3D<REAL_T>(*pmf, point.x, point.y, z_query);

		// REAL_T z_query = pos.z + (REAL_T)(closest_odf_idx * dimz);
		// return tex3D<REAL_T>(*pmf, pos.x, pos.y, z_query);

Conversation

36000 commented Mar 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants