Skip to content

Faster ptt#39

Open
36000 wants to merge 12 commits intodipy:masterfrom
36000:faster_ptt
Open

Faster ptt#39
36000 wants to merge 12 commits intodipy:masterfrom
36000:faster_ptt

Conversation

@36000
Copy link
Collaborator

@36000 36000 commented Mar 17, 2026

No description provided.

Copilot AI review requested due to automatic review settings March 17, 2026 14:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the PTT (Parallel Transport Tractography) direction getter by switching from global device memory to CUDA texture memory for ODF data, which enables hardware-accelerated trilinear interpolation. It also adds caching for FA and CSD ODF computations, and introduces a seed for reproducible seed generation.

Changes:

  • Replaced global memory reads with CUDA 3D texture memory lookups for PTT, adding texture allocation/deallocation in the GPU tracker and a new ptt_init.cu kernel for texture-based initial direction finding.
  • Added a DATA_T template parameter to genStreamlinesMergeProb_k and tracker_d to support both raw pointer and texture object data access paths.
  • Added FA and CSD ODF disk caching (--cache-dir) and a --seed-seed argument for reproducible seed generation in the run script.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
run_gpu_streamlines.py Added --cache-dir and --seed-seed CLI args; FA/ODF caching logic
cuslines/cuda_python/cu_tractography.py Texture memory allocation/deallocation for PTT; removed unused MEGABYTE import
cuslines/cuda_python/cu_direction_getters.py Updated kernel template names with DATA_T; handle texture objects in launch calls
cuslines/cuda_c/ptt_init.cu New kernel for PTT initial direction finding using texture memory
cuslines/cuda_c/ptt.cu Changed pmf parameter type to cudaTextureObject_t; replaced trilinear interp with tex3D
cuslines/cuda_c/generate_streamlines_cuda.cu Added DATA_T template parameter to tracker_d and genStreamlinesMergeProb_k; included ptt_init.cu
cuslines/cuda_python/cu_propagate_seeds.py Removed a TODO comment

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +170 to +196
if True:
extent = runtime.make_cudaExtent(self.dimt * self.dimx, self.dimy, self.dimz)
dataf_array = checkCudaErrors(runtime.cudaMalloc3DArray(channelDesc, extent, 0))

data_f_rearranged = np.transpose(self.dataf, (3, 0, 1, 2)).reshape(self.dimt * self.dimx, self.dimy, self.dimz)
data_f_rearranged = np.ascontiguousarray(data_f_rearranged, dtype=REAL_DTYPE)
data_f_rearranged = np.transpose(data_f_rearranged, (2, 1, 0))
data_f_rearranged = np.ascontiguousarray(data_f_rearranged, dtype=REAL_DTYPE)
copyParams = runtime.cudaMemcpy3DParms()
copyParams.srcPtr = runtime.make_cudaPitchedPtr(
data_f_rearranged.ctypes.data,
self.dimt * self.dimx * REAL_SIZE,
self.dimt * self.dimx,
self.dimy)
else:
extent = runtime.make_cudaExtent(self.dimx, self.dimy, self.dimz * self.dimt)
dataf_array = checkCudaErrors(runtime.cudaMalloc3DArray(channelDesc, extent, 0))

data_f_rearranged = np.transpose(self.dataf, (3, 2, 1, 0))
data_f_rearranged = np.ascontiguousarray(data_f_rearranged, dtype=REAL_DTYPE)
copyParams = runtime.cudaMemcpy3DParms()
copyParams.srcPtr = runtime.make_cudaPitchedPtr(
data_f_rearranged.ctypes.data,
self.dimx * REAL_SIZE,
self.dimx,
self.dimy
)
Comment on lines +49 to +51
// REAL_T z_query = point.z + (REAL_T)(i * dimz);
// __pmf_data_sh[i] = tex3D<REAL_T>(*pmf, point.x, point.y, z_query);

Comment on lines +68 to +69
// REAL_T z_query = pos.z + (REAL_T)(closest_odf_idx * dimz);
// return tex3D<REAL_T>(*pmf, pos.x, pos.y, z_query);
Comment on lines +180 to +183
fa_cache_file = op.join(args.cache_dir, "fa.npy")
if args.cache_dir != '' and op.exists(fa_cache_file):
print("Loading FA from cache")
FA = np.load(fa_cache_file)
Comment on lines +232 to +256
if args.cache_dir != '' and op.exists(csd_odf_cache_file):
print("Loading CSD ODF from cache")
data = np.load(csd_odf_cache_file)
else:
response_gtab = gtab
response, _ = auto_response_ssst(
response_gtab,
data,
roi_radii=10,
fa_thr=0.7)
model = ConstrainedSphericalDeconvModel(response_gtab, response, sh_order=args.sh_order)
fit = model.fit(data, mask=(FA >= args.fa_threshold))
data = fit.odf(sphere).clip(min=0)
print("Running CSD model...")
unique_bvals = unique_bvals_magnitude(gtab.bvals)
if len(unique_bvals[unique_bvals > 0]) > 1:
low_shell_idx = gtab.bvals <= unique_bvals[unique_bvals > 0][0]
response_gtab = gradient_table( # reinit as single shell for this CSD
gtab.bvals[low_shell_idx],
gtab.bvecs[low_shell_idx])
data = data[..., low_shell_idx]
else:
response_gtab = gtab
response, _ = auto_response_ssst(
response_gtab,
data,
roi_radii=10,
fa_thr=0.7)
model = ConstrainedSphericalDeconvModel(response_gtab, response, sh_order=args.sh_order)
fit = model.fit(data, mask=(FA >= args.fa_threshold))
data = fit.odf(sphere).clip(min=0)

if args.cache_dir != '':
np.save(csd_odf_cache_file, data)
REAL3_T *__probing_pos_sh = probing_pos_sh + tidy;

const REAL_T probe_step_size = ((step_size / PROBE_FRAC) / (PROBE_QUALITY - 1));
const REAL_T probe_step_size = ((step_size / PROBE_FRAC) / (PROBE_QUALITY - 1)); // TODO: is this -1 necessary?
const REAL_T probe_step_size = ((step_size / PROBE_FRAC) / (PROBE_QUALITY - 1)); // TODO: is this -1 necessary?
const REAL_T max_curvature = 2.0 * SIN(max_angle / 2.0) / (step_size / PROBE_FRAC); // This seems to work well
const REAL_T absolpmf_thresh = PMF_THRESHOLD_P * max_d<BDIM_X>(dimt, pmf, REAL_MIN);
const REAL_T absolpmf_thresh = 0; // PMF_THRESHOLD_P * max_d<BDIM_X>(dimt, pmf, REAL_MIN); TODO: try 2.84 for max; i mean, this is completely broken
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants