Conversation
Codecov Report
❌ Patch coverage: additional details and impacted files below.
@@ Coverage Diff @@
## main #3426 +/- ##
==========================================
+ Coverage 91.92% 91.96% +0.03%
==========================================
Files 37 37
Lines 32153 32354 +201
Branches 5143 5146 +3
==========================================
+ Hits 29556 29753 +197
- Misses 2264 2270 +6
+ Partials 333 331 -2
Thank you for opening this, @lkirk. I'm excited to see this implemented! Do we need to do any testing to demonstrate that there are no memory leaks or anything like that, which wouldn't be included in the test suite? I know it could be found in the tests, but it may be helpful to spell out here how the API would work for a two- or more-way stat? Would it be:
Yes, this needs leak checking and documentation, which I can add. I mostly wanted to make sure the user interface made sense first.
Hi, @lkirk! This looks pretty straightforward. To have a careful opinion about the API, I think I need to see a reasonably careful docstring? For instance, what exactly are the arguments to
Hi @petrelharp, thanks for taking a look. I didn't offer much of a description before; does this help?

Summary Function

In the
The
I've been writing summary functions with the signature
Where X is a matrix whose rows correspond to sample sets and columns correspond to haplotype counts (
Why lay the data out this way? Because numpy is row-major, iterating over the arrays gives us rows. That makes it easy to select the haplotype counts in one line:
Note: under the hood, the data is still laid out in an optimal way (at least as far as I can tell -- see this note).
In addition, the sample set sizes array is shaped so that we can use it to normalise the counts in one go (
Finally, since we're no longer controlling the summary functions ourselves, the user has control over polarisation (see

Normalisation

Perhaps the most clunky thing about this API is that the user will need a normalisation function for multiallelic data (the norm_f parameter). Its function signature is the same as the code internal to
The default normalisation function of
I would have loved to be able to pass
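The row-per-sample-set layout described above can be sketched in plain numpy. This is a hypothetical summary function (the name and exact calling convention are my assumptions, not the PR's final API) that computes ``D`` for each sample set from a haplotype-count matrix ``X`` with columns ordered ``[AB, Ab, aB]`` and a vector ``n`` of sample set sizes:

```python
import numpy as np

def D_summary(X, n):
    # X: (num_sample_sets, 3) haplotype counts per row, ordered [AB, Ab, aB].
    # n: vector of sample set sizes.
    out = np.zeros(len(X))
    for k, row in enumerate(X):
        pAB, pAb, paB = row / n[k]  # counts -> frequencies in one line
        pA = pAB + pAb
        pB = pAB + paB
        out[k] = pAB - pA * pB
    return out
```

Because ``X`` is row-major, iterating over it yields one sample set's counts per step, matching the data-locality argument above.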
I've looked at the code pretty carefully, and things look good besides the comments. Nice work!
Hi @petrelharp, I appreciate you taking a look. I've responded to everything with how I plan to resolve each comment. I did have one clarifying question here. I will clean this up tomorrow.
Hi, @lkirk - I went ahead and made a few adjustments to how the examples are chosen for testing; if you see something wrong or disagree, feel free to revert. Rationale:
Hm, I also changed the tests that pulled out "all_fields" to explicitly call that function. Now I see that we have a fixture defined for that one. Do you mind removing that function call and using the fixture instead? So that just means doing
and then using
I'd do this, but I don't want to risk making conflicting edits.
I'm happy to look over this at some stage - please ping me when you'd like input!
Ah, just noticed your expanded description. So, to answer your question: the API looks good. I may have some other comments now, but will wait for you to ping me when it's ready to read carefully (including the docstring).
Thanks @petrelharp, glad to know it seems reasonable. I'm currently running down a bug that appears at first glance to be macOS-specific (it shook out with my updates). I'm hoping to be done with this in the next few hours; I'll summarize the changes and ping you when ready. For now, you might see a few more spurious pushes until I figure out what's going on. I don't have any way to test this locally on a mac.
python/tskit/trees.py (Outdated)

```python
return result.reshape(result.shape[:2])
# Orient the data so that the first dimension is the sample set so that
# we get one LD matrix per sample set.
return result.swapaxes(0, 2).swapaxes(1, 2)
```
I'm confused about this: since the summary function takes in all the sample sets at once, it doesn't have any notion of "one LD matrix per sample set"?
(and if this is necessary, note that np.moveaxis would do this in a single operation)
For data locality when computing stats over sample sets, we want the sample set to be the last dimension. However, for usability, @apragsdale and I thought it'd be better to have the sample set as the first dimension in the output. This makes it possible to access individual LD matrices with result[0], etc., compared to result[:,:,0]. So, moving the axis will be necessary. I hadn't seen np.moveaxis before; the LD matrix code also uses swapaxes twice and would benefit from this change as well.
In summary, we now move (m, m, k) to (k, m, m) with np.moveaxis, where m is the number of sites/trees and k is the number of sample sets (output dimensions).
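The axis shuffle described here is easy to check on a toy array. This sketch (shapes chosen arbitrarily) confirms that the two swapaxes calls and a single np.moveaxis produce the same (k, m, m) result:

```python
import numpy as np

# Toy (m, m, k) result: m site/tree pairs per axis, k sample sets.
m, k = 4, 3
result = np.arange(m * m * k).reshape(m, m, k)

via_swap = result.swapaxes(0, 2).swapaxes(1, 2)  # two-step shuffle
via_move = np.moveaxis(result, -1, 0)            # single-call equivalent

assert via_move.shape == (k, m, m)
assert np.array_equal(via_swap, via_move)
```

Both produce a view (no copy), so the choice is about readability rather than performance.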
But now that I look more closely, say, at diversity
```python
In [6]: ts.diversity(sample_sets=[A, B], windows="sites", mode="site").shape
Out[6]: (73, 2)
```
I realize that this decision was a deviation from other stats, where sample set is the last dimension.
ah. we definitely want this to match the other stats. I had not done that double-checking, but had assumed you had.
This was a conscious decision with @apragsdale, we thought it was a bit more convenient to access single LD matrices with result[0] instead of result[:,:,0], but it sounds like we should change to the latter (last dimension is sample set). I've chatted with @apragsdale about this and he sees the logic in this change. Since we use the C statistics to test the general stats, if I make the change for our general stat, I'll need to do it across the board. I think it'll only affect a few tests that explicitly test output dimensions.
This does sound like a pain, and given you've thought about this hard, I could be convinced otherwise!
…'t have to add a dimension at the end for scalar operations
… fix norm func for r2_ij
…lematic in practice
@petrelharp, thank you for taking such a close look. I'll summarize what was done since your last review.

Testing

The tests have been cleaned up substantially. Your comment about pytest fixtures gave me the idea to put a few tree sequences in their own fixtures, where I assert the properties directly in the fixture itself. I've moved away from hard-coding the number of expected dimensions in the output (python C tests). I wasn't intending on keeping the platform code in the test, but I wanted a single clean commit showing exactly the (bug?) I was running into. My understanding is that most PRNGs are portable across architectures, so it was rather surprising to me that the same exact
In
The python C test coverage seems to be at a reasonable level at this point.

Dimensions

I've re-reworked some of the dimension handling code for normalization. The changes you observed before were an attempt to remove the need for

Documentation

I've drafted up a docstring for this method, but I'll want to reference a few sections defined in #3416. I can rebase this branch from that one, just to get the definitions I need. If I do that, we'll want #3416 merged before this PR.
@jeromekelleher I think this is ready for a look. I don't anticipate any more changes to the CPython layer; we're still discussing a couple of points, but I think your feedback would be useful at this point.
python/tskit/trees.py (Outdated)

```python
"""
Compute two-locus statistics with a user-defined python function that
operates on haplotype counts. TODO: reference modes in two-locus docs.
On each pair of sites or trees, the summary function is provided with
```
I think you need to be more specific about the summary function: not just "is provided with" but say what, exactly, is the signature. (e.g., "must accept two arguments: the first, X, is ...")
I clarified the requirements for summary functions and norm functions in my recent change.
python/tskit/trees.py (Outdated)

```
matrix whose rows contain haplotype counts per sample set (counts of AB,
Ab, aB) and ``n`` is a vector of sample set sizes.
```
Say what AB, Ab, and aB are? Or perhaps we can refer to #3416 here?
hm, #3416 doesn't currently explain what these are
I typically call these "Haplotype Counts" with A/B as the derived and a/b as the ancestral. Though the ancestral/derived states don't matter in the cases of biallelic sites and branches (biallelic as implemented -- a single node bifurcates "ancestral" and "derived" or vice versa).
The reason that this distinction doesn't matter too much when we're only considering 2 alleles is that LD statistics are symmetric. Swapping ancestral/derived states doesn't change the outcome (I recall testing this a while ago when I was working on normalisation).
That's all to say that A/B and a/b really just represent 2 states (or two alleles). When combined with n (the sample set size), haplotype counts AB, Ab, and aB comprise all of the information we can gather about the linkage of two sites/trees. I'm not exactly sure how much of this to spell out in #3416 or in this docstring.
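The symmetry claim above can be verified numerically. This sketch (with made-up counts, and D written as the usual covariance-style statistic from the four haplotype counts) shows that relabelling ancestral/derived at one locus flips the sign of D, leaving D squared, and hence r squared, unchanged:

```python
import numpy as np

def D_from_counts(nAB, nAb, naB, nab):
    # The usual covariance-style D from the four haplotype counts.
    n = nAB + nAb + naB + nab
    return (nAB * nab - nAb * naB) / n**2

d = D_from_counts(10, 3, 4, 8)
# Relabel ancestral/derived at the first locus (A <-> a): AB/Ab swap
# roles with aB/ab, so the counts permute accordingly.
d_swapped = D_from_counts(4, 8, 10, 3)

assert np.isclose(d_swapped, -d)       # D itself flips sign...
assert np.isclose(d_swapped**2, d**2)  # ...so D^2 (and r^2) are unchanged.
```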
python/tskit/trees.py (Outdated)

```
.. note::
    Because we are operating on very small matrices/vectors, vectorised
    operations are often times slower than operations on scalars. Simply
    returning ``[value]`` can be faster than returning
    ``value[np.newaxis,]`` or ``np.expand_dims(value, 0)``.
```
I don't understand the significance of this? Perhaps be a bit more explicit, if it's a big deal, or omit this?
(what I don't get is why you'd be returning these things)
Perhaps this note is better suited for long-form documentation instead of in the docstring -- I've removed it. I'm mostly trying to guide people to "best practices" that I've observed while testing this code.
There are two concerns wrapped up into one here:

- Vectorized numpy array operations in the hot loop on very small vectors/matrices are more expensive than performing scalar operations. The examples in test_ld_matrix.py mix these a bit for the sake of readability. But it can be good to know where potential bottlenecks in summary functions will be. Perhaps this is a better fit for the documentation.
- The summary functions must return a 1D array with length=result_dim; these are just different ways of constructing the output array. It turns out that the best way is to simply construct the output array by wrapping it in a python list (i.e. [value]), over using np.expand_dims(value, 0).
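For what it's worth, the three constructions mentioned are interchangeable in output; a quick sketch (the scalar value is arbitrary) shows they all yield the same 1D, length-1 array:

```python
import numpy as np

value = 0.25  # a scalar statistic for a single output dimension

# All three produce the same 1-D, length-1 output array; per the note,
# the plain-list wrapping tends to be cheapest for tiny outputs.
a = np.asarray([value])
b = np.expand_dims(np.asarray(value), 0)
c = np.asarray(value)[np.newaxis]

assert a.shape == b.shape == c.shape == (1,)
assert np.array_equal(a, b) and np.array_equal(b, c)
```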
python/tskit/trees.py (Outdated)

```
What follows is an example of computing ``D`` from a tree sequence. Many
more examples can be found in the test suite
``test_ld_matrix.py::GeneralStatsFuncs``. Let's begin with our summary
```
Perhaps link to the ld_matrix docs for D? And good idea, but I don't think we want to refer to the test suite in the docstring: if you'd like to leave this pointer in, perhaps put a comment in the function body?
I removed the reference to GeneralStatsFuncs. I do think people should look at the GeneralStatsFuncs class to get an idea for how to implement summary functions. They're tested and implement all of the C summary functions. I suppose I can directly address this in the documentation that will accompany this code.
python/tskit/trees.py (Outdated)

```python
pB = paB + pAB
return pAB - (pA * pB)
```

```
``norm_f`` is a normalisation function used to combine all computed
```
I think we need to give the precise requirements for the norm function here.
I changed the language to be more precise.
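As a sketch of the default weighting described elsewhere in the thread (each of the n_A * n_B allele pairs weighted by 1/(n_A * n_B)), with entirely made-up per-allele-pair values, the combined statistic reduces to the arithmetic mean over allele pairs:

```python
import numpy as np

# Hypothetical per-allele-pair summary values for one pair of sites
# with n_A = 3 and n_B = 2 alleles (the numbers are made up).
n_A, n_B = 3, 2
per_pair_stats = np.array([0.1, 0.2, 0.0, 0.3, 0.1, 0.1])

# The default 1/(n_A * n_B) weighting assigns each allele pair the same
# weight, so the combined statistic is just the arithmetic mean.
combined = (per_pair_stats / (n_A * n_B)).sum()

assert np.isclose(combined, per_pair_stats.mean())
```

This is only an illustration of the weighting scheme; the actual norm-function signature is defined by the PR, not reproduced here.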
python/tskit/trees.py (Outdated)

```
will be called. The default normalisation function is identical to
``total_norm`` shown in the example below. ``hap_norm`` is required for
normalising :math:`r^2`. Both of these examples return a numpy array
with length equal to the number of sample sets (for one-way stats).
```
It's not clear what "for one-way stats" means here.
Good point, I've written a more general description that relies on the mapping from sample_set->stat->output dim instead of referring to one-way stats, which aren't a good tool for explaining what I mean here.
I've made a change that expresses this more directly I think.
```python
)
# Orient the data so that the first dimension is the sample set so that
# we get one LD matrix per result dimension
return np.moveaxis(result, -1, 0)
```
maybe we should do it this way also above at this line? (one call to moveaxis instead of two)
and, the comment here is not right yet: says "first dimension is the sample set"
```c
    ret = (PyObject *) result_matrix;
    result_matrix = NULL;
out:
    Py_XDECREF(summary_func);
```
Hm, you're decref'ing summary_func but not norm_func; that seems wrong. Have you checked for memory leaks in this code yet? (I need to look up how to do that...)
I made a pass for memory leaks and found one, I've revised the summary and norm function methods above so that they don't perform intermediate transposes (which were leaking memory -- I believe they're allocating memory for new dimensions, not the underlying data).
I performed some leak checking with/without Py_XDECREF(norm_func) and observed a leak. Fixed in my recent update.
I'll make a comment on the main thread about memory leak checking.
Great, I'll have a careful look. Let's see:
No, this is a known thing. Maybe the default RNG is different on macOS? But, not something we're intending to go after.
Sorry, I'm being slow. What's the context here? I'm not seeing any other potential issues? The python-C is hard stuff to wade through; good job getting that working. I won't pretend I understand the ramifications of all that.
I think the failing docs build is just because you need a linebreak after
Add a test for behavior on empty tree sequences (no samples, no edges, no sites). Add a "no sites" fixture. Include branch stat testing. Tune branch stat test runtime by reducing the size of `ts_100_samp_with_sites_fixture`, now named `ts_10_samp_with_sites_fixture`. Add explicit testing for output dimensions and assert that the norm func is not called on trees with only biallelic sites and in branch mode. Add a GeneralStatNormFuncs class to explicitly document possible normalisation functions and in what situations they will be used. Tune size of tree sequence in `test_general_multi_outputs` so that the test runs in a reasonable amount of time in branch mode.
The transpose operation was creating intermediate data that was not being garbage collected, resulting in a rather obvious memory leak. To mitigate this, I opt to wrap the data in a numpy array that is already transposed. The original data is natively laid out with shape (K,3), by creating a numpy array with shape (3,K) and strides (8,8*K), we can avoid an intermediate transpose operation altogether. After leak-checking again, the memory leak is gone. I also add a Py_XDECREF to remove the reference to `norm_func`, though in my leak checking I don't actually see any difference in RSS or heap size. Finally, mark a few more arrays as read-only, since the C functions that accept them as input annotate these arrays as `const`.
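The no-copy trick described in this commit has a pure-numpy analogue via as_strided. This sketch (toy buffer, float64 itemsize of 8; the exact strides in the C code depend on its native layout) shows a transposed view sharing the original memory, which is the property that avoids the leaking intermediate:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

K = 4
buf = np.arange(K * 3, dtype=np.float64).reshape(K, 3)  # native (K, 3) layout

# A transposed *view* over the same buffer: a C-contiguous (K, 3) float64
# array has strides (24, 8), so the (3, K) view uses (8, 24).
view = as_strided(buf, shape=(3, K), strides=(8, 24))

assert np.shares_memory(buf, view)  # no copy, hence no leaked intermediate
assert np.array_equal(view, buf.T)  # same data, transposed
```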
`compute_two_tree_branch_stat` did not check the error returned by the summary function (which is the return value of `compute_two_tree_branch_state_update`). I caught this because the python callback was setting an exception in the summary function and the python runtime was complaining about an exception being set, despite a successful return status. This also means that failing summary functions would (eventually) be caught, but the code would continue to run. The C python tests did not catch this because we would eventually raise the correct exception.
Provide more precise requirements for `f` and `norm_f` and give some basic understanding of what these functions are and when normalisation will be required. Attempt to fix syntax errors by adding a newline in code blocks. Clarify output dimensions in the code comments (though this might need to change since I think we'll remove the `np.moveaxis` call at the end).
lkirk left a comment
@petrelharp Sorry this update took so long.
Updates
Overall, I think I've gotten to every issue that was raised except for the np.swapaxes call at the end of two_locus_count_stat in trees.py. I'll make a separate commit for this, I think this will touch many small things.
Docstring
I made some improvements to the docstring, hopefully this is a bit closer to what you had in mind.
Leak Checking the Python C Code
In my initial memory leak test, I discovered a memory leak which led to discovering a small bug in the branch stats (we weren't propagating the return status of the summary function -- it's fixed now).
The memory leak that I found was caused by performing a transpose operation before calling the summary function, which created a new container for the data, leaving a dangling reference to the old one. It looks like we're leak-free after this fix (I also fixed the branch stat bug). I removed the intermediate transpose operation by wrapping the C data with a "pre-transposed" numpy array. I create a new array with PyArray_New so that I can provide the correct dimensions and stride, instead of relying on PyArray_SimpleNewFromData, which doesn't properly set the stride if the dimensions are swapped.
I'll include some selected results from leak checking here, but a more thorough write-up can be found in this doc, two-locus-general-leak-check.pdf, along with the script I used, leak-check.
The following plots show resident set size for the process (rss) and the size of the python heap (heap) for sampled time points. I'm running two_locus_count_stat 10,000 times in a for-loop, recording heap and rss as time progresses (using memray). The document I attached also contains a couple of examples where I induce a memory leak in two_locus_count_stat to show that we can indeed detect large and small memory leaks with this method.
Site Mode
I checked for memory leaks in the site mode with trees that contained biallelic and multiallelic sites. I also produced an error in the summary function to ensure that we're cleaning up properly on error.
Branch Mode
I checked for memory leaks in the branch mode with a much smaller tree, I also included a test case that contains the same summary function error.
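A minimal sketch of this kind of leak check in pure Python (the real run used memray; here tracemalloc tracks the Python heap and resource reports peak RSS as a coarse, Unix-only proxy, with a stand-in workload instead of two_locus_count_stat):

```python
import resource
import tracemalloc

def leak_check(fn, iters=10_000, sample_every=1_000):
    # Call fn repeatedly, sampling peak RSS (Unix-only, via resource)
    # and the traced Python heap; RSS growing while the heap stays flat
    # suggests a leak below the Python level.
    tracemalloc.start()
    samples = []
    for i in range(iters):
        fn()
        if i % sample_every == 0:
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            heap, _ = tracemalloc.get_traced_memory()
            samples.append((i, rss, heap))
    tracemalloc.stop()
    return samples

# Stand-in workload; the real check called two_locus_count_stat on a
# fixture tree sequence and recorded finer-grained numbers with memray.
samples = leak_check(lambda: [x * x for x in range(100)])
```

Plotting the sampled (rss, heap) pairs over iterations gives the same kind of curves shown in the attached document.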
…ng to ``result_dim`` fixes the docs.
Let me know when you're done here, and I'll take another pass through! Thanks for the careful attention.
Description
This is the last of the required components for the LD matrix methods. I wanted feedback on the API before I add documentation, but this method is complete and tested. The final things to do are to leak check the cpython code and add some documentation.
This feature enables a user to implement their own two-locus count statistic in python, similar to ts.sample_count_stat. User functions take two arguments: the first is a matrix of haplotype counts and the second is a vector of sample set sizes. For instance, this is how we would implement D with this API:
Since this API supports multiallelic sites, the user can also pass a normalisation function to control how the data is normalised across multiple alleles. The normalisation function is only run when computing over multiallelic sites. I've set the default to be $1/(n_A n_B)$, which is simply the arithmetic mean of the alleles in a given pair of sites. This will suffice in the majority of cases (the only outlier is $r^2$, for which there is already a python API). We also support computing statistics between sample sets.
The user would use the above summary function like this:
Where 1 specifies the length of the output array; we always require 1 dimension -- same as the ts.sample_count_stat function.

PR Checklist: