
Two locus general matrix#3426

Open
lkirk wants to merge 33 commits intotskit-dev:mainfrom
lkirk:two-locus-general-matrix

Conversation

@lkirk
Contributor

@lkirk lkirk commented Mar 10, 2026

Description

This is the last of the required components for the LD matrix methods. I wanted feedback on the API before I add documentation, but this method is complete and tested. The final things to do are to leak check the cpython code and add some documentation.

This feature enables a user to implement their own two-locus count statistic in python, similar to ts.sample_count_stat. User functions take two arguments: the first is a matrix of haplotype counts and the second is a vector of sample set sizes. For instance, this is how we would implement D with this API:

def D(X, n):
    pAB, pAb, paB = X / n
    pA = pAb + pAB
    pB = paB + pAB
    return pAB - (pA * pB)

Since this API supports multiallelic sites, the user can also pass a normalisation function to control how the data is normalised across multiple alleles. The normalisation function is only run when computing over multiallelic sites. I've set the default to be $1/(n_A n_B)$, which is simply the arithmetic mean of the alleles in a given pair of sites. This will suffice in the majority of cases (the only outlier is $r^2$, for which there is already a python API). We also support computing statistics between sample sets.
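To make the default concrete, here's a small sketch (with made-up per-allele-pair values, not code from this PR) showing that the $1/(n_A n_B)$ weight reduces to the arithmetic mean over allele pairs:

```python
import numpy as np

# hypothetical per-allele-pair statistic values at a pair of sites
# with nA = 2 and nB = 3 alleles (6 allele pairs in total)
nA, nB = 2, 3
per_pair = np.array([0.01, -0.02, 0.03, 0.00, 0.02, -0.01])

# the default weight 1/(nA * nB) applied to every pair, then summed,
# is exactly the arithmetic mean over the allele pairs
combined = (per_pair * (1 / (nA * nB))).sum()
assert np.isclose(combined, per_pair.mean())
```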

The user would use the above summary function like this:

ts.two_locus_count_stat(ts.samples(), D, 1)

where 1 specifies the length of the output array; we always require a 1-dimensional output, the same as the ts.sample_count_stat function.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@codecov

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 95.34884% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.96%. Comparing base (afaf3b9) to head (824faaa).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3426      +/-   ##
==========================================
+ Coverage   91.92%   91.96%   +0.03%     
==========================================
  Files          37       37              
  Lines       32153    32354     +201     
  Branches     5143     5146       +3     
==========================================
+ Hits        29556    29753     +197     
- Misses       2264     2270       +6     
+ Partials      333      331       -2     
Flag Coverage Δ
C 82.69% <66.66%> (-0.02%) ⬇️
c-python 77.68% <93.43%> (+0.33%) ⬆️
python-tests 96.40% <100.00%> (+<0.01%) ⬆️
python-tests-no-jit 33.20% <17.64%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
Python API 98.70% <100.00%> (+<0.01%) ⬆️
Python C interface 91.35% <95.62%> (+0.11%) ⬆️
C library 88.88% <86.66%> (+0.02%) ⬆️

@apragsdale

Thank you for opening this, @lkirk. I'm excited to see this implemented! Do we need to do any testing to demonstrate that there are no memory leaks or anything like that, which wouldn't be included in the test suite?

I know it could be found in the tests, but it may be helpful to spell out here how the API would work for a two- or more-way stat? Would it be:

ts.two_locus_count_stat([sample_list_1, sample_list_2], two_way_stat_func, 2)

@lkirk
Contributor Author

lkirk commented Mar 14, 2026

Yes, this needs leak checking and documentation, which I can add. I mostly wanted to make sure the user interface made sense first.

I know it could be found in the tests, but it may be helpful to spell out here how the API would work for a two- or more-way stat? Would it be:

ts.two_locus_count_stat([sample_list_1, sample_list_2], two_way_stat_func, 2)

The result_dim argument to the function gives the dimensions of the summary function output. The output is required to be 1D, so result_dim really tells us the length of the returned vector. In most cases, it'll be 1. See this line in the tests. So, if your summary function returned 1 value for a pair of sites, then the function call will look like this:

ts.two_locus_count_stat([sample_list_1, sample_list_2], two_way_stat_func, 1)

two_locus_count_stat was designed to work like sample_count_stat
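To illustrate the result_dim semantics, here's a hedged sketch of a summary function whose output length matches the number of sample sets, assuming (per the layout described in this thread) that X's columns correspond to sample sets; D_per_set and the input values are hypothetical, not from the PR:

```python
import numpy as np

def D_per_set(X, n):
    # X rows are haplotype counts (AB, Ab, aB); columns are sample sets,
    # so each unpacked row is a length-2 vector here
    pAB, pAb, paB = X / n
    pA = pAb + pAB
    pB = paB + pAB
    return pAB - (pA * pB)  # one value per sample set

# illustrative counts for two sample sets of sizes 10 and 12
X = np.array([[4.0, 5.0],
              [2.0, 1.0],
              [3.0, 2.0]])
n = np.array([10.0, 12.0])
assert D_per_set(X, n).shape == (2,)
```

If (as in this sketch) the summary function returns one value per sample set, the call would presumably pass result_dim=2, since result_dim is the length of the returned vector.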

@petrelharp
Contributor

Hi, @lkirk! This looks pretty straightforward. To have a careful opinion about the API, I think I need to see a reasonably careful docstring? For instance, what exactly are the arguments to f and norm_f? I can probably figure it out by tracing through the code, but it'll be less error-prone if you write it down. I'll have a look through for other issues, but it will be much easier to have the description in words of what exactly it's trying to do.

@lkirk
Contributor Author

lkirk commented Mar 14, 2026

Hi @petrelharp, thanks for taking a look. I didn't offer much of a description before, does this help?

Summary Function

In the sample_count_stat API, there are 3 required parameters:

  1. sample_sets: List of lists of node ids.
  2. f: A summary function that takes a one-dimensional array of length equal to the number of sample sets and returns a one-dimensional array.
  3. output_dim: The length of the summary function's return value.

The two_locus_count_stat function (see function signature below) takes the same basic inputs except that f takes a matrix of haplotype counts and a vector of sample set sizes as input.

I've been writing summary functions with the signature f(X, n)

Where X is a matrix whose rows correspond to haplotype counts ($w_{Ab}, w_{aB}, w_{AB}$) and whose columns correspond to sample sets. Here's a sample of what that looks like:

        sample_set 1, sample_set 2
w_Ab  [[9.            9.]
w_aB   [0.            0.]
w_AB   [0.            0.]]

Why lay the data out this way? Because numpy is row-major, iteration over the arrays gives us rows. That makes it easy to select the haplotype counts in one line:

AB, Ab, aB = X

Note: under the hood, the data is still laid out in an optimal way (at least as far as I can tell -- see this note)

In addition, n, the vector of sample set sizes, is shaped so that we can use it to normalize the counts in one go:

pAB, pAb, paB = X / n
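Putting the layout and the normalisation together, a minimal sketch with made-up counts (the row order AB, Ab, aB follows the unpacking above):

```python
import numpy as np

# rows: haplotype counts (AB, Ab, aB); columns: two sample sets
X = np.array([[9.0, 9.0],
              [0.0, 0.0],
              [0.0, 0.0]])
n = np.array([10.0, 12.0])  # sample set sizes

AB, Ab, aB = X         # unpacking iterates over rows
pAB, pAb, paB = X / n  # broadcasting divides each column by its set size
assert np.allclose(pAB, [0.9, 0.75])
```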

Finally, since we're no longer controlling the summary functions ourselves, the user has control over polarisation (see polarised parameter).

Normalisation

Perhaps the clunkiest thing about this API is that the user will need a normalisation function for multiallelic data (the norm_f parameter). Its signature is the same as the code internal to ld_matrix(): f(X, n, nA, nB) -> [out_dims], where X and n are as above, and nA and nB are the number of A and B alleles, respectively. Its output dimensions should be the same as our summary function's (we validate this just like we validate the output dims of the summary function).

The default normalization function of two_locus_count_stat is the total (or uniform) normalization: we average all results into one. Here's how I have defined it in the function signature:

lambda X, n, nA, nB: np.expand_dims(1 / (nA * nB), axis=0),

I would have loved to be able to pass np.mean as a summary function instead, but we do need haplotype counts to norm $r^2$ and this keeps our original ld_matrix implementation untouched (we use the same exact machinery for this method, so we have to follow the C api).
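Written out as a named function (a restatement of the lambda above, with argument names following the signature described earlier), the default would look like:

```python
import numpy as np

def total_norm(X, n, nA, nB):
    # uniform weight over all nA * nB allele pairs; X and n are unused
    # here but are part of the required norm_f signature
    return np.expand_dims(1 / (nA * nB), axis=0)

assert total_norm(None, None, 2.0, 4.0).shape == (1,)
```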

@petrelharp
Contributor

I've looked at the code pretty carefully, and things look good besides the comments. Nice work!

@lkirk
Contributor Author

lkirk commented Mar 14, 2026

Hi @petrelharp, I appreciate you taking a look. I've responded to everything with how I plan to resolve the comment. I did have one clarifying question here. I will clean this up tomorrow.

@petrelharp
Contributor

Hi, @lkirk - I went ahead and made a few adjustments to how the examples are chosen for testing; if you see something wrong or disagree, feel free to revert. Rationale:

  • Instead of listing "don't run these tests since they're slow" it's now "do run these tests", because otherwise, if someone adds a new slow example to get_example_tree_sequence, it would slow things down mysteriously
  • Instead of pulling that one test case out I just call the function that generates it. This makes it a lot more clear what's going on.

@petrelharp
Contributor

Hm, I also changed the tests that pulled out "all_fields" to explicitly call that function. Now I see that we have a fixture defined for that one. Do you mind removing that function call and using the fixture instead? So that just means doing

def test_general_one_way_two_locus_stat_multiallelic(stat, ts_fixture):

and then using ts instead of ts_fixture in the code (or renaming it).

I'd do this but don't want to risk making conflicting edits.

@jeromekelleher
Member

I'm happy to look over this at some stage - please ping me when you'd like input!

@petrelharp
Contributor

Ah, just noticed your expanded description. So, to answer your question: the API looks good. I may have some other comments now, but will wait for you to ping me when it's ready to read carefully (including the docstring).

@lkirk
Contributor Author

lkirk commented Mar 16, 2026

thanks @petrelharp, glad to know it seems reasonable. I'm currently running down a bug that appears at first glance to be macos specific (it shook out with my updates). I'm hoping to be done with this in the next few hours, I'll summarize the changes and ping you when ready. For now, you might see a few more spurious pushes until I figure out what's going on. I don't have any way to test this locally on a mac.

return result.reshape(result.shape[:2])
# Orient the data so that the first dimension is the sample set so that
# we get one LD matrix per sample set.
return result.swapaxes(0, 2).swapaxes(1, 2)
Contributor

I'm confused about this: since the summary function takes in all the sample sets at once, it doesn't have any notion of "one LD matrix per sample set"?

Contributor

(and if this is necessary, note that np.moveaxis would do this in a single operation)

Contributor Author

@lkirk lkirk Mar 17, 2026

For data locality when computing stats over sample sets, we want the sample set to be the last dimension. However, for usability, @apragsdale and I thought it'd be better to have the sample set as the first dimension in the output. This makes it possible to access individual LD matrices with result[0], etc., compared to result[:,:,0]. So, moving the axis will be necessary. I hadn't seen np.moveaxis before; the LD matrix code also uses swapaxes twice and would benefit from this change as well.

In summary, we now move (m, m, k) to (k, m, m) with np.moveaxis, where m is the number of sites/trees and k is the number of output dimensions.
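For reference, a quick standalone check (not PR code) that a single np.moveaxis call reproduces the double swapaxes:

```python
import numpy as np

m, k = 4, 2  # m sites/trees, k output dimensions
result = np.arange(m * m * k).reshape(m, m, k)

via_swaps = result.swapaxes(0, 2).swapaxes(1, 2)
via_move = np.moveaxis(result, -1, 0)  # move the last axis to the front
assert via_move.shape == (k, m, m)
assert np.array_equal(via_swaps, via_move)
```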

Contributor Author

But now that I look more closely, say, at diversity

[nav] In [6]: ts.diversity(sample_sets=[A, B], windows="sites", mode="site").shape
Out[6]: (73, 2)

I realize that this decision was a deviation from other stats, where sample set is the last dimension.

Contributor

ah. we definitely want this to match the other stats. I had not done that double-checking, but had assumed you had.

Contributor Author

This was a conscious decision with @apragsdale, we thought it was a bit more convenient to access single LD matrices with result[0] instead of result[:,:,0], but it sounds like we should change to the latter (last dimension is sample set). I've chatted with @apragsdale about this and he sees the logic in this change. Since we use the C statistics to test the general stats, if I make the change for our general stat, I'll need to do it across the board. I think it'll only affect a few tests that explicitly test output dimensions.

Contributor

This does sound like a pain, and given you've thought about this hard, I could be convinced otherwise!

@lkirk
Contributor Author

lkirk commented Mar 17, 2026

@petrelharp, thank you for taking such a close look. I'll summarize what was done since your last review.

Testing

The tests have been cleaned up substantially. Your comment about pytest fixtures gave me the idea to put a few tree sequences in their own fixtures, where I assert the properties directly in the fixture itself. ts_100_samp_with_sites_fixture has the same properties as the n=100_m=32_rho=0.5 example tree sequence, but its definition is now localized to test_ld_matrix.py, so there's less risk of changes to the example tree sequences affecting the general_ tests.

I've moved away from hard-coding the number of expected dimensions in the output (python C tests). I wasn't intending to keep the platform code in the test, but I wanted a single clean commit showing exactly the (bug?) I was running into. My understanding is that most PRNGs are portable across architectures, so it was rather surprising to me that the exact same msprime simulation produces different tree sequences (my best guess, without digging into msprime, is that there may be some floating point jitter in rejection sampling -- unless my assumptions about PRNGs are incorrect). All that being said, do you think it's appropriate to open an issue in msprime?

In test_python_c.py, I've split the biallelic and multiallelic test cases to isolate the testing of the normalization function. It's also nice to have one tree sequence defined and one set of happy-path inputs.

The python C test coverage seems to be at a reasonable level at this point.

Dimensions

I've reworked some of the dimension handling code for normalization. The changes you observed before were an attempt to remove the need for expand_dims in a lot of places -- adopting the philosophy of propagating the dimensions through to the end. I did a couple of microbenchmarks (happy to share these if you want) to decide the right thing to do here. It turns out that [value] is both the fastest and most readable. I think putting vectorized and/or fancy indexing operations in the hot loop adds an appreciable amount of overhead. I worked to get these summary functions looking as idiomatic as possible so that they can be used as a reference. There are definitely small changes that could make them faster, but I wanted to balance speed with readability. This balance is perhaps more critical here than in other places because computational cost adds up quickly in the two-locus methods.
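A rough sketch of the kind of microbenchmark described (illustrative only; the specific timings will vary by machine, and these are not the PR's benchmark scripts):

```python
import timeit
import numpy as np

value = 0.25
# three ways of constructing a 1-element result in a hot loop
t_list = timeit.timeit(lambda: [value], number=100_000)
t_newaxis = timeit.timeit(lambda: np.asarray(value)[np.newaxis], number=100_000)
t_expand = timeit.timeit(lambda: np.expand_dims(value, 0), number=100_000)
print(f"[value]: {t_list:.4f}s, newaxis: {t_newaxis:.4f}s, "
      f"expand_dims: {t_expand:.4f}s")
```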

Documentation

I've drafted up a docstring for this method, but I'll want to reference a few sections defined in #3416. I can rebase this branch from that one, just to get the definitions I need. If I do that, we'll want #3416 merged before this PR.

@lkirk
Contributor Author

lkirk commented Mar 18, 2026

@jeromekelleher I think this is ready for a look. I don't anticipate any more changes to the CPython layer, we're still discussing a couple of points, but I think your feedback would be useful at this point.

"""
Compute two-locus statistics with a user-defined python function that
operates on haplotype counts. TODO: reference modes in two-locus docs.
On each pair of sites or trees, the summary function is provided with
Contributor

I think you need to be more specific about the summary function: not just "is provided with" but say what, exactly, is the signature. (e.g., "must accept two arguments: the first, X, is ...")

Contributor Author

I clarified the requirements for summary functions and norm functions in my recent change.

Comment on lines +10951 to +10952
matrix whose rows contain haplotype counts per sample set (counts of AB,
Ab, aB) and ``n`` is a vector of sample set sizes.
Contributor

Say what AB, Ab, and aB are? Or perhaps we can refer to #3416 here?

Contributor

hm, #3416 doesn't currently explain what these are

Contributor Author

I typically call these "Haplotype Counts" with A/B as the derived and a/b as the ancestral. Though the ancestral/derived states don't matter in the cases of biallelic sites and branches (biallelic as implemented -- a single node bifurcates "ancestral" and "derived" or vice versa).

The reason that this distinction doesn't matter too much when we're only considering 2 alleles is that LD statistics are symmetric. Swapping ancestral/derived states doesn't change the outcome (I recall testing this a while ago when I was working on normalisation).

That's all to say that A/B and a/b really just represent 2 states (or two alleles). When combined with n (the sample set size), haplotype counts AB, Ab, and aB comprise all of the information we can gather about the linkage of two sites/trees. I'm not exactly sure how much of this to spell out in #3416 or in this docstring.

Comment on lines +10953 to +10958

.. note::
Because we are operating on very small matrices/vectors, vectorised
operations are often times slower than operations on scalars. Simply
returning ``[value]`` can be faster than returning
``value[np.newaxis,]`` or ``np.expand_dims(value, 0)``.
Contributor

I don't understand the significance of this? Perhaps be a bit more explicit, if it's a big deal, or omit this?

Contributor

(what I don't get is why you'd be returning these things)

Contributor Author

Perhaps this note is better suited for long-form documentation instead of in the docstring -- I've removed it. I'm mostly trying to guide people to "best practices" that I've observed while testing this code.

There are two concerns wrapped up into one here:

  1. Vectorized numpy array operations in the hot loop on very small vectors/matrices are more expensive than performing scalar operations. The examples in test_ld_matrix.py mix these a bit for the sake of readability. But, it can be good to know where potential bottlenecks in summary functions will be. Perhaps this is a better fit for the documentation.
  2. The summary functions must return a 1D array with length=result_dim, these are just different ways of constructing the output array. It turns out that the best way is to simply construct the output array by wrapping it in a python list (i.e. [value]), over using np.expand_dims(value, 0).
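Both return styles in the second point yield the same 1-D result once converted to an array; a quick sketch:

```python
import numpy as np

value = 0.125
from_list = np.asarray([value])         # wrap the value in a python list
from_expand = np.expand_dims(value, 0)  # explicit numpy construction
assert from_list.shape == from_expand.shape == (1,)
assert np.allclose(from_list, from_expand)
```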

Comment on lines +10960 to +10962
What follows is an example of computing ``D`` from a tree sequence. Many
more examples can be found in the test suite
``test_ld_matrix.py::GeneralStatsFuncs``. Let's begin with our summary
Contributor

Perhaps link to the ld_matrix docs for D? And good idea, but I don't think we want to refer to the test suite in the docstring: if you'd like to leave this pointer in, perhaps put a comment in the function body?

Contributor Author

I removed the reference to GeneralStatsFuncs. I do think people should look at the GeneralStatsFuncs class to get an idea for how to implement summary functions. They're tested and implement all of the C summary functions. I suppose I can directly address this in the documentation that will accompany this code.

pB = paB + pAB
return pAB - (pA * pB)

``norm_f`` is a normalisation function used to combine all computed
Contributor

I think we need to give the precise requirements for the norm function here.

Contributor Author

I changed the language to be more precise.

will be called. The default normalisation function is identical to
``total_norm`` shown in the example below. ``hap_norm`` is required for
normalising :math:`r^2`. Both of these examples return a numpy array
with length equal to the number of sample sets (for one-way stats).
Contributor

It's not clear what "for one-way stats" means here.

Contributor Author

Good point, I've written a more general description that relies on the mapping from sample_set->stat->output dim instead of referring to one-way stats, which aren't a good tool for explaining what I mean here.

I've made a change that expresses this more directly I think.

)
# Orient the data so that the first dimension is the sample set so that
# we get one LD matrix per result dimension
return np.moveaxis(result, -1, 0)
Contributor

maybe we should do it this way also above at this line? (one call to moveaxis instead of two)

and, the comment here is not right yet: says "first dimension is the sample set"

ret = (PyObject *) result_matrix;
result_matrix = NULL;
out:
Py_XDECREF(summary_func);
Contributor

Hm, you're decref'ing summary_func but not norm_func; that seems wrong. Have you checked for memory leaks in this code yet? (I need to look up how to do that...)

Contributor Author

I made a pass for memory leaks and found one, I've revised the summary and norm function methods above so that they don't perform intermediate transposes (which were leaking memory -- I believe they're allocating memory for new dimensions, not the underlying data).

I performed some leak checking with/without Py_XDECREF(norm_func) and observed a leak. Fixed in my recent update.

I'll make a comment on the main thread about memory leak checking.

@petrelharp
Contributor

Great, I'll have a careful look. Let's see:

  • Should there be a test that X is actually read-only?
  • I think you've not got a test that running on a ts with no sites in site mode gives the correct thing (and doesn't error)?
  • I think I see why we need both tsk_treeseq_two_locus_count_stat and tsk_treeseq_two_locus_count_general_stat; I think some explanatory comments about the difference would be very helpful here.
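For the first point, the kind of check involved might look like this (a sketch using numpy's writeable flag, not the actual test from the suite):

```python
import numpy as np

X = np.zeros((3, 2))
X.setflags(write=False)  # mark the array read-only

try:
    X[0, 0] = 1.0
    mutated = True
except ValueError:       # numpy refuses writes to a read-only array
    mutated = False
assert not mutated
```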

do you think it's appropriate to open an issue in msprime?

No, this is a known thing. Maybe the default RNG is different on macOS? But, not something we're intending to go after.

It turns out that [value] is both the fastest and most readable.

Sorry, I'm being slow. What's the context here?

I'm not seeing any other potential issues? The python-c is hard stuff to wade through; good job getting that working. I won't pretend I understand the ramifications of all that.

@petrelharp
Contributor

I think the failing docs build is just because you need a linebreak after .. code-block:: python?

lkirk added 8 commits March 20, 2026 15:03
Add a test for behavior on empty tree sequences (no samples, no edges,
no sites). Add a "no sites" fixture.

Include branch stat testing. Tune branch stat test runtime by reducing
the size of `ts_100_samp_with_sites_fixture`, now named
`ts_10_samp_with_sites_fixture`.

Add explicit testing for output dimensions and assert that the norm
func is not called on trees with only biallelic sites and in branch
mode.

Add a GeneralStatNormFuncs class to explicitly document possible
normalisation functions and in what situations they will be used.

Tune size of tree sequence in `test_general_multi_outputs` so that the
test runs in a reasonable amount of time in branch mode.
The transpose operation was creating intermediate data that was not
being garbage collected, resulting in a rather obvious memory leak. To
mitigate this, I opt to wrap the data in a numpy array that is already
transposed. The original data is natively laid out with shape (K,3), by
creating a numpy array with shape (3,K) and strides (8,8*K), we can
avoid an intermediate transpose operation altogether. After
leak-checking again, the memory leak is gone.

I also add a Py_XDECREF to remove the reference to `norm_func`, though
in my leak checking I don't actually see any difference in RSS or heap
size.

Finally, mark a few more arrays as read-only, since the C functions that
accept them as input annotate these arrays as `const`.
`compute_two_tree_branch_stat` did not check the error returned by the
summary function (which is the return value of
`compute_two_tree_branch_state_update`).

I caught this because the python callback was setting an exception in
the summary function and the python runtime was complaining about an
exception being set, despite a successful return status. This also means
that failing summary functions would (eventually) be caught, but the
code would continue to run.

The C python tests did not catch this because we would eventually raise
the correct exception.
Provide more precise requirements for `f` and `norm_f` and give some
basic understanding of what these functions are and when normalisation
will be required.

Attempt to fix syntax errors by adding a newline in code blocks.

Clarify output dimensions in the code comments (though this might need
to change since I think we'll remove the `np.moveaxis` call at the end).
Contributor Author

@lkirk lkirk left a comment

@petrelharp Sorry this update took so long.

Updates

Overall, I think I've gotten to every issue that was raised except for the np.swapaxes call at the end of two_locus_count_stat in trees.py. I'll make a separate commit for that; I think it will touch many small things.

Docstring

I made some improvements to the docstring, hopefully this is a bit closer to what you had in mind.

Leak Checking the Python C Code

In my initial memory leak test, I discovered a memory leak which led to discovering a small bug in the branch stats (we weren't propagating the return status of the summary function -- it's fixed now).

The memory leak that I found was caused by performing a transpose operation before calling the summary function, which created a new container for the data, leaving a dangling reference to the old one. It looks like we're leak-free after this fix (I also fixed the branch stat bug). I removed the intermediate transpose operation by wrapping the C data with a "pre-transposed" numpy array. I create a new array with PyArray_New so that I can provide the correct dimensions and strides, instead of relying on PyArray_SimpleNewFromData, which doesn't properly set the strides if the dimensions are swapped.

I'll include some selected results from leak checking here; a more thorough write-up can be found in the attached doc, two-locus-general-leak-check.pdf, along with the script I used, leak-check.

The following plots show resident set size for the process (rss) and the size of the python heap (heap) for sampled time points. I'm running two_locus_count_stat 10,000 times in a for-loop, recording heap and rss as time progresses (using memray). The document I attached also contains a couple of examples where I induce a memory leak in two_locus_count_stat to show that we can indeed detect large and small memory leaks with this method.

Site Mode

I checked for memory leaks in site mode with trees that contained biallelic and multiallelic sites. I also produced an error in the summary function to ensure that we're cleaning up properly on error.

[Plot: rss and heap size over 10,000 iterations, site mode]

Branch Mode

I checked for memory leaks in branch mode with a much smaller tree; I also included a test case that triggers the same summary function error.

[Plot: rss and heap size over 10,000 iterations, branch mode]

ret = (PyObject *) result_matrix;
result_matrix = NULL;
out:
Py_XDECREF(summary_func);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a pass for memory leaks and found one, I've revised the summary and norm function methods above so that they don't perform intermediate transposes (which were leaking memory -- I believe they're allocating memory for new dimensions, not the underlying data).

I performed some leak checking with/without Py_XDECREF(norm_func) and observed a leak. Fixed in my recent update.

I'll make a comment on the main thread about memory leak checking.

Comment on lines +10953 to +10958

.. note::
Because we are operating on very small matrices/vectors, vectorised
operations are often times slower than operations on scalars. Simply
returning ``[value]`` can be faster than returning
``value[np.newaxis,]`` or ``np.expand_dims(value, 0)``.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this note is better suited for long-form documentation instead of in the docstring -- I've removed it. I'm mostly trying to guide people to "best practices" that I've observed while testing this code.

There's two concerns wrapped up into one here:

  1. Vectorized numpy array operations in the hot loop on very small vectors/matrices are more expensive than the equivalent scalar operations. The examples in test_ld_matrix.py mix these a bit for the sake of readability, but it can be good to know where the potential bottlenecks in summary functions will be. Perhaps this is a better fit for the documentation.
  2. The summary functions must return a 1D array with length=result_dim; the snippets above are just different ways of constructing that output array. It turns out that the best approach is to construct the output array by wrapping the value in a python list (i.e. [value]), rather than using np.expand_dims(value, 0).
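A small sketch of the two return styles (with a made-up value): both produce the required 1D array of length result_dim, the list version is simply cheaper to construct.

```python
import numpy as np

value = 0.25  # a hypothetical per-sample-set statistic (result_dim == 1)

out_list = np.asarray([value])      # wrap the scalar in a python list
out_expand = np.expand_dims(value, 0)  # insert an explicit leading axis
```

Both results compare equal; the difference only matters inside the per-pair hot loop.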

Comment on lines +10960 to +10962
What follows is an example of computing ``D`` from a tree sequence. Many
more examples can be found in the test suite
``test_ld_matrix.py::GeneralStatsFuncs``. Let's begin with our summary

I removed the reference to GeneralStatsFuncs. I do think people should look at the GeneralStatsFuncs class to get an idea of how to implement summary functions; they're tested and implement all of the C summary functions. I suppose I can address this directly in the documentation that will accompany this code.

will be called. The default normalisation function is identical to
``total_norm`` shown in the example below. ``hap_norm`` is required for
normalising :math:`r^2`. Both of these examples return a numpy array
with length equal to the number of sample sets (for one-way stats).
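A minimal sketch of the default normalisation described above; the function name comes from the docstring example, but the signature here is an assumption (the real API may differ). The weight is the $1/(n_A n_B)$ arithmetic mean over allele pairs from the PR description.

```python
def total_norm(n_alleles_a, n_alleles_b):
    # Equal weight for every (allele_a, allele_b) pair: 1 / (n_A * n_B).
    # The parameter names are illustrative, not the real API.
    return 1.0 / (n_alleles_a * n_alleles_b)
```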

Good point. I've written a more general description that relies on the mapping from sample_set->stat->output dim instead of referring to one-way stats, which aren't a good tool for explaining what I mean here.

I've made a change that expresses this more directly, I think.

Comment on lines +10951 to +10952
matrix whose rows contain haplotype counts per sample set (counts of AB,
Ab, aB) and ``n`` is a vector of sample set sizes.

I typically call these "haplotype counts", with A/B as the derived and a/b as the ancestral alleles. That said, the ancestral/derived distinction doesn't matter in the case of biallelic sites or branches (branches are biallelic as implemented -- a single node bifurcates the samples into "ancestral" and "derived", or vice versa).

The reason that this distinction doesn't matter too much when we're only considering 2 alleles is that LD statistics are symmetric. Swapping ancestral/derived states doesn't change the outcome (I recall testing this a while ago when I was working on normalisation).

That's all to say that A/B and a/b really just represent 2 states (or two alleles). When combined with n (the sample set size), haplotype counts AB, Ab, and aB comprise all of the information we can gather about the linkage of two sites/trees. I'm not exactly sure how much of this to spell out in #3416 or in this docstring.
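To make the haplotype-count convention concrete, here is a small worked sketch with made-up numbers: a single sample set of n = 10, with 6 AB, 2 Ab, and 1 aB haplotypes (and hence 1 ab). This follows the D computation from the PR description.

```python
import numpy as np

X = np.array([[6.0, 2.0, 1.0]])  # rows: sample sets; cols: AB, Ab, aB counts
n = np.array([10.0])             # sample set sizes

pAB, pAb, paB = (X / n[:, np.newaxis]).T  # haplotype frequencies
pA = pAb + pAB  # frequency of allele A at the first locus
pB = paB + pAB  # frequency of allele B at the second locus
D = pAB - pA * pB
```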

pB = paB + pAB
return pAB - (pA * pB)

``norm_f`` is a normalisation function used to combine all computed

I changed the language to be more precise.

"""
Compute two-locus statistics with a user-defined python function that
operates on haplotype counts. TODO: reference modes in two-locus docs.
On each pair of sites or trees, the summary function is provided with

I clarified the requirements for summary functions and norm functions in my recent change.

@petrelharp
Contributor

Let me know when you're done here, and I'll take another pass through! Thanks for the careful attention.
