Conversation
Codecov Report
❌ Patch coverage: additional details and impacted files below.
@@ Coverage Diff @@
## main #3426 +/- ##
==========================================
+ Coverage 91.92% 91.96% +0.03%
==========================================
Files 37 37
Lines 32153 32354 +201
Branches 5143 5146 +3
==========================================
+ Hits 29556 29753 +197
- Misses 2264 2270 +6
+ Partials 333 331 -2
Thank you for opening this, @lkirk. I'm excited to see this implemented! Do we need to do any testing to demonstrate that there are no memory leaks or anything like that, which wouldn't be included in the test suite? I know it could be found in the tests, but it may be helpful to spell out here how the API would work for a two- or more-way stat? Would it be:
Yes, this needs leak checking and documentation, which I can add. I mostly wanted to make sure the user interface made sense first.
Hi, @lkirk! This looks pretty straightforward. To have a careful opinion about the API, I think I need to see a reasonably careful docstring? For instance, what exactly are the arguments to
Hi @petrelharp, thanks for taking a look. I didn't offer much of a description before; does this help?

Summary Function

In the
The
I've been writing summary functions with the signature
Where X is a matrix whose rows correspond to sample sets and columns correspond to haplotype counts (
Why lay the data out this way? Because numpy is row-major, iterating over the arrays gives us rows. That makes it easy to select the haplotype counts in one line:
Note: under the hood, the data is still laid out in an optimal way (at least as far as I can tell -- see this note).
In addition, the sample set sizes array is shaped so that we can use it to normalise the counts in one go (
Finally, since we're no longer controlling the summary functions ourselves, the user has control over polarisation (see

Normalisation

Perhaps the most clunky thing about this API is that the user will need a normalisation function for multiallelic data (the norm_f parameter). Its function signature is the same as the code internal to
The default normalisation function of
I would have loved to be able to pass
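The row-per-sample-set layout described above can be sketched in plain numpy. This is a hypothetical summary function (the name and exact calling convention are my assumptions, not the PR's final API) that computes ``D`` for each sample set from a haplotype-count matrix ``X`` with columns ordered ``[AB, Ab, aB]`` and a vector ``n`` of sample set sizes:

```python
import numpy as np

def D_summary(X, n):
    # X: (num_sample_sets, 3) haplotype counts per row, ordered [AB, Ab, aB].
    # n: vector of sample set sizes.
    out = np.zeros(len(X))
    for k, row in enumerate(X):
        pAB, pAb, paB = row / n[k]  # counts -> frequencies in one line
        pA = pAB + pAb
        pB = pAB + paB
        out[k] = pAB - pA * pB
    return out
```

Because ``X`` is row-major, iterating over it yields one sample set's counts per step, matching the data-locality argument above.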
I've looked at the code pretty carefully, and things look good besides the comments. Nice work!
Hi @petrelharp, I appreciate you taking a look. I've responded to everything with how I plan to resolve each comment. I did have one clarifying question here. I will clean this up tomorrow.
Hi, @lkirk - I went ahead and made a few adjustments to how the examples are chosen for testing; if you see something wrong or disagree, feel free to revert. Rationale:
Hm, I also changed the tests that pulled out "all_fields" to explicitly call that function. Now I see that we have a fixture defined for that one. Do you mind removing that function call and using the fixture instead? So that just means doing
and then using
I'd do this, but I don't want to risk making conflicting edits.
I'm happy to look over this at some stage - please ping me when you'd like input!
Ah, just noticed your expanded description. So, to answer your question: the API looks good. I may have some other comments now, but will wait for you to ping me when it's ready to read carefully (including the docstring).
Thanks @petrelharp, glad to know it seems reasonable. I'm currently running down a bug that appears at first glance to be macOS-specific (it shook out with my updates). I'm hoping to be done with this in the next few hours; I'll summarize the changes and ping you when ready. For now, you might see a few more spurious pushes until I figure out what's going on. I don't have any way to test this locally on a mac.
python/tskit/trees.py (Outdated)

```python
return result.reshape(result.shape[:2])
# Orient the data so that the first dimension is the sample set so that
# we get one LD matrix per sample set.
return result.swapaxes(0, 2).swapaxes(1, 2)
```
I'm confused about this: since the summary function takes in all the sample sets at once, it doesn't have any notion of "one LD matrix per sample set"?
(and if this is necessary, note that np.moveaxis would do this in a single operation)
For data locality when computing stats over sample sets, we want the sample set to be the last dimension. However, for usability, @apragsdale and I thought it'd be better to have the sample set as the first dimension in the output. This makes it possible to access individual LD matrices with result[0], etc., compared to result[:,:,0]. So, moving the axis will be necessary. I hadn't seen np.moveaxis before; the LD matrix code also uses swapaxes twice and would benefit from this change as well.
In summary, we now move (m, m, k) to (k, m, m) with np.moveaxis, where m is the number of sites/trees and k is the number of sample sets (output dimensions).
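The axis shuffle described here is easy to check on a toy array. This sketch (shapes chosen arbitrarily) confirms that the two swapaxes calls and a single np.moveaxis produce the same (k, m, m) result:

```python
import numpy as np

# Toy (m, m, k) result: m site/tree pairs per axis, k sample sets.
m, k = 4, 3
result = np.arange(m * m * k).reshape(m, m, k)

via_swap = result.swapaxes(0, 2).swapaxes(1, 2)  # two-step shuffle
via_move = np.moveaxis(result, -1, 0)            # single-call equivalent

assert via_move.shape == (k, m, m)
assert np.array_equal(via_swap, via_move)
```

Both produce a view (no copy), so the choice is about readability rather than performance.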
But now that I look more closely, say, at diversity
```python
In [6]: ts.diversity(sample_sets=[A, B], windows="sites", mode="site").shape
Out[6]: (73, 2)
```
I realize that this decision was a deviation from other stats, where sample set is the last dimension.
ah. we definitely want this to match the other stats. I had not done that double-checking, but had assumed you had.
This was a conscious decision with @apragsdale, we thought it was a bit more convenient to access single LD matrices with result[0] instead of result[:,:,0], but it sounds like we should change to the latter (last dimension is sample set). I've chatted with @apragsdale about this and he sees the logic in this change. Since we use the C statistics to test the general stats, if I make the change for our general stat, I'll need to do it across the board. I think it'll only affect a few tests that explicitly test output dimensions.
This does sound like a pain, and given you've thought about this hard, I could be convinced otherwise!
…'t have to add a dimension at the end for scalar operations
… fix norm func for r2_ij
…lematic in practice
@petrelharp, thank you for taking such a close look. I'll summarize what was done since your last review.

Testing

The tests have been cleaned up substantially. Your comment about pytest fixtures gave me the idea to put a few tree sequences in their own fixtures, where I assert the properties directly in the fixture itself. I've moved away from hard-coding the number of expected dimensions in the output (python C tests). I wasn't intending on keeping the platform code in the test, but I wanted a single clean commit showing exactly the (bug?) I was running into. My understanding is that most PRNGs are portable across architectures, so it was rather surprising to me that the same exact
In
The python C test coverage seems to be at a reasonable level at this point.

Dimensions

I've re-reworked some of the dimension handling code for normalization. The changes you observed before were an attempt to remove the need for

Documentation

I've drafted up a docstring for this method, but I'll want to reference a few sections defined in #3416. I can rebase this branch from that one, just to get the definitions I need. If I do that, we'll want #3416 merged before this PR.
@jeromekelleher I think this is ready for a look. I don't anticipate any more changes to the CPython layer; we're still discussing a couple of points, but I think your feedback would be useful at this point.
python/tskit/trees.py (Outdated)

```python
"""
Compute two-locus statistics with a user-defined python function that
operates on haplotype counts. TODO: reference modes in two-locus docs.
On each pair of sites or trees, the summary function is provided with
```
I think you need to be more specific about the summary function: not just "is provided with" but say what, exactly, is the signature. (e.g., "must accept two arguments: the first, X, is ...")
I clarified the requirements for summary functions and norm functions in my recent change.
python/tskit/trees.py (Outdated)

```
matrix whose rows contain haplotype counts per sample set (counts of AB,
Ab, aB) and ``n`` is a vector of sample set sizes.
```
Say what AB, Ab, and aB are? Or perhaps we can refer to #3416 here?
hm, #3416 doesn't currently explain what these are
I typically call these "Haplotype Counts" with A/B as the derived and a/b as the ancestral. Though the ancestral/derived states don't matter in the cases of biallelic sites and branches (biallelic as implemented -- a single node bifurcates "ancestral" and "derived" or vice versa).
The reason that this distinction doesn't matter too much when we're only considering 2 alleles is that LD statistics are symmetric. Swapping ancestral/derived states doesn't change the outcome (I recall testing this a while ago when I was working on normalisation).
That's all to say that A/B and a/b really just represent 2 states (or two alleles). When combined with n (the sample set size), haplotype counts AB, Ab, and aB comprise all of the information we can gather about the linkage of two sites/trees. I'm not exactly sure how much of this to spell out in #3416 or in this docstring.
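The symmetry claim above can be verified numerically. This sketch (with made-up counts, and D written as the usual covariance-style statistic from the four haplotype counts) shows that relabelling ancestral/derived at one locus flips the sign of D, leaving D squared, and hence r squared, unchanged:

```python
import numpy as np

def D_from_counts(nAB, nAb, naB, nab):
    # The usual covariance-style D from the four haplotype counts.
    n = nAB + nAb + naB + nab
    return (nAB * nab - nAb * naB) / n**2

d = D_from_counts(10, 3, 4, 8)
# Relabel ancestral/derived at the first locus (A <-> a): AB/Ab swap
# roles with aB/ab, so the counts permute accordingly.
d_swapped = D_from_counts(4, 8, 10, 3)

assert np.isclose(d_swapped, -d)       # D itself flips sign...
assert np.isclose(d_swapped**2, d**2)  # ...so D^2 (and r^2) are unchanged.
```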
python/tskit/trees.py (Outdated)

```
.. note::
    Because we are operating on very small matrices/vectors, vectorised
    operations are often times slower than operations on scalars. Simply
    returning ``[value]`` can be faster than returning
    ``value[np.newaxis,]`` or ``np.expand_dims(value, 0)``.
```
I don't understand the significance of this? Perhaps be a bit more explicit, if it's a big deal, or omit this?
(what I don't get is why you'd be returning these things)
Perhaps this note is better suited for long-form documentation instead of in the docstring -- I've removed it. I'm mostly trying to guide people to "best practices" that I've observed while testing this code.
There are two concerns wrapped up into one here:

- Vectorized numpy array operations in the hot loop on very small vectors/matrices are more expensive than performing scalar operations. The examples in test_ld_matrix.py mix these a bit for the sake of readability. But it can be good to know where potential bottlenecks in summary functions will be. Perhaps this is a better fit for the documentation.
- The summary functions must return a 1D array with length=result_dim; these are just different ways of constructing the output array. It turns out that the best way is to simply construct the output array by wrapping it in a python list (i.e. [value]), over using np.expand_dims(value, 0).
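For what it's worth, the three constructions mentioned are interchangeable in output; a quick sketch (the scalar value is arbitrary) shows they all yield the same 1D, length-1 array:

```python
import numpy as np

value = 0.25  # a scalar statistic for a single output dimension

# All three produce the same 1-D, length-1 output array; per the note,
# the plain-list wrapping tends to be cheapest for tiny outputs.
a = np.asarray([value])
b = np.expand_dims(np.asarray(value), 0)
c = np.asarray(value)[np.newaxis]

assert a.shape == b.shape == c.shape == (1,)
assert np.array_equal(a, b) and np.array_equal(b, c)
```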
python/tskit/trees.py (Outdated)

```
What follows is an example of computing ``D`` from a tree sequence. Many
more examples can be found in the test suite
``test_ld_matrix.py::GeneralStatsFuncs``. Let's begin with our summary
```
Perhaps link to the ld_matrix docs for D? And good idea, but I don't think we want to refer to the test suite in the docstring: if you'd like to leave this pointer in, perhaps put a comment in the function body?
I removed the reference to GeneralStatsFuncs. I do think people should look at the GeneralStatsFuncs class to get an idea for how to implement summary functions. They're tested and implement all of the C summary functions. I suppose I can directly address this in the documentation that will accompany this code.
python/tskit/trees.py (Outdated)

```python
pB = paB + pAB
return pAB - (pA * pB)
```

```
``norm_f`` is a normalisation function used to combine all computed
```
I think we need to give the precise requirements for the norm function here.
I changed the language to be more precise.
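As a sketch of the default weighting described elsewhere in the thread (each of the n_A * n_B allele pairs weighted by 1/(n_A * n_B)), with entirely made-up per-allele-pair values, the combined statistic reduces to the arithmetic mean over allele pairs:

```python
import numpy as np

# Hypothetical per-allele-pair summary values for one pair of sites
# with n_A = 3 and n_B = 2 alleles (the numbers are made up).
n_A, n_B = 3, 2
per_pair_stats = np.array([0.1, 0.2, 0.0, 0.3, 0.1, 0.1])

# The default 1/(n_A * n_B) weighting assigns each allele pair the same
# weight, so the combined statistic is just the arithmetic mean.
combined = (per_pair_stats / (n_A * n_B)).sum()

assert np.isclose(combined, per_pair_stats.mean())
```

This is only an illustration of the weighting scheme; the actual norm-function signature is defined by the PR, not reproduced here.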
python/tskit/trees.py (Outdated)

```
will be called. The default normalisation function is identical to
``total_norm`` shown in the example below. ``hap_norm`` is required for
normalising :math:`r^2`. Both of these examples return a numpy array
with length equal to the number of sample sets (for one-way stats).
```
It's not clear what "for one-way stats" means here.
Good point, I've written a more general description that relies on the mapping from sample_set->stat->output dim instead of referring to one-way stats, which aren't a good tool for explaining what I mean here.
I've made a change that expresses this more directly I think.
```python
)
# Orient the data so that the first dimension is the sample set so that
# we get one LD matrix per result dimension
return np.moveaxis(result, -1, 0)
```
maybe we should do it this way also above at this line? (one call to moveaxis instead of two)
and, the comment here is not right yet: says "first dimension is the sample set"
```c
    ret = (PyObject *) result_matrix;
    result_matrix = NULL;
out:
    Py_XDECREF(summary_func);
```
Hm, you're decref'ing summary_func but not norm_func; that seems wrong. Have you checked for memory leaks in this code yet? (I need to look up how to do that...)
I made a pass for memory leaks and found one, I've revised the summary and norm function methods above so that they don't perform intermediate transposes (which were leaking memory -- I believe they're allocating memory for new dimensions, not the underlying data).
I performed some leak checking with/without Py_XDECREF(norm_func) and observed a leak. Fixed in my recent update.
I'll make a comment on the main thread about memory leak checking.
Great, I'll have a careful look. Let's see:
No, this is a known thing. Maybe the default RNG is different on macOS? But, not something we're intending to go after.
Sorry, I'm being slow. What's the context here? I'm not seeing any other potential issues? The python-C is hard stuff to wade through; good job getting that working. I won't pretend I understand the ramifications of all that.
I think the failing docs build is just because you need a linebreak after
Add a test for behavior on empty tree sequences (no samples, no edges, no sites). Add a "no sites" fixture. Include branch stat testing. Tune branch stat test runtime by reducing the size of `ts_100_samp_with_sites_fixture`, now named `ts_10_samp_with_sites_fixture`. Add explicit testing for output dimensions and assert that the norm func is not called on trees with only biallelic sites and in branch mode. Add a GeneralStatNormFuncs class to explicitly document possible normalisation functions and in what situations they will be used. Tune size of tree sequence in `test_general_multi_outputs` so that the test runs in a reasonable amount of time in branch mode.
The transpose operation was creating intermediate data that was not being garbage collected, resulting in a rather obvious memory leak. To mitigate this, I opt to wrap the data in a numpy array that is already transposed. The original data is natively laid out with shape (K,3), by creating a numpy array with shape (3,K) and strides (8,8*K), we can avoid an intermediate transpose operation altogether. After leak-checking again, the memory leak is gone. I also add a Py_XDECREF to remove the reference to `norm_func`, though in my leak checking I don't actually see any difference in RSS or heap size. Finally, mark a few more arrays as read-only, since the C functions that accept them as input annotate these arrays as `const`.
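The no-copy trick described in this commit has a pure-numpy analogue via as_strided. This sketch (toy buffer, float64 itemsize of 8; the exact strides in the C code depend on its native layout) shows a transposed view sharing the original memory, which is the property that avoids the leaking intermediate:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

K = 4
buf = np.arange(K * 3, dtype=np.float64).reshape(K, 3)  # native (K, 3) layout

# A transposed *view* over the same buffer: a C-contiguous (K, 3) float64
# array has strides (24, 8), so the (3, K) view uses (8, 24).
view = as_strided(buf, shape=(3, K), strides=(8, 24))

assert np.shares_memory(buf, view)  # no copy, hence no leaked intermediate
assert np.array_equal(view, buf.T)  # same data, transposed
```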
`compute_two_tree_branch_stat` did not check the error returned by the summary function (which is the return value of `compute_two_tree_branch_state_update`). I caught this because the python callback was setting an exception in the summary function and the python runtime was complaining about an exception being set, despite a successful return status. This also means that failing summary functions would (eventually) be caught, but the code would continue to run. The C python tests did not catch this because we would eventually raise the correct exception.
Provide more precise requirements for `f` and `norm_f` and give some basic understanding of what these functions are and when normalisation will be required. Attempt to fix syntax errors by adding a newline in code blocks. Clarify output dimensions in the code comments (though this might need to change since I think we'll remove the `np.moveaxis` call at the end).
lkirk left a comment
@petrelharp Sorry this update took so long.
Updates
Overall, I think I've gotten to every issue that was raised except for the np.swapaxes call at the end of two_locus_count_stat in trees.py. I'll make a separate commit for this, I think this will touch many small things.
Docstring
I made some improvements to the docstring, hopefully this is a bit closer to what you had in mind.
Leak Checking the Python C Code
In my initial memory leak test, I discovered a memory leak which led to discovering a small bug in the branch stats (we weren't propagating the return status of the summary function -- it's fixed now).
The memory leak that I found was caused by performing a transpose operation before calling the summary function, which created a new container for the data, leaving a dangling reference to the old one. It looks like we're leak-free after this fix (I also fixed the branch stat bug). I removed the intermediate transpose operation by wrapping the C data with a "pre-transposed" numpy array. I create a new array with PyArray_New so that I can provide the correct dimensions and stride, instead of relying on PyArray_SimpleNewFromData, which doesn't properly set the stride if the dimensions are swapped.
I'll include some selected results from leak checking here, but a more thorough write-up can be found in this doc, two-locus-general-leak-check.pdf, along with the script I used, leak-check.
The following plots show resident set size for the process (rss) and the size of the python heap (heap) for sampled time points. I'm running two_locus_count_stat 10,000 times in a for-loop, recording heap and rss as time progresses (using memray). The document I attached also contains a couple of examples where I induce a memory leak in two_locus_count_stat to show that we can indeed detect large and small memory leaks with this method.
Site Mode
I checked for memory leaks in the site mode with trees that contained biallelic and multiallelic sites. I also produced an error in the summary function to ensure that we're cleaning up properly on error.
Branch Mode
I checked for memory leaks in the branch mode with a much smaller tree, I also included a test case that contains the same summary function error.
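A minimal sketch of this kind of leak check in pure Python (the real run used memray; here tracemalloc tracks the Python heap and resource reports peak RSS as a coarse, Unix-only proxy, with a stand-in workload instead of two_locus_count_stat):

```python
import resource
import tracemalloc

def leak_check(fn, iters=10_000, sample_every=1_000):
    # Call fn repeatedly, sampling peak RSS (Unix-only, via resource)
    # and the traced Python heap; RSS growing while the heap stays flat
    # suggests a leak below the Python level.
    tracemalloc.start()
    samples = []
    for i in range(iters):
        fn()
        if i % sample_every == 0:
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            heap, _ = tracemalloc.get_traced_memory()
            samples.append((i, rss, heap))
    tracemalloc.stop()
    return samples

# Stand-in workload; the real check called two_locus_count_stat on a
# fixture tree sequence and recorded finer-grained numbers with memray.
samples = leak_check(lambda: [x * x for x in range(100)])
```

Plotting the sampled (rss, heap) pairs over iterations gives the same kind of curves shown in the attached document.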
…ng to ``result_dim`` fixes the docs.
Let me know when you're done here, and I'll take another pass through! Thanks for the careful attention.
Description
This is the last of the required components for the LD matrix methods. I wanted feedback on the API before I add documentation, but this method is complete and tested. The final things to do are to leak check the cpython code and add some documentation.
This feature enables a user to implement their own two-locus count statistic in python, similar to ts.sample_count_stat. User functions take two arguments: the first is a matrix of haplotype counts and the second is a vector of sample set sizes. For instance, this is how we would implement D with this API:
Since this API supports multiallelic sites, the user can also pass a normalisation function to control how the data is normalised across multiple alleles. The normalisation function is only run when computing over multiallelic sites. I've set the default to be $1/(n_A n_B)$, which is simply the arithmetic mean of the alleles in a given pair of sites. This will suffice in the majority of cases (the only outlier is $r^2$, for which there is already a python API). We also support computing statistics between sample sets.
The user would use the above summary function like this:
Where 1 specifies the length of the output array; we always require 1 dimension -- same as the ts.sample_count_stat function.

PR Checklist: