[ENH] V1 → V2 API Migration - datasets#1608
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
===========================================
+ Coverage   52.73%   80.23%   +27.49%
===========================================
  Files          37       61       +24
  Lines        4399     5369      +970
===========================================
+ Hits         2320     4308     +1988
+ Misses       2079     1061     -1018
```

☔ View full report in Codecov by Sentry.
FYI @geetu040 Currently the
Issues:
Example (current):

```python
def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    ...
    return _features
```

Or by updating the Dataset class to use the underlying interface method from api_context directly:

```python
def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)
```

Another option is to add
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame: ...
```
can we not have the same signature for all 3 methods: DatasetsAPI.list, DatasetsV1.list, DatasetsV2.list? does it raise pre-commit failures since a few might not be used?
Oh, that v2 signature was experimental; I don't know how pre-commit did not catch that. Will make them the same.
Is mypy supposed to catch that?
yes, unused parameters are caught under ARG001, as seen with the cache_directory params.
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame:
```
you can simplify this using private helper methods
@geetu040 I could not find a good common block to extract into a helper, since the filters are passed via the URL in v1 and via a JSON object in v2, and both have different parsing. If you have any specific idea on that, please let me know.
looking at DatasetsV1 I can think of these helper methods: _build_list_call, _parse_and_validate_xml, _parse_dataset
you can do something similar for DatasetsV2 though they can be different.
> you can do something similar for DatasetsV2 though they can be different.

I see, that opens more options.
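To make the suggestion concrete, here is a hedged sketch of how `DatasetsV1.list` could be split along the proposed helpers `_build_list_call`, `_parse_and_validate_xml`, and `_parse_dataset`. The class name, XML shape, and the fact that the response XML is passed in directly (instead of fetched over HTTP) are all simplifications for illustration, not the real resource code.

```python
# Sketch: splitting a v1-style list() into the three suggested helpers.
import xml.etree.ElementTree as ET

NS = "{http://openml.org/openml}"  # assumed OpenML XML namespace


class DatasetsV1Sketch:
    def _build_list_call(self, limit: int, offset: int) -> str:
        # v1 passes filters in the URL path
        return f"data/list/limit/{limit}/offset/{offset}"

    def _parse_and_validate_xml(self, xml_string: str) -> ET.Element:
        root = ET.fromstring(xml_string)
        if not root.tag.endswith("data"):
            raise ValueError("Unexpected XML root; expected <oml:data>")
        return root

    def _parse_dataset(self, node: ET.Element) -> dict:
        # convert one <oml:dataset> node into a flat record
        return {
            "did": int(node.find(f"{NS}did").text),
            "name": node.find(f"{NS}name").text,
        }

    def list(self, limit: int, offset: int, xml_string: str) -> list[dict]:
        # in the real resource, _build_list_call's result would be sent over HTTP;
        # here the response XML is injected to keep the sketch runnable
        _url = self._build_list_call(limit, offset)
        root = self._parse_and_validate_xml(xml_string)
        return [self._parse_dataset(n) for n in root]
```

A `DatasetsV2` counterpart could mirror the same three-step shape (`_build_list_call` producing a JSON body, a `_parse_and_validate_json`, and a `_parse_dataset`) even though the internals differ.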
openml/_api/resources/datasets.py
Outdated
```python
    bool
        True if the deletion was successful. False otherwise.
    """
    return openml.utils._delete_entity("data", dataset_id)
```
if you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.
Makes sense. It would look like a DELETE request from the client, along with exception handling.
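A hedged sketch of what that could look like; the class, the `http` attribute, and the `delete(path)` client method are hypothetical names standing in for the backend's HTTP client, not the real openml-python API.

```python
# Sketch: resource-level delete via the backend's HTTP client, replacing
# the delegation to openml.utils._delete_entity.
from typing import Any


class DatasetsResourceSketch:
    def __init__(self, http_client: Any) -> None:
        self.http = http_client  # assumed to expose .delete(path) -> response

    def delete(self, dataset_id: int) -> bool:
        try:
            response = self.http.delete(f"data/{dataset_id}")
        except ConnectionError:
            # a real implementation would translate OpenML server errors here
            return False
        return response.status_code == 200
```

The exception handling shown is deliberately minimal; the real implementation would presumably raise the library's own server-error types rather than swallow them.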
```python
def list(
    self,
    limit: int,
    offset: int,
    **kwargs: Any,
) -> pd.DataFrame:
```
same as above, it can use private helper methods
openml/datasets/functions.py
Outdated
```python
    # Minimalistic check if the XML is useful
    if "oml:data_qualities_list" not in qualities:
        raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
    from openml._api import api_context
```
can't we have this import at the very top? does it create a circular import error? if not, it should be moved to the top in all functions.
It does raise a circular import error.
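For context, this is the standard workaround: a function-level import is only resolved at call time, when both modules have finished initialising. A self-contained demonstration with two synthetic modules (`mod_a`, `mod_b` are made up for this sketch, not openml's):

```python
# Demo: a deferred (function-level) import breaks an import cycle,
# because the import runs at call time instead of at module init time.
import sys
import tempfile
import textwrap
from pathlib import Path

pkg = Path(tempfile.mkdtemp())
(pkg / "mod_a.py").write_text(textwrap.dedent("""
    import mod_b  # top-level import: runs while mod_a is still initialising

    def greet():
        return "a->" + mod_b.name()
"""))
(pkg / "mod_b.py").write_text(textwrap.dedent("""
    def name():
        import mod_a  # deferred: by call time, mod_a is fully initialised
        return "b(" + mod_a.__name__ + ")"
"""))
sys.path.insert(0, str(pkg))

import mod_a
print(mod_a.greet())  # -> a->b(mod_a)
```

If `mod_b` instead did `from mod_a import greet` at the top level, the import of `mod_a` would fail partway through, which is presumably what happens with a top-level `from openml._api import api_context` in functions.py.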
Thanks for the detailed explanation, I now have a good understanding of the download mechanism.
- minio can be handled easily, we will use a separate client along with
- these are actually different objects in both APIs: v1 uses XML and v2 keeps them in JSON
- yes, you are right, they are the same files, which do not need to be downloaded again for both versions, but isn't this true for almost all the HTTP objects? they may have different formats
- I don't understand this point
- agreed, should be handled by
- agreed, adding

In conclusion, I may ask: if we ignore the fact that it downloads the
@geetu040 making a new client for

FYI the new commit adds better handling of features and qualities in the OpenMLDataset class, moving the v1-specific parsing logic to the interface. So the only part left is to handle
From the standup discussion and earlier conversations, I think we can agree on a few points:
Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way. I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs.
The points do make sense to me. I will propose the design along with how it would be used in the resource.
@geetu040 I have a design implemented which needs review
Question:
I have taken a quick look, the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But this in general fixes all our blockers without damaging the original design.
Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions; we can completely remove these lines and always rely on the
This makes sense now. Having an independent download method, as I have set up, is better than updating requests/caching to return a path, right?
Yes, that would work, but the function definitions would change, i.e. the tests etc. corresponding to them.
I am not sure about that, would require a detailed review
yes, that is expected
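The "independent download method" idea above can be sketched as follows; the class and the injected `fetch` callable are illustrative stand-ins (the real client would issue an HTTP GET), not the actual openml-python client API.

```python
# Sketch: an HTTP-client-level download method that owns caching and
# returns a Path, so resources and functions never juggle raw responses.
import tempfile
from pathlib import Path
from typing import Callable


class FileDownloader:
    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch  # stands in for an HTTP GET returning the body

    def download_file(self, path: str) -> Path:
        target = self.cache_dir / path
        if target.exists():  # cache hit: skip the network entirely
            return target
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(self.fetch(path))
        return target
```

Because the cache lives inside the client, callers like `download_features_file` reduce to building a path and returning `self.download_file(path)`, which matches the snippet below.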
```python
        path = f"data/features/{dataset_id}"
        return self.download_file(path)

    def download_qualities_file(self, dataset_id: int) -> Path:
```
May I know why this function (and similar functions), which are used only in DatasetsV1 and V2 and not called inside functions.py, are not private? It becomes hard to distinguish what is actually being called by the SDK and what is being abstracted away.
It is called in functions.py; I haven't updated it yet, I was still working on it.
openml/_api/http/client.py
Outdated
```python
        parsed_url = urllib.parse.urlparse(source)

        # expect path format: /BUCKET/path/to/file.ext
        _, bucket, *prefixes, _file = parsed_url.path.split("/")
```
_file should be renamed to _ given it is never used; surprised ruff does not call it out.
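As an aside, ruff's unused-variable rule (F841) by default exempts names matching its dummy-variable regex, which likely covers `_file`, so the rename to `_` mainly makes the intent explicit. A runnable sketch of the unpacking above (the function name is illustrative, not the real client code):

```python
# Sketch: split a MinIO-style URL path into bucket and prefixes,
# discarding the empty leading segment and the file name via `_`.
import urllib.parse


def split_minio_path(source: str) -> tuple[str, list[str]]:
    parsed_url = urllib.parse.urlparse(source)
    # expect path format: /BUCKET/path/to/file.ext
    _, bucket, *prefixes, _ = parsed_url.path.split("/")
    return bucket, prefixes
```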
openml/datasets/functions.py
Outdated
```diff
     Parameters
     ----------
-    data_id : list, optional
+    dataset_id : list, optional
```
openml/datasets/functions.py
Outdated
```diff
 def _get_dataset_parquet(
     description: dict | OpenMLDataset,
-    cache_directory: Path | None = None,
+    cache_directory: Path | None = None,  # noqa: ARG001
```
Why the ARG001 addition? I assume ruff checks were passing before this was added too.
Edit 1: Just saw that this is not being used, why not remove it then?
Edit 2: This is happening in more than one place; I won't mention it in all of them, so we can just discuss it here.
Still under work. This method is now updated to not use the param; I'm working on removing it here and will update the corresponding test when starting to write tests for this PR, as mentioned in the meeting.
```python
    return api_context.backend.datasets.download_dataset_parquet(description, download_all_files)


def _get_dataset_arff(
```
@geetu040 This function is not being used but is still present, because it is used in tests. Should the test involving this be removed, or should this function be kept? (The same scenario exists in the tasks migration too.)
Edit: I see this happening in multiple places (_get_dataset_features_file and _get_dataset_qualities_file)
```python
    else:
        raise TypeError("`description` should be either OpenMLDataset or Dict.")

    save_cache_directory = (
```
I assume cache paths for V1 and V2 are different, hence this code is being removed?
The caching path is removed from here since it is now handled in the HTTP client, not due to different paths.
```python
    bool
        True if the deletion was successful. False otherwise.
    """
    return openml.utils._delete_entity("data", dataset_id)
```
@geetu040 @JATAYU000 I remember being on a call and discussing this; didn't we decide to implement this in the base PR (core) itself and leave it out of our migration implementations? If not, I will need to change tasks.
Yes, it was; since it is not yet added there, I haven't removed this yet.
This reverts commit 93155ee.
geetu040 left a comment
update with #1576 (comment)
@geetu040 There is one issue which is still not addressed in the v2 status update.
I thought we were going to use that temporary fix for this PR, since this happens only for a very particular case.
But you mentioned earlier that this is not an optimal fix and we should think of a better solution?
@geetu040 Also, why are tests on Windows not using Local (OPENML_USE_LOCAL_SERVICES)? openml-python/.github/workflows/test.yml Line 179 in 8a5532f
Metadata
Reference Issue: [ENH] V1 → V2 API Migration - datasets #1592
Depends on: [ENH] V1 → V2 API Migration - core structure #1576
Change Log Entry: This PR implements the Datasets resource and refactors its existing functions.