Dataset download resolutions #206

Open
ethanglaser wants to merge 8 commits into IntelPython:main from ethanglaser:dev/eglaser-dataset-download-fixes
@ethanglaser
Contributor

@ethanglaser ethanglaser commented Feb 13, 2026

Description

Addresses timeouts and problematic dataset downloads in CI jobs by:

  1. Replacing sklearn.datasets.fetch_openml with direct use of the openml package
  2. Removing the epsilon dataset
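
For illustration only, a minimal sketch of what loading a dataset through the openml package directly (rather than through sklearn.datasets.fetch_openml) might look like. This is not the code from this PR: the helper name `load_openml_dataset` and the `get_dataset` injection parameter are hypothetical, and the sketch assumes the openml package's `openml.datasets.get_dataset` and `OpenMLDataset.get_data` calls.

```python
def load_openml_dataset(data_id, get_dataset=None):
    """Return (X, y) for an OpenML dataset ID using the openml package.

    `get_dataset` is a hypothetical injection point so the helper can be
    exercised without network access; by default it falls back to
    openml.datasets.get_dataset.
    """
    if get_dataset is None:
        import openml  # imported lazily; requires the openml package

        get_dataset = openml.datasets.get_dataset

    dataset = get_dataset(data_id)
    # get_data returns (X, y, categorical_indicator, attribute_names);
    # only the features and the target are kept here.
    X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)
    return X, y
```

Calling `load_openml_dataset(some_id)` with a real OpenML dataset ID would download (and cache) the data through openml itself, so timeouts and retries are governed by that package rather than by scikit-learn's fetcher.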

Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

@ethanglaser
Contributor Author

/azp run CI

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@ethanglaser ethanglaser marked this pull request as ready for review March 24, 2026 16:11
@ethanglaser ethanglaser changed the title from "Dev/eglaser dataset download fixes" to "Dataset download resolutions" Mar 24, 2026
@david-cortes-intel
Contributor

@ethanglaser The CI error:

ModuleNotFoundError: No module named 'openml'

You probably need to add it to this file:
https://github.com/IntelPython/scikit-learn_bench/blob/main/envs/conda-env-sklearn.yml

@david-cortes-intel
Contributor

The CI error:

INFO - sklbench - Report summary
Empty DataFrame
Columns: []
Index: [ElasticNet|fit, ElasticNet|predict, KMeans|fit, KMeans|predict, KMeans|transform, KNeighborsClassifier|fit, KNeighborsClassifier|predict, KNeighborsClassifier|predict_proba, PCA|fit, PCA|transform]

Not sure what's causing it though.

@david-cortes-intel
Contributor

Actually this is the error:

ImportError: cannot import name '_check_multi_class' from 'sklearn.linear_model._logistic' (/usr/share/miniconda/envs/bench-env/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py)

But again, I'm not sure what the issue is. I see it's using sklearn 1.8 with sklearnex 2025.11, which should be compatible.

@david-cortes-intel
Contributor

david-cortes-intel commented Mar 25, 2026

Digging a bit into it, I think the issue is with this particular line in sklbench:

module = importlib.__import__(module_name, globals(), locals(), [], 0)

It appears to be importing submodules that estimators use into global variables.

Since daal4py defines a module named sklearn, that module gets imported as "sklearn". Later, when it is time to import things from scikit-learn, which shares the same module name, the two clash, because functions from one are looked up in the other.
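
For reference, a small sketch of how the quoted `importlib.__import__` call behaves when given a dotted name. It uses the standard-library package `xml.dom` purely as a stand-in (an assumption; these are not the module names sklbench passes), and it does not confirm the diagnosis above. It only shows that, with an empty `fromlist`, the call returns the top-level package rather than the named submodule, unlike `importlib.import_module`.

```python
import importlib

# Stand-in dotted module name from the standard library (hypothetical example;
# sklbench passes its own library module names here).
module_name = "xml.dom"

# Same call shape as the quoted line: empty fromlist, absolute import (level 0).
top_level = importlib.__import__(module_name, globals(), locals(), [], 0)

# import_module returns the module that was actually named.
exact = importlib.import_module(module_name)

print(top_level.__name__)  # xml
print(exact.__name__)      # xml.dom
```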

Apparently the only use for that function is to determine if a particular library contains a given estimator or function by name:

if estimator_name not in classes_map:

if function_name not in functions_map:

Perhaps a quick (but very inefficient) fix could be to do those checks in a subprocess or in a forked process. Actually, it's also used to return the estimator class, so it cannot be moved into a separate process.
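
A rough sketch of the subprocess idea, to make the limitation concrete. The helper name `has_attribute_in_subprocess` is hypothetical and the example uses standard-library modules only; it is not taken from sklbench.

```python
import subprocess
import sys


def has_attribute_in_subprocess(module_name: str, attr_name: str) -> bool:
    """Check in a child interpreter whether `module_name` exposes `attr_name`.

    The import happens in the child process, so nothing is added to the
    parent's sys.modules or globals.
    """
    code = (
        "import importlib, sys\n"
        "module = importlib.import_module(sys.argv[1])\n"
        "sys.exit(0 if hasattr(module, sys.argv[2]) else 1)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, module_name, attr_name],
        capture_output=True,
    )
    # A non-zero exit code covers both "attribute missing" and "import failed".
    return result.returncode == 0
```

Only a boolean crosses the process boundary here, and each check pays for a fresh interpreter start. The class object itself would still have to be imported in the parent process, which is why this approach does not cover the case where the estimator class is returned.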

Contributor

@david-cortes-intel david-cortes-intel left a comment

LGTM. The issue with the failing jobs is not related to these fixes and could be left for a later PR.
