Arm backend: Split and decouple model evaluators #18118
martinlsm wants to merge 1 commit into pytorch:main from
Conversation
Tidy up the code in arm_model_evaluators.py by:
- Making the evaluators no longer overlap. For example, `ImageNetEvaluator` no longer carries out numerical evaluation or checks the file compression ratio of the TOSA file; these evaluations are instead carried out solely by `NumericalEvaluator` and `FileCompressionEvaluator` respectively.
- Renaming `GenericModelEvaluator` to `NumericalModelEvaluator` and making it evaluate only via elementwise numerical comparison between the reference and test model.
- Adding `FileCompressionEvaluator`, which measures the file compression ratio of a TOSA file.

This change makes it easier for a user to deliberately select exactly which measures they want to evaluate for a model.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: Ic98a00409f637264359658eaa17219c86f2520f9
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18118

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Awaiting Approval, 17 New Failures, 1 Cancelled Job as of commit eb06c08 with merge base 48bd687.
- AWAITING APPROVAL - The following workflows need approval before CI can run.
- NEW FAILURES - The following jobs have failed.
- CANCELLED JOB - The following job was cancelled. Please retry.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label ciflow/trunk

@pytorchbot label "partner: arm"

@pytorchbot label "release notes: none"
Pull request overview
This PR refactors arm_model_evaluator.py by splitting the previously coupled evaluator classes into independent, single-responsibility evaluators. The monolithic GenericModelEvaluator (with its subclass ImageNetEvaluator) and the orchestration functions (evaluate_model, evaluator_calibration_data) are replaced by three focused classes behind a common Evaluator base.
Changes:
- Renamed `GenericModelEvaluator` to `NumericalModelEvaluator`, now exclusively computing elementwise numerical error metrics between a reference and test model.
- Decoupled `ImageNetEvaluator` from numerical evaluation; it now only computes top-1/top-5 accuracy and owns its own dataset loading/transforms.
- Added `FileCompressionEvaluator` as a standalone evaluator for TOSA flatbuffer compression ratio.
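The compression-ratio measurement itself is not shown in the diff excerpt below. As a minimal sketch of what such an evaluator could compute, assuming gzip as the codec (the function name is hypothetical and the actual codec used by the PR is not shown here):

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Ratio of original size to gzip-compressed size.

    Hypothetical stand-in for the measure a FileCompressionEvaluator
    might report for a serialized TOSA flatbuffer.
    """
    compressed = gzip.compress(data)
    return len(data) / len(compressed)


# Highly repetitive data compresses well, so the ratio is large;
# a well-compressed flatbuffer would sit much closer to 1.
ratio = compression_ratio(b"\x00" * 4096)
```

A ratio near 1 would indicate the flatbuffer is already dense; a large ratio suggests redundancy in the serialized model.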
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `backends/arm/util/arm_model_evaluator.py` | Splits evaluators into NumericalModelEvaluator, ImageNetEvaluator, and FileCompressionEvaluator with a shared Evaluator base; removes orchestration code. |
| `backends/arm/test/misc/test_model_evaluator.py` | Updates tests to use the new evaluator classes and their simplified APIs. |
```python
        Metrics (lists per output tensor):
          * max_error
          * max_absolute_error
          * max_percentage_error (safe-divided; zero ref elements -> 0%)
          * mean_absolute_error
        """
        if self._eval_dtype is not None:
            eval_inputs = tuple(
                inp.to(self._eval_dtype) for inp in self._example_inputs
            )
        else:
            eval_inputs = self._example_inputs

        ref_outputs, _ = tree_flatten(self._ref_model(*self._example_inputs))
        eval_outputs, _ = tree_flatten(self._eval_model(*eval_inputs))

        metrics = self._get_model_error(ref_outputs, eval_outputs)

        return metrics
```

```python
        seed = default_seed
        rng = random.Random(
            seed
        )  # nosec B311 - deterministic shuffling for evaluation only
        indices = list(range(len(dataset)))
        rng.shuffle(indices)
        selected = sorted(indices[:k])
        return torch.utils.data.DataLoader(
            torch.utils.data.Subset(dataset, selected), batch_size=1, shuffle=False
        )
```

```python
def _load_imagenet_folder(directory: str) -> datasets.ImageFolder:
    """Shared helper to load an ImageNet-layout folder.

    Raises FileNotFoundError for a missing directory early to aid debugging.
    """
    directory_path = Path(directory)
    if not directory_path.exists():
        raise FileNotFoundError(f"Directory: {directory} does not exist.")
    transform = _get_imagenet_224_transforms()
    return datasets.ImageFolder(directory_path, transform=transform)
```

```python
    @staticmethod
    def _get_model_error(ref_outputs, eval_outputs) -> dict[str, Any]:
        metrics = {}

        for ref_output, eval_output in zip(ref_outputs, eval_outputs):
            difference = ref_output - eval_output
            # Avoid divide by zero: elements where ref_output == 0 produce 0% contribution
            percentage_error = torch.where(
                ref_output != 0,
                difference / ref_output * 100,
                torch.zeros_like(difference),
            )

            metrics["max_error"] = torch.max(difference).item()
            metrics["max_absolute_error"] = torch.max(torch.abs(difference)).item()
            metrics["max_percentage_error"] = torch.max(percentage_error).item()
            metrics["mean_absolute_error"] = torch.mean(
                torch.abs(difference).float()
            ).item()
```
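The seeded-shuffle subset selection above can be exercised on its own. A minimal sketch of the same pattern, where the helper name is an assumption and a plain index list stands in for the dataset/Subset/DataLoader plumbing:

```python
import random


def select_calibration_indices(n: int, k: int, seed: int = 0) -> list[int]:
    """Deterministically pick k of n indices: same seed, same selection."""
    rng = random.Random(seed)  # nosec B311 - evaluation only, not cryptographic
    indices = list(range(n))
    rng.shuffle(indices)
    # Sort the chosen indices so dataset access stays in ascending order.
    return sorted(indices[:k])
```

In the actual diff, `torch.utils.data.Subset` then wraps the dataset with the selected indices and a `DataLoader` with `batch_size=1, shuffle=False` iterates them.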
The docstring says "Metrics (lists per output tensor)" but the implementation now stores plain scalars, silently overwriting metrics from earlier outputs when a model produces multiple output tensors. The old code used defaultdict(list) and .append() to accumulate per-output metrics. Either the docstring should be updated to reflect that only the last output's metrics are kept (if that's intentional for single-output models), or the implementation should accumulate metrics across outputs (e.g. by appending to lists or indexing by output position).
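One way to accumulate metrics across multiple outputs, as the reviewer suggests, is the old `defaultdict(list)` pattern. A minimal sketch using plain float lists in place of flattened tensors (the function body is an illustration of the suggestion, not the PR's code):

```python
from collections import defaultdict


def get_model_error(ref_outputs, eval_outputs):
    """Append one metric entry per output tensor instead of overwriting.

    ref_outputs / eval_outputs are lists of flat float lists here,
    standing in for the flattened torch tensors in the real code.
    """
    metrics = defaultdict(list)
    for ref, test in zip(ref_outputs, eval_outputs):
        diffs = [r - t for r, t in zip(ref, test)]
        abs_diffs = [abs(d) for d in diffs]
        # Safe-divide: zero reference elements contribute 0%
        pct = [d / r * 100 if r != 0 else 0.0 for d, r in zip(diffs, ref)]
        metrics["max_error"].append(max(diffs))
        metrics["max_absolute_error"].append(max(abs_diffs))
        metrics["max_percentage_error"].append(max(pct))
        metrics["mean_absolute_error"].append(sum(abs_diffs) / len(abs_diffs))
    return dict(metrics)
```

With two output tensors, each metric becomes a two-element list indexed by output position, which also matches the "lists per output tensor" wording in the docstring.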
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell