diff --git a/README.md b/README.md index 7c133926..7cdfc78d 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,7 @@ system debug. - ['run-plugins' sub command](#run-plugins-sub-command) - ['gen-plugin-config' sub command](#gen-plugin-config-sub-command) - ['compare-runs' subcommand](#compare-runs-subcommand) + - ['show-redfish-oem-allowable' subcommand](#show-redfish-oem-allowable-subcommand) - ['summary' sub command](#summary-sub-command) - [Configs](#configs) - [Global args](#global-args) @@ -85,7 +86,7 @@ options: --sys-platform STRING Specify system platform (default: None) --plugin-configs [STRING ...] - built-in config names or paths to plugin config JSONs. Available built-in configs: NodeStatus (default: None) + built-in config names or paths to plugin config JSONs. Available built-in configs: AllPlugins, NodeStatus (default: None) --system-config STRING Path to system config json (default: None) --connection-config STRING @@ -116,6 +117,8 @@ node-scraper --sys-name --sys-location REMOTE --connection-config ##### Example: connection_config.json +In-band (SSH) connection: + ```json { "InBandConnectionManager": { @@ -128,6 +131,24 @@ node-scraper --sys-name --sys-location REMOTE --connection-config } ``` +Redfish (BMC) connection for Redfish-only plugins: + +```json +{ + "RedfishConnectionManager": { + "host": "bmc.example.com", + "port": 443, + "username": "admin", + "password": "secret", + "use_https": true, + "verify_ssl": true, + "api_root": "redfish/v1" + } +} +``` + +- `api_root` (optional): Redfish API path (e.g. `redfish/v1`). If omitted, the default `redfish/v1` is used. Override this when your BMC uses a different API version path. + **Notes:** - If using SSH keys, specify `key_filename` instead of `password`. - The remote user must have permissions to run the requested plugins and access required files. If needed, use the `--skip-sudo` argument to skip plugins requiring sudo. 
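The `api_root` note above can be made concrete with a short sketch. This is illustrative only, not node-scraper's actual code: the field names mirror the `RedfishConnectionManager` JSON example, and the exact URL layout is an assumption.

```python
# Illustrative sketch: how the RedfishConnectionManager fields above could
# combine into a Redfish base URL. Field names mirror the JSON example;
# the URL layout is an assumption, not node-scraper's implementation.
def redfish_base_url(host: str, port: int = 443, use_https: bool = True,
                     api_root: str = "redfish/v1") -> str:
    scheme = "https" if use_https else "http"
    # Tolerate api_root written with or without surrounding slashes
    return f"{scheme}://{host}:{port}/{api_root.strip('/')}"

print(redfish_base_url("bmc.example.com"))        # https://bmc.example.com:443/redfish/v1
print(redfish_base_url("10.0.0.5", 8443, False))  # http://10.0.0.5:8443/redfish/v1
```

Overriding `api_root` (e.g. for a BMC with a different API version path) only changes the trailing path segment.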
@@ -319,6 +340,112 @@ node-scraper compare-runs path1 path2 --include-plugins DmesgPlugin --dont-trunc You can pass multiple plugin names to `--skip-plugins` or `--include-plugins`. +#### **'show-redfish-oem-allowable' subcommand** +The `show-redfish-oem-allowable` subcommand fetches the list of OEM diagnostic types supported by your BMC (from the Redfish LogService `OEMDiagnosticDataType@Redfish.AllowableValues`). Use it to discover which types you can put in `oem_diagnostic_types_allowable` and `oem_diagnostic_types` in the Redfish OEM diag plugin config. + +**Requirements:** A Redfish connection config (same as for RedfishOemDiagPlugin). + +**Command:** +```sh +node-scraper --connection-config connection_config.json show-redfish-oem-allowable --log-service-path "redfish/v1/Systems/UBB/LogServices/DiagLogs" +``` + +Output is a JSON array of allowable type names (e.g. `["Dmesg", "JournalControl", "AllLogs", ...]`). Copy that list into your plugin config’s `oem_diagnostic_types_allowable` to match what your BMC supports. + +**Redfish OEM diag plugin config example** + +Use a plugin config that points at your LogService and lists the types to collect. Logs are written under the run log path (see `--log-path`). + +```json +{ + "name": "Redfish OEM diagnostic logs", + "desc": "Collect OEM diagnostic logs from Redfish LogService. Requires Redfish connection config.", + "global_args": {}, + "plugins": { + "RedfishOemDiagPlugin": { + "collection_args": { + "log_service_path": "redfish/v1/Systems/UBB/LogServices/DiagLogs", + "oem_diagnostic_types_allowable": [ + "JournalControl", + ... + "AllLogs" + ], + "oem_diagnostic_types": ["JournalControl", "AllLogs"], + "task_timeout_s": 600 + }, + "analysis_args": { + "require_all_success": false + } + } + }, + "result_collators": {} +} +``` + +- **`log_service_path`**: Redfish path to the LogService (e.g. DiagLogs). Must match your system (e.g. `UBB` vs. another system id). 
+- **`oem_diagnostic_types_allowable`**: Full list of types the BMC supports (from `show-redfish-oem-allowable` or vendor docs). +- **`oem_diagnostic_types`**: Subset of types to collect on each run (e.g. `["JournalControl", "AllLogs"]`). +- **`task_timeout_s`**: Max seconds to wait per collection task. + +**How to use** + +1. **Discover allowable types** (optional): run `show-redfish-oem-allowable` and paste the output into `oem_diagnostic_types_allowable` in your plugin config. +2. **Set `oem_diagnostic_types`** to the list you want to collect (e.g. `["JournalControl", "AllLogs"]`). +3. **Run the plugin** with a Redfish connection config and your plugin config: + ```sh + node-scraper --connection-config connection_config.json --plugin-configs plugin_config_redfish_oem_diag.json run-plugins RedfishOemDiagPlugin + ``` +4. Use **`--log-path`** to choose where run logs (and OEM diag archives) are written; archives go under `/scraper_logs__/redfish_oem_diag_plugin/redfish_oem_diag_collector/`. + +#### **RedfishEndpointPlugin** + +The RedfishEndpointPlugin issues GET requests against Redfish URIs, stores the responses, and optionally runs checks on the returned JSON. It requires a Redfish connection config (same as RedfishOemDiagPlugin). + +**How to run** + +1. Create a connection config (e.g. `connection_config.json`) with `RedfishConnectionManager` and your BMC host, credentials, and API root. +2. Create a plugin config with `uris` to collect and optional `checks` for analysis (see example below). For example, save it as `plugin_config_redfish_endpoint.json`. +3. 
Run: + ```sh + node-scraper --connection-config connection_config.json --plugin-configs plugin_config_redfish_endpoint.json run-plugins RedfishEndpointPlugin + ``` + +**Sample plugin config** (`plugin_config_redfish_endpoint.json`): + +```json +{ + "name": "RedfishEndpointPlugin", + "desc": "Redfish endpoint: collect URIs and optional checks", + "global_args": {}, + "plugins": { + "RedfishEndpointPlugin": { + "collection_args": { + "uris": [ + "/redfish/v1/", + "/redfish/v1/Systems/1", + "/redfish/v1/Chassis/1/Power" + ] + }, + "analysis_args": { + "checks": { + "/redfish/v1/Systems/1": { + "PowerState": "On", + "Status/Health": { "anyOf": ["OK", "Warning"] } + }, + "/redfish/v1/Chassis/1/Power": { + "PowerControl/0/PowerConsumedWatts": { "max": 1000 } + } + } + } + } + }, + "result_collators": {} +} +``` + +- **`uris`**: List of Redfish paths (e.g. `/redfish/v1/`, `/redfish/v1/Systems/1`) to GET and store. +- **`checks`**: Optional. Map of URI to expected values or constraints for analysis. Supports exact match (e.g. `"PowerState": "On"`), `anyOf`, `min`/`max`, etc. + #### **'summary' sub command** The 'summary' subcommand can be used to combine results from multiple runs of node-scraper to a single summary.csv file. Sample run: @@ -370,7 +497,11 @@ Below is an example that skips sudo requiring plugins and disables analysis. ``` #### Plugin config: **'--plugin-configs' command** -A plugin config can be used to compare the system data against the config specifications: +A plugin config can be used to compare the system data against the config specifications. +Built-in configs include **NodeStatus** (a subset of plugins) and **AllPlugins** (runs every +registered plugin with default arguments, useful for generating a reference config from the full system). 
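The `checks` constraint forms used by RedfishEndpointPlugin above (exact match, `anyOf`, `min`/`max`, with `/`-separated paths into the response JSON) can be sketched as follows. This is an illustrative evaluator under those assumptions, not node-scraper's actual analyzer code.

```python
# Illustrative sketch of the `checks` semantics shown in the sample config.
# Constraint names ("anyOf", "min", "max") mirror that example; this is
# not node-scraper's actual analyzer implementation.
def get_path(doc, path):
    """Resolve a '/'-separated path such as 'Status/Health' or
    'PowerControl/0/PowerConsumedWatts' against nested JSON data."""
    cur = doc
    for part in path.split("/"):
        # Numeric parts index into lists, others into dicts
        cur = cur[int(part)] if isinstance(cur, list) else cur[part]
    return cur

def check_value(value, constraint):
    """Apply one constraint: exact match, anyOf, or min/max bounds."""
    if isinstance(constraint, dict):
        if "anyOf" in constraint:
            return value in constraint["anyOf"]
        ok = True
        if "min" in constraint:
            ok = ok and value >= constraint["min"]
        if "max" in constraint:
            ok = ok and value <= constraint["max"]
        return ok
    return value == constraint  # plain value => exact match

# One URI's response and its checks, mirroring the sample config above
response = {"PowerState": "On", "Status": {"Health": "Warning"}}
checks = {"PowerState": "On", "Status/Health": {"anyOf": ["OK", "Warning"]}}
results = {p: check_value(get_path(response, p), c) for p, c in checks.items()}
print(results)  # {'PowerState': True, 'Status/Health': True}
```

Under this reading, the `/redfish/v1/Chassis/1/Power` check in the sample config passes whenever `PowerControl/0/PowerConsumedWatts` is at most 1000.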
+ +Using a JSON file: ```sh node-scraper --plugin-configs plugin_config.json ``` @@ -431,10 +562,16 @@ Here is an example of a comprehensive plugin config that specifies analyzer args This command can be used to generate a reference config that is populated with current system configurations. Plugins that use analyzer args (where applicable) will be populated with system data. -Sample command: + +**Run all registered plugins (AllPlugins config):** ```sh -node-scraper --gen-reference-config run-plugins BiosPlugin OsPlugin +node-scraper --plugin-configs AllPlugins +``` + +**Generate a reference config for specific plugins:** +```sh +node-scraper --gen-reference-config run-plugins BiosPlugin OsPlugin ``` This will generate the following config: ```json diff --git a/docs/PLUGIN_DOC.md b/docs/PLUGIN_DOC.md index a7ccc4d4..921045bc 100644 --- a/docs/PLUGIN_DOC.md +++ b/docs/PLUGIN_DOC.md @@ -2,32 +2,34 @@ # Plugin Table -| Plugin | Collection | Analysis | DataModel | Collector | Analyzer | -| --- | --- | --- | --- | --- | --- | -| AmdSmiPlugin | firmware --json
list --json
partition --json
process --json
ras --cper --folder={folder}
ras --afid --cper-file {cper_file}
static -g all --json
static -g {gpu_id} --json
version --json | **Analyzer Args:**
- `check_static_data`: bool
- `expected_gpu_processes`: Optional[int]
- `expected_max_power`: Optional[int]
- `expected_driver_version`: Optional[str]
- `expected_memory_partition_mode`: Optional[str]
- `expected_compute_partition_mode`: Optional[str]
- `expected_pldm_version`: Optional[str]
- `l0_to_recovery_count_error_threshold`: Optional[int]
- `l0_to_recovery_count_warning_threshold`: Optional[int]
- `vendorid_ep`: Optional[str]
- `vendorid_ep_vf`: Optional[str]
- `devid_ep`: Optional[str]
- `devid_ep_vf`: Optional[str]
- `sku_name`: Optional[str]
- `expected_xgmi_speed`: Optional[list[float]]
- `analysis_range_start`: Optional[datetime.datetime]
- `analysis_range_end`: Optional[datetime.datetime] | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) | -| BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'
wmic bios get SMBIOSBIOSVersion /Value | **Analyzer Args:**
- `exp_bios_version`: list[str]
- `regex_match`: bool | [BiosDataModel](#BiosDataModel-Model) | [BiosCollector](#Collector-Class-BiosCollector) | [BiosAnalyzer](#Data-Analyzer-Class-BiosAnalyzer) | -| CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**
- `required_cmdline`: Union[str, List]
- `banned_cmdline`: Union[str, List]
- `os_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig]
- `platform_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) | -| DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor \| Measure-Object).Count"
lspci -d {vendorid_ep}: \| grep -i 'VGA\\|Display\\|3D' \| wc -l
powershell -Command "(wmic path win32_VideoController get name \| findstr AMD \| Measure-Object).Count"
lscpu
lshw
lspci -d {vendorid_ep}: \| grep -i 'Virtual Function' \| wc -l
powershell -Command "(Get-VMHostPartitionableGpu \| Measure-Object).Count" | **Analyzer Args:**
- `cpu_count`: Optional[list[int]]
- `gpu_count`: Optional[list[int]]
- `vf_count`: Optional[list[int]] | [DeviceEnumerationDataModel](#DeviceEnumerationDataModel-Model) | [DeviceEnumerationCollector](#Collector-Class-DeviceEnumerationCollector) | [DeviceEnumerationAnalyzer](#Data-Analyzer-Class-DeviceEnumerationAnalyzer) | -| DimmPlugin | sh -c 'dmidecode -t 17 \| tr -s " " \| grep -v "Volatile\\|None\\|Module" \| grep Size' 2>/dev/null
dmidecode
wmic memorychip get Capacity | - | [DimmDataModel](#DimmDataModel-Model) | [DimmCollector](#Collector-Class-DimmCollector) | - | -| DkmsPlugin | dkms status
dkms --version | **Analyzer Args:**
- `dkms_status`: Union[str, list]
- `dkms_version`: Union[str, list]
- `regex_match`: bool | [DkmsDataModel](#DkmsDataModel-Model) | [DkmsCollector](#Collector-Class-DkmsCollector) | [DkmsAnalyzer](#Data-Analyzer-Class-DkmsAnalyzer) | -| DmesgPlugin | dmesg --time-format iso -x
ls -1 /var/log/dmesg* 2>/dev/null \| grep -E '^/var/log/dmesg(\.[0-9]+(\.gz)?)?$' \|\| true | **Built-in Regexes:**
- Out of memory error: `(?:oom_kill_process.*)\|(?:Out of memory.*)`
- I/O Page Fault: `IO_PAGE_FAULT`
- Kernel Panic: `\bkernel panic\b.*`
- SQ Interrupt: `sq_intr`
- SRAM ECC: `sram_ecc.*`
- Failed to load driver. IP hardware init error.: `\[amdgpu\]\] \*ERROR\* hw_init of IP block.*`
- Failed to load driver. IP software init error.: `\[amdgpu\]\] \*ERROR\* sw_init of IP block.*`
- Real Time throttling activated: `sched: RT throttling activated.*`
- RCU preempt detected stalls: `rcu_preempt detected stalls.*`
- RCU preempt self-detected stall: `rcu_preempt self-detected stall.*`
- QCM fence timeout: `qcm fence wait loop timeout.*`
- General protection fault: `(?:[\w-]+(?:\[[0-9.]+\])?\s+)?general protectio...`
- Segmentation fault: `(?:segfault.*in .*\[)\|(?:[Ss]egmentation [Ff]au...`
- Failed to disallow cf state: `amdgpu: Failed to disallow cf state.*`
- Failed to terminate tmr: `\*ERROR\* Failed to terminate tmr.*`
- Suspend of IP block failed: `\*ERROR\* suspend of IP block <\w+> failed.*`
- amdgpu Page Fault: `(amdgpu \w{4}:\w{2}:\w{2}\.\w:\s+amdgpu:\s+\[\S...`
- Page Fault: `page fault for address.*`
- Fatal error during GPU init: `(?:amdgpu)(.*Fatal error during GPU init)\|(Fata...`
- PCIe AER Error Status: `(pcieport [\w:.]+: AER: aer_status:[^\n]*(?:\n[...`
- PCIe AER Correctable Error Status: `(.*aer_cor_status: 0x[0-9a-fA-F]+, aer_cor_mask...`
- PCIe AER Uncorrectable Error Status: `(.*aer_uncor_status: 0x[0-9a-fA-F]+, aer_uncor_...`
- PCIe AER Uncorrectable Error Severity with TLP Header: `(.*aer_uncor_severity: 0x[0-9a-fA-F]+.*)(\n.*TL...`
- Failed to read journal file: `Failed to read journal file.*`
- Journal file corrupted or uncleanly shut down: `journal corrupted or uncleanly shut down.*`
- ACPI BIOS Error: `ACPI BIOS Error`
- ACPI Error: `ACPI Error`
- Filesystem corrupted!: `EXT4-fs error \(device .*\):`
- Error in buffered IO, check filesystem integrity: `(Buffer I\/O error on dev)(?:ice)? (\w+)`
- PCIe card no longer present: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`
- PCIe Link Down: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`
- Mismatched clock configuration between PCIe device and host: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(curren...`
- RAS Correctable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Uncorrectable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Deferred Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Corrected PCIe Error: `((?:\[Hardware Error\]:\s+)?event severity: cor...`
- GPU Reset: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- GPU reset failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`
- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`
- MCE Error: `\[Hardware Error\]:.+MC\d+_STATUS.*(?:\n.*){0,5}`
- Mode 2 Reset Failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)? (...`
- RAS Corrected Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- SGX Error: `x86/cpu: SGX disabled by BIOS`
- MMP Error: `Failed to load MMP firmware qat_4xxx_mmp.bin`
- GPU Throttled: `amdgpu \w{4}:\w{2}:\w{2}.\w: amdgpu: WARN: GPU ...`
- RAS Poison Consumed: `amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...`
- RAS Poison created: `amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...`
- Bad page threshold exceeded: `(amdgpu: Saved bad pages (\d+) reaches threshol...`
- RAS Hardware Error: `Hardware error from APEI Generic Hardware Error...`
- Error Address: `Error Address.*(?:\s.*)`
- RAS EDR Event: `EDR: EDR event received`
- DPC Event: `DPC: .*`
- LNet: ko2iblnd has no matching interfaces: `(?:\[[^\]]+\]\s*)?LNetError:.*ko2iblnd:\s*No ma...`
- LNet: Error starting up LNI: `(?:\[[^\]]+\]\s*)?LNetError:\s*.*Error\s*-?\d+\...`
- Lustre: network initialisation failed: `LustreError:.*ptlrpc_init_portals\(\).*network ...` | [DmesgData](#DmesgData-Model) | [DmesgCollector](#Collector-Class-DmesgCollector) | [DmesgAnalyzer](#Data-Analyzer-Class-DmesgAnalyzer) | -| FabricsPlugin | ibstat
ibv_devinfo
ls -l /sys/class/infiniband/*/device/net
mst start
mst status -v
ofed_info -s
rdma dev
rdma link | - | [FabricsDataModel](#FabricsDataModel-Model) | [FabricsCollector](#Collector-Class-FabricsCollector) | - | -| JournalPlugin | journalctl --no-pager --system --output=short-iso
journalctl --no-pager --system --output=json | **Analyzer Args:**
- `check_priority`: Optional[int]
- `group`: bool | [JournalData](#JournalData-Model) | [JournalCollector](#Collector-Class-JournalCollector) | [JournalAnalyzer](#Data-Analyzer-Class-JournalAnalyzer) | -| KernelPlugin | sh -c 'uname -a'
sh -c 'cat /proc/sys/kernel/numa_balancing'
wmic os get Version /Value | **Analyzer Args:**
- `exp_kernel`: Union[str, list]
- `exp_numa`: Optional[int]
- `regex_match`: bool | [KernelDataModel](#KernelDataModel-Model) | [KernelCollector](#Collector-Class-KernelCollector) | [KernelAnalyzer](#Data-Analyzer-Class-KernelAnalyzer) | -| KernelModulePlugin | cat /proc/modules
modinfo amdgpu
wmic os get Version /Value | **Analyzer Args:**
- `kernel_modules`: dict[str, dict]
- `regex_filter`: list[str] | [KernelModuleDataModel](#KernelModuleDataModel-Model) | [KernelModuleCollector](#Collector-Class-KernelModuleCollector) | [KernelModuleAnalyzer](#Data-Analyzer-Class-KernelModuleAnalyzer) | -| MemoryPlugin | free -b
lsmem
numactl -H
wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value | **Analyzer Args:**
- `ratio`: float
- `memory_threshold`: str | [MemoryDataModel](#MemoryDataModel-Model) | [MemoryCollector](#Collector-Class-MemoryCollector) | [MemoryAnalyzer](#Data-Analyzer-Class-MemoryAnalyzer) | -| NetworkPlugin | ip addr show
curl
ethtool {interface}
lldpcli show neighbor
lldpctl
ip neighbor show
niccli --dev {device_num} qos --ets --show
niccli --list_devices
nicctl show card
nicctl show dcqcn
nicctl show environment
nicctl show pcie ats
nicctl show port
nicctl show qos
nicctl show rdma statistics
nicctl show version firmware
nicctl show version host-software
ping
ip route show
ip rule show
wget | - | [NetworkDataModel](#NetworkDataModel-Model) | [NetworkCollector](#Collector-Class-NetworkCollector) | - | -| NvmePlugin | nvme smart-log {dev}
nvme error-log {dev} --log-entries=256
nvme id-ctrl {dev}
nvme id-ns {dev}{ns}
nvme fw-log {dev}
nvme self-test-log {dev}
nvme get-log {dev} --log-id=6 --log-len=512
nvme telemetry-log {dev} --output-file={dev}_{f_name}
nvme list -o json | - | [NvmeDataModel](#NvmeDataModel-Model) | [NvmeCollector](#Collector-Class-NvmeCollector) | - | -| OsPlugin | sh -c '( lsb_release -ds \|\| (cat /etc/*release \| grep PRETTY_NAME) \|\| uname -om ) 2>/dev/null \| head -n1'
cat /etc/*release \| grep VERSION_ID
wmic os get Version /value
wmic os get Caption /Value | **Analyzer Args:**
- `exp_os`: Union[str, list]
- `exact_match`: bool | [OsDataModel](#OsDataModel-Model) | [OsCollector](#Collector-Class-OsCollector) | [OsAnalyzer](#Data-Analyzer-Class-OsAnalyzer) | -| PackagePlugin | dnf list --installed
dpkg-query -W
pacman -Q
cat /etc/*release
wmic product get name,version | **Analyzer Args:**
- `exp_package_ver`: Dict[str, Optional[str]]
- `regex_match`: bool
- `rocm_regex`: Optional[str]
- `enable_rocm_regex`: bool | [PackageDataModel](#PackageDataModel-Model) | [PackageCollector](#Collector-Class-PackageCollector) | [PackageAnalyzer](#Data-Analyzer-Class-PackageAnalyzer) | -| PciePlugin | lspci -d {vendor_id}: -nn
lspci -x
lspci -xxxx
lspci -PP
lspci -PP -d {vendor_id}:{dev_id}
lspci -vvv
lspci -vvvt | **Analyzer Args:**
- `exp_speed`: int
- `exp_width`: int
- `exp_sriov_count`: int
- `exp_gpu_count_override`: Optional[int]
- `exp_max_payload_size`: Union[Dict[int, int], int, NoneType]
- `exp_max_rd_req_size`: Union[Dict[int, int], int, NoneType]
- `exp_ten_bit_tag_req_en`: Union[Dict[int, int], int, NoneType] | [PcieDataModel](#PcieDataModel-Model) | [PcieCollector](#Collector-Class-PcieCollector) | [PcieAnalyzer](#Data-Analyzer-Class-PcieAnalyzer) | -| ProcessPlugin | top -b -n 1
rocm-smi --showpids
top -b -n 1 -o %CPU | **Analyzer Args:**
- `max_kfd_processes`: int
- `max_cpu_usage`: float | [ProcessDataModel](#ProcessDataModel-Model) | [ProcessCollector](#Collector-Class-ProcessCollector) | [ProcessAnalyzer](#Data-Analyzer-Class-ProcessAnalyzer) | -| RocmPlugin | {rocm_path}/opencl/bin/*/clinfo
env \| grep -Ei 'rocm\|hsa\|hip\|mpi\|openmp\|ucx\|miopen'
ls /sys/class/kfd/kfd/proc/
grep -i -E 'rocm' /etc/ld.so.conf.d/*
{rocm_path}/bin/rocminfo
ls -v -d /opt/rocm*
ls -v -d /opt/rocm-[3-7]* \| tail -1
ldconfig -p \| grep -i -E 'rocm'
grep . -r /opt/rocm/.info/*
/opt/rocm/.info/version-rocm
/opt/rocm/.info/version | **Analyzer Args:**
- `exp_rocm`: Union[str, list]
- `exp_rocm_latest`: str
- `exp_rocm_sub_versions`: dict[str, Union[str, list]] | [RocmDataModel](#RocmDataModel-Model) | [RocmCollector](#Collector-Class-RocmCollector) | [RocmAnalyzer](#Data-Analyzer-Class-RocmAnalyzer) | -| StoragePlugin | sh -c 'df -lH -B1 \| grep -v 'boot''
wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace | - | [StorageDataModel](#StorageDataModel-Model) | [StorageCollector](#Collector-Class-StorageCollector) | [StorageAnalyzer](#Data-Analyzer-Class-StorageAnalyzer) | -| SysSettingsPlugin | cat /sys/{}
ls -1 /sys/{} | **Analyzer Args:**
- `checks`: Optional[list[nodescraper.plugins.inband.sys_settings.analyzer_args.SysfsCheck]] | [SysSettingsDataModel](#SysSettingsDataModel-Model) | [SysSettingsCollector](#Collector-Class-SysSettingsCollector) | [SysSettingsAnalyzer](#Data-Analyzer-Class-SysSettingsAnalyzer) | -| SysctlPlugin | sysctl -n | **Analyzer Args:**
- `exp_vm_swappiness`: Optional[int]
- `exp_vm_numa_balancing`: Optional[int]
- `exp_vm_oom_kill_allocating_task`: Optional[int]
- `exp_vm_compaction_proactiveness`: Optional[int]
- `exp_vm_compact_unevictable_allowed`: Optional[int]
- `exp_vm_extfrag_threshold`: Optional[int]
- `exp_vm_zone_reclaim_mode`: Optional[int]
- `exp_vm_dirty_background_ratio`: Optional[int]
- `exp_vm_dirty_ratio`: Optional[int]
- `exp_vm_dirty_writeback_centisecs`: Optional[int]
- `exp_kernel_numa_balancing`: Optional[int] | [SysctlDataModel](#SysctlDataModel-Model) | [SysctlCollector](#Collector-Class-SysctlCollector) | [SysctlAnalyzer](#Data-Analyzer-Class-SysctlAnalyzer) | -| SyslogPlugin | ls -1 /var/log/syslog* 2>/dev/null \| grep -E '^/var/log/syslog(\.[0-9]+(\.gz)?)?$' \|\| true | - | [SyslogData](#SyslogData-Model) | [SyslogCollector](#Collector-Class-SyslogCollector) | - | -| UptimePlugin | uptime | - | [UptimeDataModel](#UptimeDataModel-Model) | [UptimeCollector](#Collector-Class-UptimeCollector) | - | +| Plugin | Collection | Analyzer Args | Collection Args | DataModel | Collector | Analyzer | +| --- | --- | --- | --- | --- | --- | --- | +| AmdSmiPlugin | bad-pages
firmware --json
list --json
metric -g all
partition --json
process --json
ras --cper --folder={folder}
ras --afid --cper-file {cper_file}
static -g all --json
static -g {gpu_id} --json
topology
version --json
xgmi -l
xgmi -m | **Analyzer Args:**
- `check_static_data`: bool
- `expected_gpu_processes`: Optional[int]
- `expected_max_power`: Optional[int]
- `expected_driver_version`: Optional[str]
- `expected_memory_partition_mode`: Optional[str]
- `expected_compute_partition_mode`: Optional[str]
- `expected_pldm_version`: Optional[str]
- `l0_to_recovery_count_error_threshold`: Optional[int]
- `l0_to_recovery_count_warning_threshold`: Optional[int]
- `vendorid_ep`: Optional[str]
- `vendorid_ep_vf`: Optional[str]
- `devid_ep`: Optional[str]
- `devid_ep_vf`: Optional[str]
- `sku_name`: Optional[str]
- `expected_xgmi_speed`: Optional[list[float]]
- `analysis_range_start`: Optional[datetime.datetime]
- `analysis_range_end`: Optional[datetime.datetime] | **Collection Args:**
- `cper_file_path`: Optional[str] | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) | +| BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'
wmic bios get SMBIOSBIOSVersion /Value | **Analyzer Args:**
- `exp_bios_version`: list[str]
- `regex_match`: bool | - | [BiosDataModel](#BiosDataModel-Model) | [BiosCollector](#Collector-Class-BiosCollector) | [BiosAnalyzer](#Data-Analyzer-Class-BiosAnalyzer) | +| CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**
- `required_cmdline`: Union[str, List]
- `banned_cmdline`: Union[str, List]
- `os_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig]
- `platform_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] | - | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) | +| DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor \| Measure-Object).Count"
lspci -d {vendorid_ep}: \| grep -i 'VGA\\|Display\\|3D' \| wc -l
powershell -Command "(wmic path win32_VideoController get name \| findstr AMD \| Measure-Object).Count"
lscpu
lshw
lspci -d {vendorid_ep}: \| grep -i 'Virtual Function' \| wc -l
powershell -Command "(Get-VMHostPartitionableGpu \| Measure-Object).Count" | **Analyzer Args:**
- `cpu_count`: Optional[list[int]]
- `gpu_count`: Optional[list[int]]
- `vf_count`: Optional[list[int]] | - | [DeviceEnumerationDataModel](#DeviceEnumerationDataModel-Model) | [DeviceEnumerationCollector](#Collector-Class-DeviceEnumerationCollector) | [DeviceEnumerationAnalyzer](#Data-Analyzer-Class-DeviceEnumerationAnalyzer) | +| DimmPlugin | sh -c 'dmidecode -t 17 \| tr -s " " \| grep -v "Volatile\\|None\\|Module" \| grep Size' 2>/dev/null
dmidecode
wmic memorychip get Capacity | - | **Collection Args:**
- `skip_sudo`: bool | [DimmDataModel](#DimmDataModel-Model) | [DimmCollector](#Collector-Class-DimmCollector) | - | +| DkmsPlugin | dkms status
dkms --version | **Analyzer Args:**
- `dkms_status`: Union[str, list]
- `dkms_version`: Union[str, list]
- `regex_match`: bool | - | [DkmsDataModel](#DkmsDataModel-Model) | [DkmsCollector](#Collector-Class-DkmsCollector) | [DkmsAnalyzer](#Data-Analyzer-Class-DkmsAnalyzer) | +| DmesgPlugin | dmesg --time-format iso -x
ls -1 /var/log/dmesg* 2>/dev/null \| grep -E '^/var/log/dmesg(\.[0-9]+(\.gz)?)?$' \|\| true | **Built-in Regexes:**
- Out of memory error: `(?:oom_kill_process.*)\|(?:Out of memory.*)`
- I/O Page Fault: `IO_PAGE_FAULT`
- Kernel Panic: `\bkernel panic\b.*`
- SQ Interrupt: `sq_intr`
- SRAM ECC: `sram_ecc.*`
- Failed to load driver. IP hardware init error.: `\[amdgpu\]\] \*ERROR\* hw_init of IP block.*`
- Failed to load driver. IP software init error.: `\[amdgpu\]\] \*ERROR\* sw_init of IP block.*`
- Real Time throttling activated: `sched: RT throttling activated.*`
- RCU preempt detected stalls: `rcu_preempt detected stalls.*`
- RCU preempt self-detected stall: `rcu_preempt self-detected stall.*`
- QCM fence timeout: `qcm fence wait loop timeout.*`
- General protection fault: `(?:[\w-]+(?:\[[0-9.]+\])?\s+)?general protectio...`
- Segmentation fault: `(?:segfault.*in .*\[)\|(?:[Ss]egmentation [Ff]au...`
- Failed to disallow cf state: `amdgpu: Failed to disallow cf state.*`
- Failed to terminate tmr: `\*ERROR\* Failed to terminate tmr.*`
- Suspend of IP block failed: `\*ERROR\* suspend of IP block <\w+> failed.*`
- amdgpu Page Fault: `(amdgpu \w{4}:\w{2}:\w{2}\.\w:\s+amdgpu:\s+\[\S...`
- Page Fault: `page fault for address.*`
- Fatal error during GPU init: `(?:amdgpu)(.*Fatal error during GPU init)\|(Fata...`
- PCIe AER Error Status: `(pcieport [\w:.]+: AER: aer_status:[^\n]*(?:\n[...`
- PCIe AER Correctable Error Status: `(.*aer_cor_status: 0x[0-9a-fA-F]+, aer_cor_mask...`
- PCIe AER Uncorrectable Error Status: `(.*aer_uncor_status: 0x[0-9a-fA-F]+, aer_uncor_...`
- PCIe AER Uncorrectable Error Severity with TLP Header: `(.*aer_uncor_severity: 0x[0-9a-fA-F]+.*)(\n.*TL...`
- Failed to read journal file: `Failed to read journal file.*`
- Journal file corrupted or uncleanly shut down: `journal corrupted or uncleanly shut down.*`
- ACPI BIOS Error: `ACPI BIOS Error`
- ACPI Error: `ACPI Error`
- Filesystem corrupted!: `EXT4-fs error \(device .*\):`
- Error in buffered IO, check filesystem integrity: `(Buffer I\/O error on dev)(?:ice)? (\w+)`
- PCIe card no longer present: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`
- PCIe Link Down: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`
- Mismatched clock configuration between PCIe device and host: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(curren...`
- RAS Correctable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Uncorrectable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Deferred Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Corrected PCIe Error: `((?:\[Hardware Error\]:\s+)?event severity: cor...`
- GPU Reset: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- GPU reset failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`
- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`
- MCE Error: `\[Hardware Error\]:.+MC\d+_STATUS.*(?:\n.*){0,5}`
- Mode 2 Reset Failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)? (...`
- RAS Corrected Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- SGX Error: `x86/cpu: SGX disabled by BIOS`
- MMP Error: `Failed to load MMP firmware qat_4xxx_mmp.bin`
- GPU Throttled: `amdgpu \w{4}:\w{2}:\w{2}.\w: amdgpu: WARN: GPU ...`
- RAS Poison Consumed: `amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...`
- RAS Poison created: `amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...`
- Bad page threshold exceeded: `(amdgpu: Saved bad pages (\d+) reaches threshol...`
- RAS Hardware Error: `Hardware error from APEI Generic Hardware Error...`
- Error Address: `Error Address.*(?:\s.*)`
- RAS EDR Event: `EDR: EDR event received`
- DPC Event: `DPC: .*`
- LNet: ko2iblnd has no matching interfaces: `(?:\[[^\]]+\]\s*)?LNetError:.*ko2iblnd:\s*No ma...`
- LNet: Error starting up LNI: `(?:\[[^\]]+\]\s*)?LNetError:\s*.*Error\s*-?\d+\...`
- Lustre: network initialisation failed: `LustreError:.*ptlrpc_init_portals\(\).*network ...` | **Collection Args:**
- `collect_rotated_logs`: bool
- `skip_sudo`: bool
- `log_dmesg_data`: bool | [DmesgData](#DmesgData-Model) | [DmesgCollector](#Collector-Class-DmesgCollector) | [DmesgAnalyzer](#Data-Analyzer-Class-DmesgAnalyzer) | +| FabricsPlugin | lspci \| grep -i cassini
lsmod \| grep cxi
cxi_stat
ibstat
ibv_devinfo
ls -l /sys/class/infiniband/*/device/net
fi_info -p cxi
mst start
mst status -v
ip link show
ofed_info -s | - | - | [FabricsDataModel](#FabricsDataModel-Model) | [FabricsCollector](#Collector-Class-FabricsCollector) | - | +| JournalPlugin | journalctl --no-pager --system --output=short-iso
journalctl --no-pager --system --output=json | **Analyzer Args:**
- `check_priority`: Optional[int]
- `group`: bool | **Collection Args:**
- `boot`: Optional[int] | [JournalData](#JournalData-Model) | [JournalCollector](#Collector-Class-JournalCollector) | [JournalAnalyzer](#Data-Analyzer-Class-JournalAnalyzer) | +| KernelPlugin | sh -c 'uname -a'
sh -c 'cat /proc/sys/kernel/numa_balancing'
wmic os get Version /Value | **Analyzer Args:**
- `exp_kernel`: Union[str, list]
- `exp_numa`: Optional[int]
- `regex_match`: bool | - | [KernelDataModel](#KernelDataModel-Model) | [KernelCollector](#Collector-Class-KernelCollector) | [KernelAnalyzer](#Data-Analyzer-Class-KernelAnalyzer) | +| KernelModulePlugin | cat /proc/modules
modinfo amdgpu
wmic os get Version /Value | **Analyzer Args:**
- `kernel_modules`: dict[str, dict]
- `regex_filter`: list[str] | - | [KernelModuleDataModel](#KernelModuleDataModel-Model) | [KernelModuleCollector](#Collector-Class-KernelModuleCollector) | [KernelModuleAnalyzer](#Data-Analyzer-Class-KernelModuleAnalyzer) | +| MemoryPlugin | free -b
lsmem
numactl -H
wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value | **Analyzer Args:**
- `ratio`: float
- `memory_threshold`: str | - | [MemoryDataModel](#MemoryDataModel-Model) | [MemoryCollector](#Collector-Class-MemoryCollector) | [MemoryAnalyzer](#Data-Analyzer-Class-MemoryAnalyzer) | +| NetworkPlugin | ip addr show
curl
ethtool -S {interface}
ethtool {interface}
lldpcli show neighbor
lldpctl
ip neighbor show
ping
ip route show
ip rule show
wget | - | **Collection Args:**
- `url`: Optional[str]
- `netprobe`: Optional[Literal['ping', 'wget', 'curl']] | [NetworkDataModel](#NetworkDataModel-Model) | [NetworkCollector](#Collector-Class-NetworkCollector) | - | +| NicPlugin | - | **Analyzer Args:**
- `expected_values`: Optional[Dict[str, Dict[str, Any]]]
- `performance_profile_expected`: str
- `support_rdma_disabled_values`: List[str]
- `pcie_relaxed_ordering_expected`: str
- `expected_qos_prio_map`: Optional[Dict[Any, Any]]
- `expected_qos_pfc_enabled`: Optional[int]
- `expected_qos_tsa_map`: Optional[Dict[Any, Any]]
- `expected_qos_tc_bandwidth`: Optional[List[int]]
- `require_qos_consistent_across_adapters`: bool
- `nicctl_log_error_regex`: Optional[List[Dict[str, Any]]] | **Collection Args:**
- `commands`: Optional[List[str]]
- `use_sudo_niccli`: bool
- `use_sudo_nicctl`: bool | [NicDataModel](#NicDataModel-Model) | [NicCollector](#Collector-Class-NicCollector) | [NicAnalyzer](#Data-Analyzer-Class-NicAnalyzer) | +| NvmePlugin | nvme smart-log {dev}
nvme error-log {dev} --log-entries=256
nvme id-ctrl {dev}
nvme id-ns {dev}{ns}
nvme fw-log {dev}
nvme self-test-log {dev}
nvme get-log {dev} --log-id=6 --log-len=512
nvme telemetry-log {dev} --output-file={dev}_{f_name}
nvme list -o json | - | - | [NvmeDataModel](#NvmeDataModel-Model) | [NvmeCollector](#Collector-Class-NvmeCollector) | - | +| OsPlugin | sh -c '( lsb_release -ds \|\| (cat /etc/*release \| grep PRETTY_NAME) \|\| uname -om ) 2>/dev/null \| head -n1'
cat /etc/*release \| grep VERSION_ID
wmic os get Version /value
wmic os get Caption /Value | **Analyzer Args:**
- `exp_os`: Union[str, list]
- `exact_match`: bool | - | [OsDataModel](#OsDataModel-Model) | [OsCollector](#Collector-Class-OsCollector) | [OsAnalyzer](#Data-Analyzer-Class-OsAnalyzer) | +| PackagePlugin | dnf list --installed
dpkg-query -W
pacman -Q
cat /etc/*release
wmic product get name,version | **Analyzer Args:**
- `exp_package_ver`: Dict[str, Optional[str]]
- `regex_match`: bool
- `rocm_regex`: Optional[str]
- `enable_rocm_regex`: bool | - | [PackageDataModel](#PackageDataModel-Model) | [PackageCollector](#Collector-Class-PackageCollector) | [PackageAnalyzer](#Data-Analyzer-Class-PackageAnalyzer) | +| PciePlugin | lspci -d {vendor_id}: -nn
lspci -x
lspci -xxxx
lspci -PP
lspci -PP -d {vendor_id}:{dev_id}
lspci -vvv
lspci -vvvt | **Analyzer Args:**
- `exp_speed`: int
- `exp_width`: int
- `exp_sriov_count`: int
- `exp_gpu_count_override`: Optional[int]
- `exp_max_payload_size`: Union[Dict[int, int], int, NoneType]
- `exp_max_rd_req_size`: Union[Dict[int, int], int, NoneType]
- `exp_ten_bit_tag_req_en`: Union[Dict[int, int], int, NoneType] | - | [PcieDataModel](#PcieDataModel-Model) | [PcieCollector](#Collector-Class-PcieCollector) | [PcieAnalyzer](#Data-Analyzer-Class-PcieAnalyzer) | +| ProcessPlugin | top -b -n 1
rocm-smi --showpids
top -b -n 1 -o %CPU | **Analyzer Args:**
- `max_kfd_processes`: int
- `max_cpu_usage`: float | **Collection Args:**
- `top_n_process`: int | [ProcessDataModel](#ProcessDataModel-Model) | [ProcessCollector](#Collector-Class-ProcessCollector) | [ProcessAnalyzer](#Data-Analyzer-Class-ProcessAnalyzer) | +| RdmaPlugin | rdma link -j
rdma dev
rdma link
rdma statistic -j | - | - | [RdmaDataModel](#RdmaDataModel-Model) | [RdmaCollector](#Collector-Class-RdmaCollector) | [RdmaAnalyzer](#Data-Analyzer-Class-RdmaAnalyzer) | +| RocmPlugin | {rocm_path}/opencl/bin/*/clinfo
env \| grep -Ei 'rocm\|hsa\|hip\|mpi\|openmp\|ucx\|miopen'
ls /sys/class/kfd/kfd/proc/
grep -i -E 'rocm' /etc/ld.so.conf.d/*
{rocm_path}/bin/rocminfo
ls -v -d {rocm_path}*
ls -v -d {rocm_path}-[3-7]* \| tail -1
ldconfig -p \| grep -i -E 'rocm'
grep . -r {rocm_path}/.info/* | **Analyzer Args:**
- `exp_rocm`: Union[str, list]
- `exp_rocm_latest`: str
- `exp_rocm_sub_versions`: dict[str, Union[str, list]] | **Collection Args:**
- `rocm_path`: str | [RocmDataModel](#RocmDataModel-Model) | [RocmCollector](#Collector-Class-RocmCollector) | [RocmAnalyzer](#Data-Analyzer-Class-RocmAnalyzer) | +| StoragePlugin | sh -c 'df -lH -B1 \| grep -v 'boot''
wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace | - | **Collection Args:**
- `skip_sudo`: bool | [StorageDataModel](#StorageDataModel-Model) | [StorageCollector](#Collector-Class-StorageCollector) | [StorageAnalyzer](#Data-Analyzer-Class-StorageAnalyzer) | +| SysSettingsPlugin | cat /sys/{}
ls -1 /sys/{}
ls -l /sys/{} | **Analyzer Args:**
- `checks`: Optional[list[nodescraper.plugins.inband.sys_settings.analyzer_args.SysfsCheck]] | **Collection Args:**
- `paths`: list[str]
- `directory_paths`: list[str] | [SysSettingsDataModel](#SysSettingsDataModel-Model) | [SysSettingsCollector](#Collector-Class-SysSettingsCollector) | [SysSettingsAnalyzer](#Data-Analyzer-Class-SysSettingsAnalyzer) | +| SysctlPlugin | sysctl -n | **Analyzer Args:**
- `exp_vm_swappiness`: Optional[int]
- `exp_vm_numa_balancing`: Optional[int]
- `exp_vm_oom_kill_allocating_task`: Optional[int]
- `exp_vm_compaction_proactiveness`: Optional[int]
- `exp_vm_compact_unevictable_allowed`: Optional[int]
- `exp_vm_extfrag_threshold`: Optional[int]
- `exp_vm_zone_reclaim_mode`: Optional[int]
- `exp_vm_dirty_background_ratio`: Optional[int]
- `exp_vm_dirty_ratio`: Optional[int]
- `exp_vm_dirty_writeback_centisecs`: Optional[int]
- `exp_kernel_numa_balancing`: Optional[int] | - | [SysctlDataModel](#SysctlDataModel-Model) | [SysctlCollector](#Collector-Class-SysctlCollector) | [SysctlAnalyzer](#Data-Analyzer-Class-SysctlAnalyzer) | +| SyslogPlugin | ls -1 /var/log/syslog* 2>/dev/null \| grep -E '^/var/log/syslog(\.[0-9]+(\.gz)?)?$' \|\| true | - | - | [SyslogData](#SyslogData-Model) | [SyslogCollector](#Collector-Class-SyslogCollector) | - | +| UptimePlugin | uptime | - | - | [UptimeDataModel](#UptimeDataModel-Model) | [UptimeCollector](#Collector-Class-UptimeCollector) | - | # Collectors @@ -52,6 +54,11 @@ Class for collection of inband tool amd-smi data. - **CMD_FIRMWARE**: `firmware --json` - **CMD_STATIC**: `static -g all --json` - **CMD_STATIC_GPU**: `static -g {gpu_id} --json` +- **CMD_TOPOLOGY**: `topology` +- **CMD_METRIC**: `metric -g all` +- **CMD_BAD_PAGES**: `bad-pages` +- **CMD_XGMI_METRIC**: `xgmi -m` +- **CMD_XGMI_LINK**: `xgmi -l` - **CMD_RAS**: `ras --cper --folder={folder}` - **CMD_RAS_AFID**: `ras --afid --cper-file {cper_file}` @@ -61,15 +68,20 @@ AmdSmiDataModel ### Commands +- bad-pages - firmware --json - list --json +- metric -g all - partition --json - process --json - ras --cper --folder={folder} - ras --afid --cper-file {cper_file} - static -g all --json - static -g {gpu_id} --json +- topology - version --json +- xgmi -l +- xgmi -m ## Collector Class BiosCollector @@ -246,8 +258,11 @@ Collect InfiniBand/RDMA fabrics configuration details - **CMD_OFED_INFO**: `ofed_info -s` - **CMD_MST_START**: `mst start` - **CMD_MST_STATUS**: `mst status -v` -- **CMD_RDMA_DEV**: `rdma dev` -- **CMD_RDMA_LINK**: `rdma link` +- **CMD_CASSINI_PCI**: `lspci | grep -i cassini` +- **CMD_NET_LINK**: `ip link show` +- **CMD_LIBFABRIC_INFO**: `fi_info -p cxi` +- **CMD_CXI_STAT**: `cxi_stat` +- **CMD_CXI_MODULES**: `lsmod | grep cxi` ### Provides Data @@ -255,14 +270,17 @@ FabricsDataModel ### Commands +- lspci | grep -i cassini +- lsmod | grep cxi +- cxi_stat - ibstat - ibv_devinfo - ls -l 
/sys/class/infiniband/*/device/net +- fi_info -p cxi - mst start - mst status -v +- ip link show - ofed_info -s -- rdma dev -- rdma link ## Collector Class JournalCollector @@ -386,22 +404,12 @@ Collect network configuration details using ip command - **CMD_RULE**: `ip rule show` - **CMD_NEIGHBOR**: `ip neighbor show` - **CMD_ETHTOOL_TEMPLATE**: `ethtool {interface}` +- **CMD_ETHTOOL_S_TEMPLATE**: `ethtool -S {interface}` - **CMD_PING**: `ping` - **CMD_WGET**: `wget` - **CMD_CURL**: `curl` - **CMD_LLDPCLI_NEIGHBOR**: `lldpcli show neighbor` - **CMD_LLDPCTL**: `lldpctl` -- **CMD_NICCLI_LISTDEV**: `niccli --list_devices` -- **CMD_NICCLI_GETQOS_TEMPLATE**: `niccli --dev {device_num} qos --ets --show` -- **CMD_NICCTL_CARD**: `nicctl show card` -- **CMD_NICCTL_DCQCN**: `nicctl show dcqcn` -- **CMD_NICCTL_ENVIRONMENT**: `nicctl show environment` -- **CMD_NICCTL_PCIE_ATS**: `nicctl show pcie ats` -- **CMD_NICCTL_PORT**: `nicctl show port` -- **CMD_NICCTL_QOS**: `nicctl show qos` -- **CMD_NICCTL_RDMA_STATISTICS**: `nicctl show rdma statistics` -- **CMD_NICCTL_VERSION_HOST_SOFTWARE**: `nicctl show version host-software` -- **CMD_NICCTL_VERSION_FIRMWARE**: `nicctl show version firmware` ### Provides Data @@ -411,26 +419,30 @@ NetworkDataModel - ip addr show - curl +- ethtool -S {interface} - ethtool {interface} - lldpcli show neighbor - lldpctl - ip neighbor show -- niccli --dev {device_num} qos --ets --show -- niccli --list_devices -- nicctl show card -- nicctl show dcqcn -- nicctl show environment -- nicctl show pcie ats -- nicctl show port -- nicctl show qos -- nicctl show rdma statistics -- nicctl show version firmware -- nicctl show version host-software - ping - ip route show - ip rule show - wget +## Collector Class NicCollector + +### Description + +Collect raw output from niccli (Broadcom) and nicctl (Pensando) commands. 
+ +**Bases**: ['InBandDataCollector'] + +**Link to code**: [nic_collector.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/nic/nic_collector.py) + +### Provides Data + +NicDataModel + ## Collector Class NvmeCollector ### Description @@ -613,6 +625,35 @@ ProcessDataModel - rocm-smi --showpids - top -b -n 1 -o %CPU +## Collector Class RdmaCollector + +### Description + +Collect RDMA status and statistics via rdma link and rdma statistic commands. + +**Bases**: ['InBandDataCollector'] + +**Link to code**: [rdma_collector.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/rdma/rdma_collector.py) + +### Class Variables + +- **SUPPORTED_OS_FAMILY**: `{}` +- **CMD_LINK**: `rdma link -j` +- **CMD_STATISTIC**: `rdma statistic -j` +- **CMD_RDMA_DEV**: `rdma dev` +- **CMD_RDMA_LINK**: `rdma link` + +### Provides Data + +RdmaDataModel + +### Commands + +- rdma link -j +- rdma dev +- rdma link +- rdma statistic -j + ## Collector Class RocmCollector ### Description @@ -626,15 +667,14 @@ Collect ROCm version data ### Class Variables - **SUPPORTED_OS_FAMILY**: `{}` -- **CMD_VERSION_PATHS**: `['/opt/rocm/.info/version-rocm', '/opt/rocm/.info/version']` -- **CMD_ROCM_SUB_VERSIONS**: `grep . -r /opt/rocm/.info/*` -- **CMD_ROCMINFO**: `{rocm_path}/bin/rocminfo` -- **CMD_ROCM_LATEST**: `ls -v -d /opt/rocm-[3-7]* | tail -1` -- **CMD_ROCM_DIRS**: `ls -v -d /opt/rocm*` +- **CMD_ROCM_SUB_VERSIONS_TMPL**: `grep . 
-r {rocm_path}/.info/*` +- **CMD_ROCMINFO_TMPL**: `{rocm_path}/bin/rocminfo` +- **CMD_ROCM_LATEST_TMPL**: `ls -v -d {rocm_path}-[3-7]* | tail -1` +- **CMD_ROCM_DIRS_TMPL**: `ls -v -d {rocm_path}*` - **CMD_LD_CONF**: `grep -i -E 'rocm' /etc/ld.so.conf.d/*` - **CMD_ROCM_LIBS**: `ldconfig -p | grep -i -E 'rocm'` - **CMD_ENV_VARS**: `env | grep -Ei 'rocm|hsa|hip|mpi|openmp|ucx|miopen'` -- **CMD_CLINFO**: `{rocm_path}/opencl/bin/*/clinfo` +- **CMD_CLINFO_TMPL**: `{rocm_path}/opencl/bin/*/clinfo` - **CMD_KFD_PROC**: `ls /sys/class/kfd/kfd/proc/` ### Provides Data @@ -648,12 +688,10 @@ RocmDataModel - ls /sys/class/kfd/kfd/proc/ - grep -i -E 'rocm' /etc/ld.so.conf.d/* - {rocm_path}/bin/rocminfo -- ls -v -d /opt/rocm* -- ls -v -d /opt/rocm-[3-7]* | tail -1 +- ls -v -d {rocm_path}* +- ls -v -d {rocm_path}-[3-7]* | tail -1 - ldconfig -p | grep -i -E 'rocm' -- grep . -r /opt/rocm/.info/* -- /opt/rocm/.info/version-rocm -- /opt/rocm/.info/version +- grep . -r {rocm_path}/.info/* ## Collector Class StorageCollector @@ -694,6 +732,7 @@ Collect sysfs settings from user-specified paths. - **SUPPORTED_OS_FAMILY**: `{}` - **CMD**: `cat /sys/{}` - **CMD_LS**: `ls -1 /sys/{}` +- **CMD_LS_LONG**: `ls -l /sys/{}` ### Provides Data @@ -703,6 +742,7 @@ SysSettingsDataModel - cat /sys/{} - ls -1 /sys/{} +- ls -l /sys/{} ## Collector Class SysctlCollector @@ -805,7 +845,6 @@ Data model for amd-smi data. 
- **xgmi_link**: `Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.XgmiLinks]]` - **cper_data**: `Optional[list[nodescraper.models.datamodel.FileModel]]` - **cper_afids**: `dict[str, int]` -- **amdsmitst_data**: `nodescraper.plugins.inband.amdsmi.amdsmidata.AmdSmiTstData` ## BiosDataModel Model @@ -894,7 +933,7 @@ Complete InfiniBand/RDMA fabrics configuration data - **ibdev_netdev_mappings**: `List[nodescraper.plugins.inband.fabrics.fabricsdata.IbdevNetdevMapping]` - **ofed_info**: `Optional[nodescraper.plugins.inband.fabrics.fabricsdata.OfedInfo]` - **mst_status**: `Optional[nodescraper.plugins.inband.fabrics.fabricsdata.MstStatus]` -- **rdma_info**: `Optional[nodescraper.plugins.inband.fabrics.fabricsdata.RdmaInfo]` +- **slingshot_data**: `Optional[nodescraper.plugins.inband.fabrics.fabricsdata.SlingshotData]` ## JournalData Model @@ -968,19 +1007,47 @@ Complete network configuration data - **rules**: `List[nodescraper.plugins.inband.network.networkdata.RoutingRule]` - **neighbors**: `List[nodescraper.plugins.inband.network.networkdata.Neighbor]` - **ethtool_info**: `Dict[str, nodescraper.plugins.inband.network.networkdata.EthtoolInfo]` -- **broadcom_nic_devices**: `List[nodescraper.plugins.inband.network.networkdata.BroadcomNicDevice]` -- **broadcom_nic_qos**: `Dict[int, nodescraper.plugins.inband.network.networkdata.BroadcomNicQos]` -- **pensando_nic_cards**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicCard]` -- **pensando_nic_dcqcn**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicDcqcn]` -- **pensando_nic_environment**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicEnvironment]` -- **pensando_nic_pcie_ats**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicPcieAts]` -- **pensando_nic_ports**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicPort]` -- **pensando_nic_qos**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicQos]` -- 
**pensando_nic_rdma_statistics**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicRdmaStatistics]` -- **pensando_nic_version_host_software**: `Optional[nodescraper.plugins.inband.network.networkdata.PensandoNicVersionHostSoftware]` -- **pensando_nic_version_firmware**: `List[nodescraper.plugins.inband.network.networkdata.PensandoNicVersionFirmware]` - **accessible**: `Optional[bool]` +## NicDataModel Model + +### Description + +Collected output of niccli (Broadcom) and nicctl (Pensando) commands. + +**Link to code**: [nic_data.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/nic/nic_data.py) + +**Bases**: ['DataModel'] + +### Model annotations and fields + +- **results**: `Dict[str, nodescraper.plugins.inband.nic.nic_data.NicCommandResult]` +- **card_show**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlCardShow]` +- **cards**: `List[nodescraper.plugins.inband.nic.nic_data.NicCtlCard]` +- **port**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlPort]` +- **lif**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlLif]` +- **qos**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlQos]` +- **rdma**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlRdma]` +- **dcqcn**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlDcqcn]` +- **environment**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlEnvironment]` +- **version**: `Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlVersion]` +- **broadcom_nic_devices**: `List[nodescraper.plugins.inband.nic.nic_data.NicCliDevice]` +- **broadcom_nic_qos**: `Dict[int, nodescraper.plugins.inband.nic.nic_data.NicCliQos]` +- **broadcom_nic_support_rdma**: `Dict[int, str]` +- **broadcom_nic_performance_profile**: `Dict[int, str]` +- **broadcom_nic_pcie_relaxed_ordering**: `Dict[int, str]` +- **pensando_nic_cards**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicCard]` +- **pensando_nic_dcqcn**: 
`List[nodescraper.plugins.inband.nic.nic_data.PensandoNicDcqcn]` +- **pensando_nic_environment**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicEnvironment]` +- **pensando_nic_lif**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicLif]` +- **pensando_nic_pcie_ats**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicPcieAts]` +- **pensando_nic_ports**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicPort]` +- **pensando_nic_qos**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicQos]` +- **pensando_nic_rdma_statistics**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicRdmaStatistics]` +- **pensando_nic_version_host_software**: `Optional[nodescraper.plugins.inband.nic.nic_data.PensandoNicVersionHostSoftware]` +- **pensando_nic_version_firmware**: `List[nodescraper.plugins.inband.nic.nic_data.PensandoNicVersionFirmware]` +- **nicctl_card_logs**: `Optional[Dict[str, str]]` + ## NvmeDataModel Model ### Description @@ -1062,6 +1129,23 @@ class for collection of PCIe data. - **cpu_usage**: `Optional[float]` - **processes**: `Optional[list[tuple[str, str]]]` +## RdmaDataModel Model + +### Description + +Data model for RDMA (Remote Direct Memory Access) statistics and link information. 
+ +**Link to code**: [rdmadata.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/rdma/rdmadata.py) + +**Bases**: ['DataModel'] + +### Model annotations and fields + +- **link_list**: `list[nodescraper.plugins.inband.rdma.rdmadata.RdmaLink]` +- **statistic_list**: `list[nodescraper.plugins.inband.rdma.rdmadata.RdmaStatistics]` +- **dev_list**: `list[nodescraper.plugins.inband.rdma.rdmadata.RdmaDevice]` +- **link_list_text**: `list[nodescraper.plugins.inband.rdma.rdmadata.RdmaLinkText]` + ## RocmDataModel Model **Link to code**: [rocmdata.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/rocm/rocmdata.py) @@ -1159,7 +1243,7 @@ Data model for in band syslog logs ### Description -Check AMD SMI Application data for PCIe, ECC errors, CPER data, and analyze amdsmitst metrics +Check AMD SMI Application data for PCIe, ECC errors, and CPER data. **Bases**: ['CperAnalysisTaskMixin', 'DataAnalyzer'] @@ -1379,6 +1463,16 @@ Check memory usage is within the maximum allowed used memory **Link to code**: [memory_analyzer.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/memory/memory_analyzer.py) +## Data Analyzer Class NicAnalyzer + +### Description + +Analyze niccli/nicctl data; checks Broadcom support_rdma, performance_profile (RoCE), pcie_relaxed_ordering (enabled), and getqos (expected QoS across adapters). + +**Bases**: ['DataAnalyzer'] + +**Link to code**: [nic_analyzer.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/nic/nic_analyzer.py) + ## Data Analyzer Class OsAnalyzer ### Description @@ -1437,6 +1531,108 @@ Check cpu and kfd processes are within allowed maximum cpu and gpu usage **Link to code**: [process_analyzer.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/process/process_analyzer.py) +## Data Analyzer Class RdmaAnalyzer + +### Description + +Check RDMA statistics for errors (RoCE and other RDMA error counters). 
+ +**Bases**: ['DataAnalyzer'] + +**Link to code**: [rdma_analyzer.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/rdma/rdma_analyzer.py) + +### Class Variables + +- **ERROR_FIELDS**: `[ + recoverable_errors, + tx_roce_errors, + tx_roce_discards, + rx_roce_errors, + rx_roce_discards, + local_ack_timeout_err, + packet_seq_err, + max_retry_exceeded, + rnr_nak_retry_err, + implied_nak_seq_err, + unrecoverable_err, + bad_resp_err, + local_qp_op_err, + local_protection_err, + mem_mgmt_op_err, + req_remote_invalid_request, + req_remote_access_errors, + remote_op_err, + duplicate_request, + res_exceed_max, + resp_local_length_error, + res_exceeds_wqe, + res_opcode_err, + res_rx_invalid_rkey, + res_rx_domain_err, + res_rx_no_perm, + res_rx_range_err, + res_tx_invalid_rkey, + res_tx_domain_err, + res_tx_no_perm, + res_tx_range_err, + res_irrq_oflow, + res_unsup_opcode, + res_unaligned_atomic, + res_rem_inv_err, + res_mem_err, + res_srq_err, + res_cmp_err, + res_invalid_dup_rkey, + res_wqe_format_err, + res_cq_load_err, + res_srq_load_err, + res_tx_pci_err, + res_rx_pci_err, + out_of_buffer, + out_of_sequence, + req_cqe_error, + req_cqe_flush_error, + resp_cqe_error, + resp_cqe_flush_error, + resp_remote_access_errors, + req_rx_pkt_seq_err, + req_rx_rnr_retry_err, + req_rx_rmt_acc_err, + req_rx_rmt_req_err, + req_rx_oper_err, + req_rx_impl_nak_seq_err, + req_rx_cqe_err, + req_rx_cqe_flush, + req_rx_dup_response, + req_rx_inval_pkts, + req_tx_loc_acc_err, + req_tx_loc_oper_err, + req_tx_mem_mgmt_err, + req_tx_retry_excd_err, + req_tx_loc_sgl_inv_err, + resp_rx_dup_request, + resp_rx_outof_buf, + resp_rx_outouf_seq, + resp_rx_cqe_err, + resp_rx_cqe_flush, + resp_rx_loc_len_err, + resp_rx_inval_request, + resp_rx_loc_oper_err, + resp_rx_outof_atomic, + resp_tx_pkt_seq_err, + resp_tx_rmt_inval_req_err, + resp_tx_rmt_acc_err, + resp_tx_rmt_oper_err, + resp_tx_rnr_retry_err, + resp_tx_loc_sgl_inv_err, + resp_rx_s0_table_err, + 
resp_rx_ccl_cts_outouf_seq, + tx_rdma_ack_timeout, + tx_rdma_ccl_cts_ack_timeout, + rx_rdma_mtu_discard_pkts +]` +- **CRITICAL_ERROR_FIELDS**: `['unrecoverable_err', 'res_tx_pci_err', 'res_rx_pci_err', 'res_mem_err']` + ## Data Analyzer Class RocmAnalyzer ### Description @@ -1606,6 +1802,29 @@ Arguments for journal analyzer - **ratio**: `float` - **memory_threshold**: `str` +## Analyzer Args Class NicAnalyzerArgs + +### Description + +Analyzer args for niccli/nicctl data, with expected_values keyed by canonical command key. + +**Bases**: ['AnalyzerArgs'] + +**Link to code**: [analyzer_args.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/nic/analyzer_args.py) + +### Annotations / fields + +- **expected_values**: `Optional[Dict[str, Dict[str, Any]]]` +- **performance_profile_expected**: `str` +- **support_rdma_disabled_values**: `List[str]` +- **pcie_relaxed_ordering_expected**: `str` +- **expected_qos_prio_map**: `Optional[Dict[Any, Any]]` +- **expected_qos_pfc_enabled**: `Optional[int]` +- **expected_qos_tsa_map**: `Optional[Dict[Any, Any]]` +- **expected_qos_tc_bandwidth**: `Optional[List[int]]` +- **require_qos_consistent_across_adapters**: `bool` +- **nicctl_log_error_regex**: `Optional[List[Dict[str, Any]]]` + ## Analyzer Args Class OsAnalyzerArgs **Bases**: ['AnalyzerArgs'] diff --git a/docs/generate_plugin_doc_bundle.py b/docs/generate_plugin_doc_bundle.py index 43268e2a..0c2c839b 100644 --- a/docs/generate_plugin_doc_bundle.py +++ b/docs/generate_plugin_doc_bundle.py @@ -275,6 +275,33 @@ def extract_regexes_and_args_from_analyzer( return output +def extract_collection_args_from_collector_args(args_cls: Optional[type]) -> List[str]: + """Extract collector/collection args from collector args class for the plugin table.""" + if not inspect.isclass(args_cls): + return [] + output: List[str] = [] + # Prefer model_fields for Pydantic models (includes inherited); fallback to __annotations__ + fields = get_attr(args_cls, "model_fields", 
None) + if fields and isinstance(fields, dict): + # Pydantic v2: model_fields is a dict of field names -> FieldInfo + for key in fields: + try: + finfo = fields[key] + ann = getattr(finfo, "annotation", None) + type_str = format_type_annotation(ann) if ann is not None else "Any" + output.append(f"- `{key}`: {type_str}") + except Exception: + pass + if not output: + anns = get_attr(args_cls, "__annotations__", {}) or {} + for key, value in anns.items(): + type_str = format_type_annotation(value) + output.append(f"- `{key}`: {type_str}") + if output: + output.insert(0, "**Collection Args:**") + return output + + def md_header(text: str, level: int = 2) -> str: return f"{'#' * level} {text}\n\n" @@ -346,6 +373,7 @@ def generate_plugin_table_rows(plugins: List[type]) -> List[List[str]]: col = get_attr(p, "COLLECTOR", None) an = get_attr(p, "ANALYZER", None) args = get_attr(p, "ANALYZER_ARGS", None) + collector_args_cls = get_attr(p, "COLLECTOR_ARGS", None) cmds = [] if inspect.isclass(col): cmds += extract_cmds_from_classvars(col) @@ -363,11 +391,19 @@ def generate_plugin_table_rows(plugins: List[type]) -> List[List[str]]: if inspect.isclass(an): regex_and_args = extract_regexes_and_args_from_analyzer(an, args) + # Extract collection args from collector args class + collection_args_lines = extract_collection_args_from_collector_args(collector_args_cls) + rows.append( [ p.__name__, "
".join(cmds).replace("|", "\\|") if cmds else "-", "
".join(regex_and_args).replace("|", "\\|") if regex_and_args else "-", + ( + "
".join(collection_args_lines).replace("|", "\\|") + if collection_args_lines + else "-" + ), link_anchor(dm, "model") if inspect.isclass(dm) else "-", link_anchor(col, "collector") if inspect.isclass(col) else "-", link_anchor(an, "analyzer") if inspect.isclass(an) else "-", @@ -513,7 +549,8 @@ def all_subclasses(cls: Type) -> set[type]: headers = [ "Plugin", "Collection", - "Analysis", + "Analyzer Args", + "Collection Args", "DataModel", "Collector", "Analyzer", diff --git a/nodescraper/base/__init__.py b/nodescraper/base/__init__.py index c1e8a6bf..8428df4d 100644 --- a/nodescraper/base/__init__.py +++ b/nodescraper/base/__init__.py @@ -25,10 +25,14 @@ ############################################################################### from .inbandcollectortask import InBandDataCollector from .inbanddataplugin import InBandDataPlugin +from .oobanddataplugin import OOBandDataPlugin +from .redfishcollectortask import RedfishDataCollector from .regexanalyzer import RegexAnalyzer __all__ = [ "InBandDataCollector", "InBandDataPlugin", + "OOBandDataPlugin", + "RedfishDataCollector", "RegexAnalyzer", ] diff --git a/nodescraper/base/inbanddataplugin.py b/nodescraper/base/inbanddataplugin.py index 37593a17..13abbea4 100644 --- a/nodescraper/base/inbanddataplugin.py +++ b/nodescraper/base/inbanddataplugin.py @@ -23,16 +23,11 @@ # SOFTWARE. 
# ############################################################################### -import json -import os -from pathlib import Path -from typing import Any, Generic, Optional +from typing import Generic from nodescraper.connection.inband import InBandConnectionManager, SSHConnectionParams from nodescraper.generictypes import TAnalyzeArg, TCollectArg, TDataModel from nodescraper.interfaces import DataPlugin -from nodescraper.models import DataModel -from nodescraper.utils import pascal_to_snake class InBandDataPlugin( @@ -42,133 +37,3 @@ class InBandDataPlugin( """Base class for in band plugins.""" CONNECTION_TYPE = InBandConnectionManager - - @classmethod - def find_datamodel_path_in_run(cls, run_path: str) -> Optional[str]: - """Find this plugin's collector datamodel file under a scraper run directory. - - Args: - run_path: Path to a scraper log run directory (e.g. scraper_logs_*). - - Returns: - Absolute path to the datamodel file, or None if not found. - """ - run_path = os.path.abspath(run_path) - if not os.path.isdir(run_path): - return None - collector_cls = getattr(cls, "COLLECTOR", None) - data_model_cls = getattr(cls, "DATA_MODEL", None) - if not collector_cls or not data_model_cls: - return None - collector_dir = os.path.join( - run_path, - pascal_to_snake(cls.__name__), - pascal_to_snake(collector_cls.__name__), - ) - if not os.path.isdir(collector_dir): - return None - result_path = os.path.join(collector_dir, "result.json") - if not os.path.isfile(result_path): - return None - try: - res_payload = json.loads(Path(result_path).read_text(encoding="utf-8")) - if res_payload.get("parent") != cls.__name__: - return None - except (json.JSONDecodeError, OSError): - return None - want_json = data_model_cls.__name__.lower() + ".json" - for fname in os.listdir(collector_dir): - low = fname.lower() - if low.endswith("datamodel.json") or low == want_json: - return os.path.join(collector_dir, fname) - if low.endswith(".log"): - return os.path.join(collector_dir, 
fname) - return None - - @classmethod - def load_datamodel_from_path(cls, dm_path: str) -> Optional[TDataModel]: - """Load this plugin's DATA_MODEL from a file path (JSON or .log). - - Args: - dm_path: Path to datamodel JSON or to a .log file (if DATA_MODEL - implements import_model for that format). - - Returns: - Instance of DATA_MODEL or None if load fails. - """ - dm_path = os.path.abspath(dm_path) - if not os.path.isfile(dm_path): - return None - data_model_cls = getattr(cls, "DATA_MODEL", None) - if not data_model_cls: - return None - try: - if dm_path.lower().endswith(".log"): - import_model = getattr(data_model_cls, "import_model", None) - if not callable(import_model): - return None - base_import = getattr(DataModel.import_model, "__func__", DataModel.import_model) - if getattr(import_model, "__func__", import_model) is base_import: - return None - return import_model(dm_path) - with open(dm_path, encoding="utf-8") as f: - data = json.load(f) - return data_model_cls.model_validate(data) - except (json.JSONDecodeError, OSError, Exception): - return None - - @classmethod - def get_extracted_errors(cls, data_model: DataModel) -> Optional[list[str]]: - """Compute extracted errors from datamodel for compare-runs (in memory only). - - Args: - data_model: Loaded DATA_MODEL instance. - - Returns: - Sorted list of error match strings, or None if not applicable. 
- """ - get_content = getattr(data_model, "get_compare_content", None) - if not callable(get_content): - return None - try: - content = get_content() - except Exception: - return None - if not isinstance(content, str): - return None - analyzer_cls = getattr(cls, "ANALYZER", None) - if not analyzer_cls: - return None - get_matches = getattr(analyzer_cls, "get_error_matches", None) - if not callable(get_matches): - return None - try: - matches = get_matches(content) - return sorted(matches) if matches is not None else None - except Exception: - return None - - @classmethod - def load_run_data(cls, run_path: str) -> Optional[dict[str, Any]]: - """Load this plugin's run data from a scraper run directory for comparison. - - Args: - run_path: Path to a scraper log run directory or to a datamodel file. - - Returns: - Dict suitable for diffing with another run, or None if not found. - """ - run_path = os.path.abspath(run_path) - if not os.path.exists(run_path): - return None - dm_path = run_path if os.path.isfile(run_path) else cls.find_datamodel_path_in_run(run_path) - if not dm_path: - return None - data_model = cls.load_datamodel_from_path(dm_path) - if data_model is None: - return None - out = data_model.model_dump(mode="json") - extracted = cls.get_extracted_errors(data_model) - if extracted is not None: - out["extracted_errors"] = extracted - return out diff --git a/nodescraper/base/oobanddataplugin.py b/nodescraper/base/oobanddataplugin.py new file mode 100644 index 00000000..c88ffc91 --- /dev/null +++ b/nodescraper/base/oobanddataplugin.py @@ -0,0 +1,48 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from typing import Generic + +from nodescraper.connection.redfish import ( + RedfishConnectionManager, + RedfishConnectionParams, +) +from nodescraper.generictypes import TAnalyzeArg, TCollectArg, TDataModel +from nodescraper.interfaces import DataPlugin + + +class OOBandDataPlugin( + DataPlugin[ + RedfishConnectionManager, + RedfishConnectionParams, + TDataModel, + TCollectArg, + TAnalyzeArg, + ], + Generic[TDataModel, TCollectArg, TAnalyzeArg], +): + """Base class for out-of-band (OOB) plugins that use Redfish connection.""" + + CONNECTION_TYPE = RedfishConnectionManager diff --git a/nodescraper/base/redfishcollectortask.py b/nodescraper/base/redfishcollectortask.py new file mode 100644 index 00000000..b8401213 --- /dev/null +++ b/nodescraper/base/redfishcollectortask.py @@ -0,0 +1,79 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +import logging +from typing import Generic, Optional, Union + +from nodescraper.connection.redfish import RedfishConnection, RedfishGetResult +from nodescraper.enums import EventPriority +from nodescraper.generictypes import TCollectArg, TDataModel +from nodescraper.interfaces import DataCollector, TaskResultHook +from nodescraper.models import SystemInfo + + +class RedfishDataCollector( + DataCollector[RedfishConnection, TDataModel, TCollectArg], + Generic[TDataModel, TCollectArg], +): + """Base class for data collectors that use a Redfish connection.""" + + def __init__( + self, + system_info: SystemInfo, + connection: RedfishConnection, + logger: Optional[logging.Logger] = None, + max_event_priority_level: Union[EventPriority, str] = EventPriority.CRITICAL, + parent: Optional[str] = None, + task_result_hooks: Optional[list[TaskResultHook]] = None, + **kwargs, + ): + super().__init__( + system_info=system_info, + connection=connection, + logger=logger, + max_event_priority_level=max_event_priority_level, + parent=parent, + task_result_hooks=task_result_hooks, + **kwargs, + ) + + def _run_redfish_get( + self, + path: str, + log_artifact: bool = True, + ) -> RedfishGetResult: + """Run a Redfish GET request and return the result. + + Args: + path: Redfish URI path + log_artifact: If True, append the result to self.result.artifacts. + + Returns: + RedfishGetResult: path, success, data (or error), status_code. 
+ """ + res = self.connection.run_get(path) + if log_artifact: + self.result.artifacts.append(res) + return res diff --git a/nodescraper/cli/cli.py b/nodescraper/cli/cli.py index fe41cbab..f4e2fe86 100644 --- a/nodescraper/cli/cli.py +++ b/nodescraper/cli/cli.py @@ -50,9 +50,14 @@ ) from nodescraper.cli.inputargtypes import ModelArgHandler, json_arg, log_path_arg from nodescraper.configregistry import ConfigRegistry +from nodescraper.connection.redfish import ( + RedfishConnection, + get_oem_diagnostic_allowable_values, +) +from nodescraper.connection.redfish.redfish_params import RedfishConnectionParams from nodescraper.constants import DEFAULT_LOGGER from nodescraper.enums import ExecutionStatus, SystemInteractionLevel, SystemLocation -from nodescraper.models import SystemInfo +from nodescraper.models import PluginConfig, SystemInfo from nodescraper.pluginexecutor import PluginExecutor from nodescraper.pluginregistry import PluginRegistry @@ -171,6 +176,7 @@ def build_parser( ) subparsers = parser.add_subparsers(dest="subcmd", help="Subcommands") + subparsers.default = "run-plugins" summary_parser = subparsers.add_parser( "summary", @@ -259,6 +265,17 @@ def build_parser( dest="dont_truncate", help="Do not truncate the Message column; show full error text and all errors (not just first 3)", ) + + show_redfish_allowable_parser = subparsers.add_parser( + "show-redfish-oem-allowable", + help="Fetch OEM diagnostic allowable types from Redfish LogService (for oem_diagnostic_types_allowable)", + ) + show_redfish_allowable_parser.add_argument( + "--log-service-path", + required=True, + help="Redfish path to LogService (e.g. 
redfish/v1/Systems/UBB/LogServices/DiagLogs)", + ) + config_builder_parser.add_argument( "--plugins", nargs="*", @@ -354,6 +371,14 @@ def main(arg_input: Optional[list[str]] = None): plugin_reg = PluginRegistry() config_reg = ConfigRegistry() + # Add synthetic "AllPlugins" config that includes every registered plugin + config_reg.configs["AllPlugins"] = PluginConfig( + name="AllPlugins", + desc="Run all registered plugins with default arguments", + global_args={}, + plugins={name: {} for name in plugin_reg.plugins}, + result_collators={}, + ) parser, plugin_subparser_map = build_parser(plugin_reg, config_reg) try: @@ -409,6 +434,40 @@ def main(arg_input: Optional[list[str]] = None): ) sys.exit(0) + if parsed_args.subcmd == "show-redfish-oem-allowable": + if not parsed_args.connection_config: + parser.error("show-redfish-oem-allowable requires --connection-config") + raw = parsed_args.connection_config.get("RedfishConnectionManager") + if not raw: + logger.error("Connection config must contain RedfishConnectionManager") + sys.exit(1) + params = RedfishConnectionParams.model_validate(raw) + password = params.password.get_secret_value() if params.password else None + base_url = f"{'https' if params.use_https else 'http'}://{params.host}" + ( + f":{params.port}" if params.port else "" + ) + conn = RedfishConnection( + base_url=base_url, + username=params.username, + password=password, + timeout=params.timeout_seconds, + use_session_auth=params.use_session_auth, + verify_ssl=params.verify_ssl, + api_root=params.api_root, + ) + try: + conn._ensure_session() + allowable = get_oem_diagnostic_allowable_values(conn, parsed_args.log_service_path) + if allowable is None: + logger.warning( + "Could not read OEMDiagnosticDataType@Redfish.AllowableValues from LogService" + ) + sys.exit(1) + print(json.dumps(allowable, indent=2)) # noqa: T201 + finally: + conn.close() + sys.exit(0) + if parsed_args.subcmd == "gen-plugin-config": if parsed_args.reference_config_from_logs: @@ 
-480,21 +539,27 @@ def main(arg_input: Optional[list[str]] = None): dump_results_to_csv(results, sname, log_path, timestamp, logger) if parsed_args.reference_config: - ref_config = generate_reference_config(results, plugin_reg, logger) - if log_path: - path = os.path.join(log_path, "reference_config.json") + if any(result.status > ExecutionStatus.WARNING for result in results): + logger.warning("Skipping reference config write because one or more plugins failed") else: - path = os.path.join(os.getcwd(), "reference_config.json") - try: - with open(path, "w") as f: - json.dump( - ref_config.model_dump(mode="json", exclude_none=True), - f, - indent=2, - ) - logger.info("Reference config written to: %s", path) - except Exception as exp: - logger.error(exp) + merged_plugin_config = PluginExecutor.merge_configs(plugin_config_inst_list) + ref_config = generate_reference_config( + results, plugin_reg, logger, run_plugin_config=merged_plugin_config + ) + if log_path: + path = os.path.join(log_path, "reference_config.json") + else: + path = os.path.join(os.getcwd(), "reference_config.json") + try: + with open(path, "w") as f: + json.dump( + ref_config.model_dump(mode="json", exclude_none=True), + f, + indent=2, + ) + logger.info("Reference config written to: %s", path) + except Exception as exp: + logger.error(exp) if any(result.status > ExecutionStatus.WARNING for result in results): sys.exit(1) diff --git a/nodescraper/cli/helper.py b/nodescraper/cli/helper.py index 173015a9..41e30ede 100644 --- a/nodescraper/cli/helper.py +++ b/nodescraper/cli/helper.py @@ -316,20 +316,27 @@ def extract_analyzer_args_from_model( def generate_reference_config( - results: list[PluginResult], plugin_reg: PluginRegistry, logger: logging.Logger + results: list[PluginResult], + plugin_reg: PluginRegistry, + logger: logging.Logger, + run_plugin_config: Optional[PluginConfig] = None, ) -> PluginConfig: - """Generate reference config from plugin results + """Generate reference config from plugin 
results. Args: - results (list[PluginResult]): list of plugin results - plugin_reg (PluginRegistry): registry containing all registered plugins - logger (logging.Logger): logger + results: List of plugin results from the run. + plugin_reg: Registry containing all registered plugins. + logger: Logger instance. + run_plugin_config: Optional merged plugin config used for the run; when provided, its collection_args are carried into the reference config. Returns: - PluginConfig: holds model that defines final reference config + PluginConfig: Reference config with plugins dict containing + collection_args and analysis_args for each successful plugin. """ plugin_config = PluginConfig() plugins = {} + run_plugins = (run_plugin_config.plugins if run_plugin_config else {}) or {} + for obj in results: if obj.result_data.collection_result.status != ExecutionStatus.OK: logger.warning( @@ -349,6 +356,10 @@ def generate_reference_config( if obj.source not in plugins: plugins[obj.source] = {} + run_args = run_plugins.get(obj.source) or {} + if run_args.get("collection_args"): + plugins[obj.source]["collection_args"] = dict(run_args["collection_args"]) + a_args = extract_analyzer_args_from_model(plugin, data_model, logger) if a_args: plugins[obj.source]["analysis_args"] = a_args.model_dump(exclude_none=True) diff --git a/nodescraper/connection/inband/inbandremote.py b/nodescraper/connection/inband/inbandremote.py index d5254468..9e2415ed 100644 --- a/nodescraper/connection/inband/inbandremote.py +++ b/nodescraper/connection/inband/inbandremote.py @@ -157,8 +157,8 @@ def run_command( stdin.flush() stdin.channel.shutdown_write() - stdout_str = stdout.read().decode("utf-8") - stderr_str = stderr.read().decode("utf-8") + stdout_str = stdout.read().decode("utf-8", errors="replace") + stderr_str = stderr.read().decode("utf-8", errors="replace") exit_code = stdout.channel.recv_exit_status() except TimeoutError: stderr_str = "Command timed out" diff --git a/nodescraper/connection/redfish/__init__.py b/nodescraper/connection/redfish/__init__.py new file mode 
100644 index 00000000..ee812113 --- /dev/null +++ b/nodescraper/connection/redfish/__init__.py @@ -0,0 +1,48 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from .redfish_connection import ( + RedfishConnection, + RedfishConnectionError, + RedfishGetResult, +) +from .redfish_manager import RedfishConnectionManager +from .redfish_oem_diag import ( + collect_oem_diagnostic_data, + get_oem_diagnostic_allowable_values, +) +from .redfish_params import RedfishConnectionParams +from .redfish_path import RedfishPath + +__all__ = [ + "RedfishConnection", + "RedfishConnectionError", + "RedfishGetResult", + "RedfishConnectionManager", + "RedfishConnectionParams", + "RedfishPath", + "collect_oem_diagnostic_data", + "get_oem_diagnostic_allowable_values", +] diff --git a/nodescraper/connection/redfish/redfish_connection.py b/nodescraper/connection/redfish/redfish_connection.py new file mode 100644 index 00000000..8711ff4a --- /dev/null +++ b/nodescraper/connection/redfish/redfish_connection.py @@ -0,0 +1,206 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +from __future__ import annotations + +from typing import Any, ClassVar, Optional, Union +from urllib.parse import urljoin + +import requests +import urllib3 # type: ignore[import-untyped] +from pydantic import BaseModel +from requests import Response +from requests.auth import HTTPBasicAuth + +from .redfish_path import RedfishPath + +DEFAULT_REDFISH_API_ROOT = "redfish/v1" + + +class RedfishGetResult(BaseModel): + """Artifact for the result of a Redfish GET request. + Logged under the same filename as inband command artifacts (command_artifacts.json). + """ + + ARTIFACT_LOG_BASENAME: ClassVar[str] = "command_artifacts" + + path: str + success: bool + data: Optional[dict[str, Any]] = None + error: Optional[str] = None + status_code: Optional[int] = None + + +class RedfishConnectionError(Exception): + """Raised when a Redfish API request fails.""" + + def __init__(self, message: str, response: Optional[Response] = None): + super().__init__(message) + self.response = response + + +class RedfishConnection: + """Redfish REST client for GET requests.""" + + def __init__( + self, + base_url: str, + username: str, + password: Optional[str] = None, + timeout: float = 10.0, + use_session_auth: bool = True, + verify_ssl: bool = True, + api_root: Optional[str] = None, + ): + self.base_url = base_url.rstrip("/") + self.api_root = (api_root or DEFAULT_REDFISH_API_ROOT).strip("/") + self.username = username + self.password = password or "" + self.timeout = timeout + self.use_session_auth = use_session_auth + self.verify_ssl = verify_ssl + self._session: Optional[requests.Session] = None + self._session_token: Optional[str] = None 
+ self._session_uri: Optional[str] = None # For logout DELETE + + def _ensure_session(self) -> requests.Session: + if self._session is None: + if not self.verify_ssl: + urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) + self._session = requests.Session() + self._session.verify = self.verify_ssl + self._session.headers["Content-Type"] = "application/json" + self._session.headers["Accept"] = "application/json" + if self.use_session_auth and self.password: + self._login_session() + elif self.password: + self._session.auth = HTTPBasicAuth(self.username, self.password) + return self._session + + def _login_session(self) -> None: + """Create a Redfish session and set X-Auth-Token.""" + assert self._session is not None + sess_url = urljoin(self.base_url + "/", f"{self.api_root}/SessionService/Sessions") + payload = {"UserName": self.username, "Password": self.password} + resp = self._session.post( + sess_url, + json=payload, + timeout=self.timeout, + ) + if not resp.ok: + raise RedfishConnectionError( + f"Session login failed: {resp.status_code} {resp.reason}", response=resp + ) + self._session_token = resp.headers.get("X-Auth-Token") + location = resp.headers.get("Location") + if location: + self._session_uri = ( + location + if location.startswith("http") + else urljoin(self.base_url + "/", location.lstrip("/")) + ) + if self._session_token: + self._session.headers["X-Auth-Token"] = self._session_token + else: + self._session.auth = HTTPBasicAuth(self.username, self.password) + + def get(self, path: RedfishPath) -> dict[str, Any]: + """GET a Redfish path and return the JSON body. path must be a RedfishPath.""" + path_str = str(path) + resp = self.get_response(path_str) + if not resp.ok: + raise RedfishConnectionError( + f"GET {path_str} failed: {resp.status_code} {resp.reason}", + response=resp, + ) + return resp.json() + + def get_response(self, path: Union[str, "RedfishPath"]) -> Response: + """GET a Redfish path and return the raw Response. 
path may be a string or RedfishPath.""" + path = str(path) + session = self._ensure_session() + url = path if path.startswith("http") else urljoin(self.base_url + "/", path.lstrip("/")) + return session.get(url, timeout=self.timeout) + + def post( + self, path: Union[str, "RedfishPath"], json: Optional[dict[str, Any]] = None + ) -> Response: + """POST to a Redfish path and return the raw Response. path may be a string or RedfishPath.""" + path = str(path) + session = self._ensure_session() + url = path if path.startswith("http") else urljoin(self.base_url + "/", path.lstrip("/")) + return session.post(url, json=json or {}, timeout=self.timeout) + + def run_get(self, path: Union[str, RedfishPath]) -> RedfishGetResult: + """Run a Redfish GET request and return a result object. path may be a string or RedfishPath.""" + path_norm = str(path).strip() + if not path_norm.startswith("/"): + path_norm = "/" + path_norm + path_obj = RedfishPath(path_norm.strip("/")) if isinstance(path, str) else path + try: + data = self.get(path_obj) + return RedfishGetResult( + path=path_norm, + success=True, + data=data, + status_code=200, + ) + except RedfishConnectionError as e: + status = e.response.status_code if e.response is not None else None + return RedfishGetResult( + path=path_norm, + success=False, + error=str(e), + status_code=status, + ) + except Exception as e: + return RedfishGetResult( + path=path_norm, + success=False, + error=str(e), + status_code=None, + ) + + def get_service_root(self) -> dict[str, Any]: + """GET service root (e.g. 
/redfish/v1/).""" + return self.get(RedfishPath(self.api_root)) + + def close(self) -> None: + """Release session and logout if session auth was used.""" + if self._session and self._session_uri: + try: + self._session.delete(self._session_uri, timeout=self.timeout) + except Exception: + pass + self._session = None + self._session_token = None + self._session_uri = None + + def __enter__(self) -> RedfishConnection: + self._ensure_session() + return self + + def __exit__(self, *args: Any) -> None: + self.close() diff --git a/nodescraper/connection/redfish/redfish_manager.py b/nodescraper/connection/redfish/redfish_manager.py new file mode 100644 index 00000000..4413ee86 --- /dev/null +++ b/nodescraper/connection/redfish/redfish_manager.py @@ -0,0 +1,142 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from __future__ import annotations + +from logging import Logger +from typing import Optional, Union + +from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus +from nodescraper.interfaces.connectionmanager import ConnectionManager +from nodescraper.interfaces.taskresulthook import TaskResultHook +from nodescraper.models import SystemInfo, TaskResult +from nodescraper.utils import get_exception_traceback + +from .redfish_connection import RedfishConnection, RedfishConnectionError +from .redfish_params import RedfishConnectionParams + + +def _build_base_url(host: str, port: Optional[int], use_https: bool) -> str: + scheme = "https" if use_https else "http" + host_str = str(host) + if port is not None: + return f"{scheme}://{host_str}:{port}" + return f"{scheme}://{host_str}" + + +class RedfishConnectionManager(ConnectionManager[RedfishConnection, RedfishConnectionParams]): + """Connection manager for Redfish (BMC) API.""" + + def __init__( + self, + system_info: SystemInfo, + logger: Optional[Logger] = None, + max_event_priority_level: Union[EventPriority, str] = EventPriority.CRITICAL, + parent: Optional[str] = None, + task_result_hooks: Optional[list[TaskResultHook]] = None, + connection_args: Optional[RedfishConnectionParams] = None, + **kwargs, + ): + super().__init__( + system_info, + logger, + max_event_priority_level, + parent, + task_result_hooks, + connection_args, + **kwargs, + ) + + def connect(self) -> TaskResult: + """Connect to the Redfish service and perform a simple GET to verify.""" + if not self.connection_args: + self._log_event( + category=EventCategory.RUNTIME, + description="No Redfish connection parameters provided", + priority=EventPriority.CRITICAL, + console_log=True, + ) + self.result.status = ExecutionStatus.EXECUTION_FAILURE + return self.result + + # Accept dict from JSON config; convert to RedfishConnectionParams + raw = 
self.connection_args + if isinstance(raw, dict): + params = RedfishConnectionParams.model_validate(raw) + elif isinstance(raw, RedfishConnectionParams): + params = raw + else: + self._log_event( + category=EventCategory.RUNTIME, + description="Redfish connection_args must be dict or RedfishConnectionParams", + priority=EventPriority.CRITICAL, + console_log=True, + ) + self.result.status = ExecutionStatus.EXECUTION_FAILURE + return self.result + + password = params.password.get_secret_value() if params.password else None + base_url = _build_base_url(str(params.host), params.port, params.use_https) + + try: + self.logger.info("Connecting to Redfish at %s", base_url) + self.connection = RedfishConnection( + base_url=base_url, + username=params.username, + password=password, + timeout=params.timeout_seconds, + use_session_auth=params.use_session_auth, + verify_ssl=params.verify_ssl, + api_root=params.api_root, + ) + self.connection._ensure_session() + self.connection.get_service_root() + except RedfishConnectionError as exc: + self._log_event( + category=EventCategory.RUNTIME, + description=f"Redfish connection error: {exc}", + data=get_exception_traceback(exc) if exc.response is None else None, + priority=EventPriority.CRITICAL, + console_log=True, + ) + self.result.status = ExecutionStatus.EXECUTION_FAILURE + self.connection = None + except Exception as exc: + self._log_event( + category=EventCategory.RUNTIME, + description=f"Redfish connection failed: {exc}", + data=get_exception_traceback(exc), + priority=EventPriority.CRITICAL, + console_log=True, + ) + self.result.status = ExecutionStatus.EXECUTION_FAILURE + self.connection = None + return self.result + + def disconnect(self) -> None: + """Disconnect and release the Redfish session.""" + if self.connection is not None: + self.connection.close() + super().disconnect() diff --git a/nodescraper/connection/redfish/redfish_oem_diag.py b/nodescraper/connection/redfish/redfish_oem_diag.py new file mode 100644 index 
00000000..94133337 --- /dev/null +++ b/nodescraper/connection/redfish/redfish_oem_diag.py @@ -0,0 +1,381 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from __future__ import annotations + +import json +import logging +import re +import time +from pathlib import Path +from typing import Any, Optional + +from requests import Response +from requests.status_codes import codes + +from nodescraper.enums import TaskState + +from .redfish_connection import RedfishConnection, RedfishConnectionError +from .redfish_path import RedfishPath + +_module_logger = logging.getLogger(__name__) + +_LOG_RESPONSE_BODY_LIMIT = 1500 + + +def _log_collect_diag_response( + log: logging.Logger, + status: int, + body: Any, + raw_text: str = "", +) -> None: + """Log CollectDiagnosticData response at INFO when no Location/TaskMonitor found.""" + if isinstance(body, dict): + snippet = json.dumps(body, indent=2) + else: + snippet = raw_text or str(body) + if len(snippet) > _LOG_RESPONSE_BODY_LIMIT: + snippet = snippet[:_LOG_RESPONSE_BODY_LIMIT] + "... (truncated)" + log.info( + "CollectDiagnosticData response (no Location/TaskMonitor): status=%s body=%s", + status, + snippet, + ) + + +# Redfish JSON key for resource link +RF_ODATA_ID = "@odata.id" + +# @Redfish.AllowableValues: Redfish annotation for the list of allowable values for a string +REDFISH_ANNOTATION_ALLOWABLE_VALUES = "Redfish.AllowableValues" + +# OEMDiagnosticDataType: LogService CollectDiagnosticData action parameter +OEM_DIAGNOSTIC_DATA_TYPE_PARAM = "OEMDiagnosticDataType" + +RF_ANNOTATION_ALLOWABLE = f"{OEM_DIAGNOSTIC_DATA_TYPE_PARAM}@{REDFISH_ANNOTATION_ALLOWABLE_VALUES}" + +# Default max wait for BMC task (seconds) +DEFAULT_TASK_TIMEOUT_S = 1800 + + +def get_oem_diagnostic_allowable_values( + conn: RedfishConnection, + log_service_path: str, +) -> Optional[list[str]]: + """GET the LogService and return OEMDiagnosticDataType@Redfish.AllowableValues if present. + + Args: + conn: Redfish connection (session established). + log_service_path: Path to the LogService (e.g. 
redfish/v1/Systems/UBB/LogServices/DiagLogs). + + Returns: + List of allowable type strings, or None if not found / GET failed. + """ + path = log_service_path.strip().strip("/") + try: + data = conn.get(RedfishPath(path)) + except RedfishConnectionError: + return None + if not isinstance(data, dict): + return None + actions = data.get("Actions") or {} + collect_action = actions.get("LogService.CollectDiagnosticData") or actions.get( + "#LogService.CollectDiagnosticData" + ) + if isinstance(collect_action, dict): + allow = collect_action.get(RF_ANNOTATION_ALLOWABLE) + if isinstance(allow, list) and all(isinstance(x, str) for x in allow): + return list(allow) + return None + + +def _resolve_path(conn: RedfishConnection, path: str) -> str: + """Return full URL for a path (relative to base_url).""" + if path.startswith("http"): + return path + path = path.lstrip("/") + base = conn.base_url.rstrip("/") + return f"{base}/{path}" + + +def _get_path_from_connection(conn: RedfishConnection, path: str) -> str: + """Return path relative to BMC (no host). For use with conn.get_response(path).""" + if path.startswith("http"): + # Strip base URL to get path under /redfish/v1/... + base = conn.base_url.rstrip("/") + if path.startswith(base + "/"): + return path[len(base) :].lstrip("/") + return path + return path.lstrip("/") + + +def _get_task_monitor_uri(body: dict, conn: RedfishConnection) -> Optional[str]: + """Extract task monitor URI from a Task-like body (DSP0266 or OEM variants). + + TaskMonitor may be a string URI or an object with @odata.id (e.g. TaskService/TaskMonitors/378). 
+ """ + + def _resolve_uri(uri: str) -> str: + if not uri.startswith("http"): + uri = _resolve_path(conn, uri.strip().lstrip("/")) + return uri + + for key in ("TaskMonitor", "Monitor", "TaskMonitorUri"): + val = body.get(key) + if isinstance(val, str) and val.strip(): + return _resolve_uri(val) + if isinstance(val, dict): + odata_id = val.get(RF_ODATA_ID) + if isinstance(odata_id, str) and odata_id.strip(): + return _resolve_uri(odata_id) + oem = body.get("Oem") + if isinstance(oem, dict): + for vendor_dict in oem.values(): + if isinstance(vendor_dict, dict): + for k in ("TaskMonitor", "Monitor", "TaskMonitorUri"): + val = vendor_dict.get(k) + if isinstance(val, str) and val.strip(): + return _resolve_uri(val) + if isinstance(val, dict): + odata_id = val.get(RF_ODATA_ID) + if isinstance(odata_id, str) and odata_id.strip(): + return _resolve_uri(odata_id) + odata_id = body.get(RF_ODATA_ID) + if isinstance(odata_id, str) and odata_id.strip(): + base_uri = _resolve_uri(odata_id) + return f"{base_uri.rstrip('/')}/Monitor" + return None + + +# Workaround for LogEntry URL: some BMCs 404 when URL includes port +def _strip_port_from_url(url: str) -> Optional[str]: + """Return URL with port removed from authority (e.g. 
host:443 -> host).""" + if re.search(r"://[^/]+:\d+", url): + return re.sub(r"(://[^:/]+):\d+", r"\1", url, count=1) + return None + + +def _download_log_and_save( + conn: RedfishConnection, + log_entry_json: dict[str, Any], + oem_diagnostic_type: str, + output_dir: Optional[Path], + log: logging.Logger, +) -> Optional[bytes]: + # Download binary log if AdditionalDataURI present + log_bytes: Optional[bytes] = None + data_uri = log_entry_json.get("AdditionalDataURI") + if data_uri: + data_path = _get_path_from_connection(conn, data_uri) + data_resp = conn.get_response(data_path) + if data_resp.status_code == codes.ok: + log_bytes = data_resp.content + + if output_dir: + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + if log_bytes is not None: + archive_path = output_dir / f"{oem_diagnostic_type}.tar.xz" + archive_path.write_bytes(log_bytes) + log.info("Log written to disk: %s -> %s", oem_diagnostic_type, archive_path.name) + metadata_file = output_dir / f"{oem_diagnostic_type}_log_entry.json" + try: + metadata_file.write_text(json.dumps(log_entry_json, indent=2), encoding="utf-8") + log.info( + "Log metadata written to disk: %s -> %s", oem_diagnostic_type, metadata_file.name + ) + except Exception as e: + log.exception("Failed to write log metadata to %s: %s", metadata_file, e) + + return log_bytes + + +def collect_oem_diagnostic_data( + conn: RedfishConnection, + log_service_path: str, + oem_diagnostic_type: Optional[str] = None, + task_timeout_s: int = DEFAULT_TASK_TIMEOUT_S, + output_dir: Optional[Path] = None, + validate_type: bool = False, + allowed_types: Optional[list[str]] = None, + logger: Optional[logging.Logger] = None, +) -> tuple[Optional[bytes], Optional[dict[str, Any]], Optional[str]]: + """ + Initiate OEM diagnostic collection, poll until done, download log and metadata. + + Args: + conn: Redfish connection (session already established). + log_service_path: Path to LogService under Systems, e.g. 
+        "redfish/v1/Systems/UBB/LogServices/DiagLogs" (no leading slash).
+        oem_diagnostic_type: OEM diagnostic type sent as OEMDiagnosticDataType (e.g. "JournalControl", "AllLogs"). Required.
+        task_timeout_s: Max seconds to wait for the BMC task to complete.
+        output_dir: If set, save log archive and LogEntry JSON here.
+        validate_type: If True, require oem_diagnostic_type to be in allowed_types.
+        allowed_types: Allowable OEM diagnostic types for validation when validate_type is True.
+        logger: Logger to use; defaults to the module logger.
+
+    Returns:
+        (log_bytes, log_entry_metadata_dict, error_message).
+        On success: (bytes, dict, None). On failure: (None, None, error_str).
+    """
+    log = logger if logger is not None else _module_logger
+    if not oem_diagnostic_type or not oem_diagnostic_type.strip():
+        return None, None, "oem_diagnostic_type is required"
+    if validate_type and allowed_types and oem_diagnostic_type not in allowed_types:
+        return (
+            None,
+            None,
+            f"oem_diagnostic_type {oem_diagnostic_type!r} not in allowed types",
+        )
+    path_prefix = log_service_path.rstrip("/")
+    action_path = f"{path_prefix}/Actions/LogService.CollectDiagnosticData"
+    payload = {"DiagnosticDataType": "OEM", "OEMDiagnosticDataType": oem_diagnostic_type}
+
+    try:
+        resp: Response = conn.post(action_path, json=payload)
+    except RedfishConnectionError as e:
+        return None, None, str(e)
+
+    if resp.status_code not in (codes.ok, codes.accepted):
+        return (
+            None,
+            None,
+            f"Unexpected status {resp.status_code} for CollectDiagnosticData: {resp.text}",
+        )
+
+    # DSP0266 12.2: 202 shall include Location (task monitor URI), optionally Retry-After
+    location_header = resp.headers.get("Location") or resp.headers.get("Content-Location")
+    if location_header and not location_header.startswith("http"):
+        location_header = _resolve_path(conn, location_header)
+    sleep_s = int(resp.headers.get("Retry-After", 1) or 1)
+    try:
+        oem_response = resp.json()
+    except Exception:
+        oem_response = {}
+
+    # 200 OK with TaskState=Completed: synchronous completion, body is the
Task + task_json: Optional[dict[str, Any]] = None + if resp.status_code == codes.ok and isinstance(oem_response, dict): + if oem_response.get("TaskState") == TaskState.completed.value: + headers_list = oem_response.get("Payload", {}).get("HttpHeaders", []) or [] + if any(isinstance(h, str) and "Location:" in h for h in headers_list): + task_json = oem_response + + # When TaskMonitor is implemented + task_monitor: Optional[str] = None + task_path: Optional[str] = None + if task_json is None: + task_monitor = location_header or _get_task_monitor_uri(oem_response, conn) + if oem_response.get(RF_ODATA_ID): + task_path = _get_path_from_connection(conn, oem_response[RF_ODATA_ID]) + if not task_monitor and task_path: + task_resp = conn.get_response(task_path) + if task_resp.status_code == codes.ok: + fetched = task_resp.json() + task_monitor = _get_task_monitor_uri(fetched, conn) + if not task_monitor: + _log_collect_diag_response( + log, resp.status_code, oem_response, getattr(resp, "text", "") or "" + ) + return None, None, "No TaskMonitor in response and no Location header" + + if task_json is None: + assert task_monitor is not None + # Poll task monitor until no longer 202/404 (e.g. GET /redfish/v1/TaskService/TaskMonitors/378) + start = time.time() + poll_resp = None + while True: + if time.time() - start > task_timeout_s: + return None, None, f"Task did not complete within {task_timeout_s}s" + time.sleep(sleep_s) + monitor_path = _get_path_from_connection(conn, task_monitor) + poll_resp = conn.get_response(monitor_path) + if poll_resp.status_code not in (codes.accepted, codes.not_found): + break + + # TaskMonitor response body has @odata.id pointing to the Task (e.g. 
/redfish/v1/TaskService/Tasks/5) + try: + monitor_body = poll_resp.json() if poll_resp else {} + except Exception: + monitor_body = {} + task_uri_from_monitor = ( + monitor_body.get(RF_ODATA_ID) if isinstance(monitor_body, dict) else None + ) + if isinstance(task_uri_from_monitor, str) and task_uri_from_monitor.strip(): + task_path = _get_path_from_connection(conn, task_uri_from_monitor.strip()) + elif not task_path: + task_path = _get_path_from_connection(conn, task_monitor.rstrip("/").rsplit("/", 1)[0]) + task_resp = conn.get_response(task_path) + if task_resp.status_code != codes.ok: + return None, None, f"Task GET failed: {task_resp.status_code}" + task_json = task_resp.json() + if task_json.get("TaskState") != TaskState.completed.value: + return None, None, f"Task did not complete: TaskState={task_json.get('TaskState')}" + + # LogEntry location from Payload.HttpHeaders + headers_list = task_json.get("Payload", {}).get("HttpHeaders", []) or [] + location = None + for header in headers_list: + if isinstance(header, str) and "Location:" in header: + location = header.split("Location:", 1)[-1].strip() + break + if not location: + return None, None, "Location header missing in task Payload.HttpHeaders" + if location.startswith("http"): + log_entry_path = location + else: + log_entry_path = location.lstrip("/") + + # GET LogEntry (some BMCs 404 when URL includes explicit port; try without port first) + log_entry_alt = _strip_port_from_url(log_entry_path) + if log_entry_alt is None and not log_entry_path.startswith("http"): + log_entry_alt = _strip_port_from_url( + conn.base_url.rstrip("/") + "/" + log_entry_path.lstrip("/") + ) + paths_to_try = [log_entry_alt, log_entry_path] if log_entry_alt else [log_entry_path] + log_entry_json = None + first_status: Optional[int] = codes.not_found + first_error = "" + for try_path in paths_to_try: + if try_path is None: + continue + try: + log_entry_resp = conn.get_response(try_path) + if log_entry_resp.status_code == codes.ok: 
+ log_entry_json = log_entry_resp.json() + break + if try_path == paths_to_try[0]: + first_status = log_entry_resp.status_code + except Exception as e: + if try_path == paths_to_try[0]: + first_status = None + first_error = str(e) + continue + if log_entry_json is None: + err = first_error if first_status is None else f"status {first_status}" + return None, None, f"LogEntry GET failed: {err} (GET {log_entry_path})" + + log_bytes = _download_log_and_save(conn, log_entry_json, oem_diagnostic_type, output_dir, log) + return log_bytes, log_entry_json, None diff --git a/nodescraper/connection/redfish/redfish_params.py b/nodescraper/connection/redfish/redfish_params.py new file mode 100644 index 00000000..7d9b5d5f --- /dev/null +++ b/nodescraper/connection/redfish/redfish_params.py @@ -0,0 +1,53 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +from typing import Optional, Union + +from pydantic import BaseModel, ConfigDict, Field, SecretStr +from pydantic.networks import IPvAnyAddress + +from .redfish_connection import DEFAULT_REDFISH_API_ROOT + + +class RedfishConnectionParams(BaseModel): + """Connection parameters for a Redfish (BMC) API endpoint.""" + + model_config = ConfigDict(arbitrary_types_allowed=True) + + host: Union[IPvAnyAddress, str] + username: str + password: Optional[SecretStr] = None + port: Optional[int] = Field(default=None, ge=1, le=65535) + use_https: bool = True + verify_ssl: bool = Field( + default=True, + description="Verify HTTPS server certificate. Set False for BMCs with self-signed certs.", + ) + timeout_seconds: float = Field(default=10.0, gt=0, le=300) + use_session_auth: bool = True + api_root: str = Field( + default=DEFAULT_REDFISH_API_ROOT, + description="Redfish API path (e.g. 'redfish/v1'). Override for a different API version.", + ) diff --git a/nodescraper/connection/redfish/redfish_path.py b/nodescraper/connection/redfish/redfish_path.py new file mode 100644 index 00000000..0cd14024 --- /dev/null +++ b/nodescraper/connection/redfish/redfish_path.py @@ -0,0 +1,61 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +"""Fluent Redfish path builder with pathlib-like / syntax and optional parameter substitution.""" +from __future__ import annotations + +from typing import Dict + + +class RedfishPath: + """Fluent interface for building Redfish URI paths.""" + + def __init__(self, base: str = "") -> None: + self._path = (base or "").strip().strip("/") + self._params: Dict[str, str] = {} + + def __truediv__(self, segment: str) -> RedfishPath: + """Allow path / \"segment\" syntax. 
Leading/trailing slashes on segment are stripped.""" + seg = (segment or "").strip().strip("/") + if not seg: + return RedfishPath(self._path) + new_path = RedfishPath() + new_path._path = f"{self._path}/{seg}" if self._path else seg + new_path._params = dict(self._params) + return new_path + + def __call__(self, **params: str) -> str: + """Substitute placeholders in the path and return the final path string. + + Placeholders use {key}; e.g. path \"Systems/{id}/LogServices/DiagLogs\" with (id=\"UBB\") + returns \"Systems/UBB/LogServices/DiagLogs\". + """ + result = self._path + for key, value in params.items(): + result = result.replace(f"{{{key}}}", value) + return result + + def __str__(self) -> str: + return self._path diff --git a/nodescraper/enums/__init__.py b/nodescraper/enums/__init__.py index 411f8217..ddbd351f 100644 --- a/nodescraper/enums/__init__.py +++ b/nodescraper/enums/__init__.py @@ -29,6 +29,7 @@ from .osfamily import OSFamily from .systeminteraction import SystemInteractionLevel from .systemlocation import SystemLocation +from .taskstate import TaskState __all__ = [ "ExecutionStatus", @@ -37,4 +38,5 @@ "SystemLocation", "EventCategory", "EventPriority", + "TaskState", ] diff --git a/nodescraper/enums/eventcategory.py b/nodescraper/enums/eventcategory.py index 553119a8..42aa6c98 100644 --- a/nodescraper/enums/eventcategory.py +++ b/nodescraper/enums/eventcategory.py @@ -65,6 +65,8 @@ class EventCategory(AutoNameStrEnum): Network, IT issues, Downtime - NETWORK Network configuration, interfaces, routing, neighbors, ethtool data + - TELEMETRY + Telemetry / monitored data checks (e.g. 
Redfish endpoint constraint violations) - RUNTIME Framework issues, does not include content failures - UNKNOWN @@ -85,5 +87,6 @@ class EventCategory(AutoNameStrEnum): BIOS = auto() INFRASTRUCTURE = auto() NETWORK = auto() + TELEMETRY = auto() RUNTIME = auto() UNKNOWN = auto() diff --git a/nodescraper/enums/taskstate.py b/nodescraper/enums/taskstate.py new file mode 100644 index 00000000..46d9d3fa --- /dev/null +++ b/nodescraper/enums/taskstate.py @@ -0,0 +1,44 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +import enum + + +class TaskState(enum.Enum): + """Redfish Task resource TaskState.""" + + new = "New" + starting = "Starting" + running = "Running" + suspended = "Suspended" + interrupted = "Interrupted" + pending = "Pending" + stopping = "Stopping" + completed = "Completed" + killed = "Killed" + exception = "Exception" + service = "Service" + cancelling = "Cancelling" + cancelled = "Cancelled" diff --git a/nodescraper/interfaces/dataplugin.py b/nodescraper/interfaces/dataplugin.py index f4aa622c..1f948529 100644 --- a/nodescraper/interfaces/dataplugin.py +++ b/nodescraper/interfaces/dataplugin.py @@ -23,8 +23,11 @@ # SOFTWARE. # ############################################################################### +import json import logging -from typing import Generic, Optional, Type, Union +import os +from pathlib import Path +from typing import Any, Generic, Optional, Type, Union from nodescraper.enums import EventPriority, ExecutionStatus, SystemInteractionLevel from nodescraper.generictypes import TAnalyzeArg, TCollectArg, TDataModel @@ -33,11 +36,13 @@ from nodescraper.interfaces.plugin import PluginInterface from nodescraper.models import ( AnalyzerArgs, + DataModel, DataPluginResult, PluginResult, SystemInfo, TaskResult, ) +from nodescraper.utils import pascal_to_snake from .connectionmanager import TConnectArg, TConnectionManager from .task import SystemCompatibilityError @@ -211,6 +216,7 @@ def collect( max_event_priority_level=max_event_priority_level, parent=self.__class__.__name__, task_result_hooks=self.task_result_hooks, + log_path=self.log_path, ) self.collection_result, self._data = collection_task.collect_data(collection_args) @@ -369,3 +375,133 @@ def run( analysis_result=self.analysis_result, ), ) + + @classmethod + def find_datamodel_path_in_run(cls, run_path: str) -> Optional[str]: + """Find this plugin's collector datamodel file under a scraper run directory. 
+ + Args: + run_path: Path to a scraper log run directory (e.g. scraper_logs_*). + + Returns: + Absolute path to the datamodel file, or None if not found. + """ + run_path = os.path.abspath(run_path) + if not os.path.isdir(run_path): + return None + collector_cls = getattr(cls, "COLLECTOR", None) + data_model_cls = getattr(cls, "DATA_MODEL", None) + if not collector_cls or not data_model_cls: + return None + collector_dir = os.path.join( + run_path, + pascal_to_snake(cls.__name__), + pascal_to_snake(collector_cls.__name__), + ) + if not os.path.isdir(collector_dir): + return None + result_path = os.path.join(collector_dir, "result.json") + if not os.path.isfile(result_path): + return None + try: + res_payload = json.loads(Path(result_path).read_text(encoding="utf-8")) + if res_payload.get("parent") != cls.__name__: + return None + except (json.JSONDecodeError, OSError): + return None + want_json = data_model_cls.__name__.lower() + ".json" + for fname in os.listdir(collector_dir): + low = fname.lower() + if low.endswith("datamodel.json") or low == want_json: + return os.path.join(collector_dir, fname) + if low.endswith(".log"): + return os.path.join(collector_dir, fname) + return None + + @classmethod + def load_datamodel_from_path(cls, dm_path: str) -> Optional[TDataModel]: + """Load this plugin's DATA_MODEL from a file path (JSON or .log). + + Args: + dm_path: Path to datamodel JSON or to a .log file (if DATA_MODEL + implements import_model for that format). + + Returns: + Instance of DATA_MODEL or None if load fails. 
+        """
+        dm_path = os.path.abspath(dm_path)
+        if not os.path.isfile(dm_path):
+            return None
+        data_model_cls = getattr(cls, "DATA_MODEL", None)
+        if not data_model_cls:
+            return None
+        try:
+            if dm_path.lower().endswith(".log"):
+                import_model = getattr(data_model_cls, "import_model", None)
+                if not callable(import_model):
+                    return None
+                base_import = getattr(DataModel.import_model, "__func__", DataModel.import_model)
+                if getattr(import_model, "__func__", import_model) is base_import:
+                    return None
+                return import_model(dm_path)
+            with open(dm_path, encoding="utf-8") as f:
+                data = json.load(f)
+            return data_model_cls.model_validate(data)
+        except Exception:  # covers json.JSONDecodeError, OSError, and pydantic ValidationError
+            return None
+
+    @classmethod
+    def get_extracted_errors(cls, data_model: DataModel) -> Optional[list[str]]:
+        """Compute extracted errors from datamodel for compare-runs (in memory only).
+
+        Args:
+            data_model: Loaded DATA_MODEL instance.
+
+        Returns:
+            Sorted list of error match strings, or None if not applicable.
+        """
+        get_content = getattr(data_model, "get_compare_content", None)
+        if not callable(get_content):
+            return None
+        try:
+            content = get_content()
+        except Exception:
+            return None
+        if not isinstance(content, str):
+            return None
+        analyzer_cls = getattr(cls, "ANALYZER", None)
+        if not analyzer_cls:
+            return None
+        get_matches = getattr(analyzer_cls, "get_error_matches", None)
+        if not callable(get_matches):
+            return None
+        try:
+            matches = get_matches(content)
+            return sorted(matches) if matches is not None else None
+        except Exception:
+            return None
+
+    @classmethod
+    def load_run_data(cls, run_path: str) -> Optional[dict[str, Any]]:
+        """Load this plugin's run data from a scraper run directory for comparison.
+
+        Args:
+            run_path: Path to a scraper log run directory or to a datamodel file.
+
+        Returns:
+            Dict suitable for diffing with another run, or None if not found.
+        """
+        run_path = os.path.abspath(run_path)
+        if not os.path.exists(run_path):
+            return None
+        dm_path = run_path if os.path.isfile(run_path) else cls.find_datamodel_path_in_run(run_path)
+        if not dm_path:
+            return None
+        data_model = cls.load_datamodel_from_path(dm_path)
+        if data_model is None:
+            return None
+        out = data_model.model_dump(mode="json")
+        extracted = cls.get_extracted_errors(data_model)
+        if extracted is not None:
+            out["extracted_errors"] = extracted
+        return out
diff --git a/nodescraper/models/taskresult.py b/nodescraper/models/taskresult.py
index d04618f4..3a4a2952 100644
--- a/nodescraper/models/taskresult.py
+++ b/nodescraper/models/taskresult.py
@@ -29,7 +29,13 @@
 import os
 from typing import Any, Optional
 
-from pydantic import BaseModel, Field, field_serializer, field_validator
+from pydantic import (
+    BaseModel,
+    Field,
+    field_serializer,
+    field_validator,
+    model_validator,
+)
 
 from nodescraper.enums import EventPriority, ExecutionStatus
 from nodescraper.utils import get_unique_filename, pascal_to_snake
@@ -54,6 +60,20 @@ class TaskResult(BaseModel):
     task: Optional[str] = None
     parent: Optional[str] = None
     artifacts: list[BaseModel] = Field(default_factory=list)
+    details: dict = Field(default_factory=dict)
+
+    @model_validator(mode="before")
+    @classmethod
+    def _source_source_type_aliases(cls, data: Any) -> Any:
+        """Accept legacy "source"/"source_type" keys as aliases for task/parent."""
+        if isinstance(data, dict):
+            data = dict(data)
+            if "source" in data and "task" not in data:
+                data["task"] = data.pop("source")
+            if "source_type" in data and "parent" not in data:
+                data["parent"] = data.pop("source_type")
+        return data
+
     events: list[Event] = Field(default_factory=list)
     start_time: datetime.datetime = Field(default_factory=datetime.datetime.now)
     end_time: datetime.datetime = Field(default_factory=datetime.datetime.now)
@@ -107,14 +127,24 @@ def duration(self) -> Optional[str]:
 
     @property
     def source(self) -> str:
-        """Task/source name (alias for task for error-scraper
compatibility).""" + """Task/source name.""" return self.task or "" + @source.setter + def source(self, value: str) -> None: + """Set task from source.""" + self.task = value if value else None + @property def source_type(self) -> str: - """Task/source type (alias for parent for error-scraper compatibility).""" + """Task/source type.""" return self.parent or "" + @source_type.setter + def source_type(self, value: str) -> None: + """Set parent from source_type.""" + self.parent = value if value else None + @property def summary_dict(self) -> dict: """Summary dict for logging/display (task_name, task_type, task_result, event_count, duration).""" @@ -145,7 +175,14 @@ def log_result(self, log_path: str) -> None: if isinstance(artifact, BaseFileArtifact): artifact.log_model(log_path) else: - name = f"{pascal_to_snake(artifact.__class__.__name__)}s" + name = ( + getattr( + artifact.__class__, + "ARTIFACT_LOG_BASENAME", + None, + ) + or f"{pascal_to_snake(artifact.__class__.__name__)}s" + ) if name in artifact_map: artifact_map[name].append(artifact.model_dump(mode="json")) else: @@ -208,6 +245,16 @@ def _get_event_summary(self) -> str: return "; ".join(summary_parts) + def _get_event_priorities(self) -> str: + warnings = sum(1 for e in self.events if e.priority == EventPriority.WARNING) + errors = sum(1 for e in self.events if e.priority >= EventPriority.ERROR) + parts = [] + if warnings: + parts.append(f"{warnings} warnings") + if errors: + parts.append(f"{errors} errors") + return "|".join(parts) + def _update_status(self) -> None: """Update overall status based on event priority""" self.status = ExecutionStatus.OK diff --git a/nodescraper/pluginexecutor.py b/nodescraper/pluginexecutor.py index d03010c6..a8da102b 100644 --- a/nodescraper/pluginexecutor.py +++ b/nodescraper/pluginexecutor.py @@ -96,12 +96,38 @@ def __init__( self.logger.info("System Platform: %s", self.system_info.platform) self.logger.info("System location: %s", self.system_info.location) + 
@staticmethod
+    def _deep_merge_plugin_args(existing: dict, incoming: dict) -> dict:
+        """Merge incoming plugin args into existing without letting empty dicts
+        overwrite populated args, e.g. when --plugin-configs and run-plugins are used in the same command."""
+        result = dict(existing)
+        for k, v in incoming.items():
+            if v is None:
+                continue
+            if k in ("collection_args", "analysis_args") and isinstance(v, dict):
+                existing_sub = result.get(k)
+                if isinstance(existing_sub, dict) and v:
+                    result[k] = {**existing_sub, **v}
+                elif v:
+                    result[k] = dict(v)
+            else:
+                result[k] = v
+        return result
+
     @staticmethod
     def merge_configs(plugin_configs: list[PluginConfig]) -> PluginConfig:
         merged_config = PluginConfig()
         for config in plugin_configs:
             merged_config.global_args.update(config.global_args)
-            merged_config.plugins.update(config.plugins)
+            for plugin_name, plugin_args in config.plugins.items():
+                if plugin_name in merged_config.plugins and plugin_args:
+                    merged_config.plugins[plugin_name] = PluginExecutor._deep_merge_plugin_args(
+                        merged_config.plugins[plugin_name], plugin_args
+                    )
+                elif plugin_name in merged_config.plugins:
+                    pass  # don't overwrite with empty args from the run-plugins subparser
+                else:
+                    merged_config.plugins[plugin_name] = dict(plugin_args)
             merged_config.result_collators.update(config.result_collators)
         return merged_config
diff --git a/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py b/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py
index d3db023f..ffe86cd7 100644
--- a/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py
+++ b/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py
@@ -35,7 +35,6 @@
     AmdSmiDataModel,
     AmdSmiMetric,
     AmdSmiStatic,
-    AmdSmiTstData,
     EccData,
     Fw,
     Partition,
@@ -47,7 +46,7 @@
 class AmdSmiAnalyzer(CperAnalysisTaskMixin, DataAnalyzer[AmdSmiDataModel, None]):
-    """Check AMD SMI Application data for PCIe, ECC errors, CPER data, and analyze amdsmitst metrics"""
+    """Check AMD SMI Application data for PCIe, ECC errors, and CPER data."""
 
     DATA_MODEL
= AmdSmiDataModel @@ -667,6 +666,9 @@ def check_expected_xgmi_link_speed( ) return + expected_str = ", ".join(str(s) for s in expected_xgmi_speed) + mismatches: list[dict] = [] + for xgmi_data in xgmi_metric: link_metric = xgmi_data.link_metrics try: @@ -700,32 +702,26 @@ def check_expected_xgmi_link_speed( continue if xgmi_float not in expected_xgmi_speed: - self._log_event( - category=EventCategory.IO, - description="XGMI link speed is not as expected", - priority=EventPriority.ERROR, - data={ + mismatches.append( + { "gpu": xgmi_data.gpu, - "xgmi_bit_rate": xgmi_float, - "expected_xgmi_speed": expected_xgmi_speed, - }, - console_log=True, + "actual_gt_s": xgmi_float, + "expected_gt_s": expected_str, + } ) - def check_amdsmitst(self, amdsmitst_data: AmdSmiTstData): - """Check AMD SMI test results - - Args: - amdsmitst_data (AmdSmiTstData): AMD SMI test data - """ - if amdsmitst_data.failed_test_count > 0: + if mismatches: + details = "; ".join( + f"GPU {m['gpu']} {m['actual_gt_s']} GT/s (expected {m['expected_gt_s']})" + for m in mismatches + ) self._log_event( - category=EventCategory.APPLICATION, - description=f"{amdsmitst_data.failed_test_count} failed tests running amdsmitst", + category=EventCategory.IO, + description=f"XGMI link speed is not as expected: {details}", priority=EventPriority.ERROR, data={ - "failed_test_count": amdsmitst_data.failed_test_count, - "failed_tests": amdsmitst_data.failed_tests, + "expected_gt_s": expected_str, + "mismatches": mismatches, }, console_log=True, ) @@ -815,7 +811,4 @@ def analyze_data( data.xgmi_metric, expected_xgmi_speed=args.expected_xgmi_speed ) - if data.amdsmitst_data and data.amdsmitst_data.failed_test_count > 0: - self.check_amdsmitst(data.amdsmitst_data) - return self.result diff --git a/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py b/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py index 33a36616..860c0e0f 100644 --- a/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py +++ 
b/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py @@ -29,7 +29,7 @@ from tarfile import TarFile from typing import Any, Dict, List, Optional, Union -from pydantic import ValidationError +from pydantic import BaseModel, ValidationError from nodescraper.base.inbandcollectortask import InBandDataCollector from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus, OSFamily @@ -38,8 +38,10 @@ from nodescraper.plugins.inband.amdsmi.amdsmidata import ( AmdSmiDataModel, AmdSmiListItem, + AmdSmiMetric, AmdSmiStatic, AmdSmiVersion, + BadPages, EccState, Fw, FwListItem, @@ -65,7 +67,10 @@ StaticVbios, StaticVram, StaticXgmiPlpd, + Topo, ValueUnit, + XgmiLinks, + XgmiMetrics, ) from nodescraper.plugins.inband.amdsmi.collector_args import AmdSmiCollectorArgs from nodescraper.utils import get_exception_traceback @@ -87,6 +92,11 @@ class AmdSmiCollector(InBandDataCollector[AmdSmiDataModel, AmdSmiCollectorArgs]) CMD_FIRMWARE = "firmware --json" CMD_STATIC = "static -g all --json" CMD_STATIC_GPU = "static -g {gpu_id} --json" + CMD_TOPOLOGY = "topology" + CMD_METRIC = "metric -g all" + CMD_BAD_PAGES = "bad-pages" + CMD_XGMI_METRIC = "xgmi -m" + CMD_XGMI_LINK = "xgmi -l" CMD_RAS = "ras --cper --folder={folder}" CMD_RAS_AFID = "ras --afid --cper-file {cper_file}" @@ -317,10 +327,125 @@ def _normalize(self, val: object, default: str = "unknown", slot_type: bool = Fa if u == "CEM": return "CEM" return "Unknown" - return s - def _get_amdsmi_data(self) -> Optional[AmdSmiDataModel]: + def _build_amdsmi_sub_data( + self, + model_class: type[BaseModel], + json_data: Optional[Union[dict, list]], + *, + model_name: Optional[str] = None, + ) -> Optional[Union[list, Any]]: + """Build list or single instance from amd-smi JSON using a Pydantic model. + + Args: + model_class: Pydantic model class (e.g. Topo, BadPages, AmdSmiMetric). + json_data: Raw dict or list from amd-smi --json. + model_name: Optional name for logging (defaults to model_class.__name__). 
+ + Returns: + List of model instances, single instance, or None on error. + """ + name = model_name or model_class.__name__ + if json_data is None: + return None + try: + if isinstance(json_data, list): + out: List[Any] = [] + for item in json_data: + if not isinstance(item, dict): + continue + try: + out.append(model_class.model_validate(item)) + except ValidationError as err: + self._log_event( + category=EventCategory.APPLICATION, + description=f"Failed to build {name} entry; skipping", + data={ + "errors": err.errors(include_url=False), + "item_keys": list(item.keys()), + }, + priority=EventPriority.WARNING, + ) + return out + if isinstance(json_data, dict): + return model_class.model_validate(json_data) + return None + except ValidationError as err: + self._log_event( + category=EventCategory.APPLICATION, + description=f"Failed to build {name}", + data={"errors": err.errors(include_url=False)}, + priority=EventPriority.WARNING, + ) + return None + + def get_topology(self) -> Optional[List[Topo]]: + """Get topology from amd-smi topology --json.""" + ret = self._run_amd_smi_dict(self.CMD_TOPOLOGY) + if ret is None: + return [] + if isinstance(ret, dict) and "gpu_data" in ret: + ret = ret["gpu_data"] + data = ret if isinstance(ret, list) else [ret] + built = self._build_amdsmi_sub_data(Topo, data) + return built if isinstance(built, list) else ([built] if built else []) + + def get_bad_pages(self) -> Optional[List[BadPages]]: + """Get bad pages from amd-smi bad-pages --json.""" + ret = self._run_amd_smi_dict(self.CMD_BAD_PAGES) + if ret is None: + return [] + data = ret if isinstance(ret, list) else [ret] + built = self._build_amdsmi_sub_data(BadPages, data) + return built if isinstance(built, list) else ([built] if built else []) + + def get_metric(self) -> Optional[List[AmdSmiMetric]]: + """Get metrics from amd-smi metric -g all --json.""" + ret = self._run_amd_smi_dict(self.CMD_METRIC) + if ret is None: + return [] + if isinstance(ret, dict) and "gpu_data" in 
ret: + ret = ret["gpu_data"] + data = ret if isinstance(ret, list) else [ret] + built = self._build_amdsmi_sub_data(AmdSmiMetric, data) + return built if isinstance(built, list) else ([built] if built else []) + + def get_xgmi_data( + self, + ) -> tuple[List[XgmiMetrics], List[XgmiLinks]]: + """Get XGMI metric and link data from amd-smi xgmi -m and xgmi -l.""" + xgmi_metric_raw = self._run_amd_smi_dict(self.CMD_XGMI_METRIC) + xgmi_metrics: Optional[List[XgmiMetrics]] = [] + if xgmi_metric_raw is not None: + if isinstance(xgmi_metric_raw, dict) and "xgmi_metric" in xgmi_metric_raw: + xgmi_metric_raw = xgmi_metric_raw["xgmi_metric"] + if isinstance(xgmi_metric_raw, list) and len(xgmi_metric_raw) == 1: + xgmi_metric_raw = xgmi_metric_raw[0] + data_m = ( + xgmi_metric_raw + if isinstance(xgmi_metric_raw, list) + else ([xgmi_metric_raw] if isinstance(xgmi_metric_raw, dict) else []) + ) + built_m = self._build_amdsmi_sub_data(XgmiMetrics, data_m) + xgmi_metrics = built_m if isinstance(built_m, list) else ([built_m] if built_m else []) + + xgmi_link_raw = self._run_amd_smi_dict(self.CMD_XGMI_LINK) + xgmi_links: Optional[List[XgmiLinks]] = [] + if isinstance(xgmi_link_raw, dict) and "link_status" in xgmi_link_raw: + link_list = xgmi_link_raw.get("link_status") + if isinstance(link_list, list): + xgmi_links = self._build_amdsmi_sub_data(XgmiLinks, link_list) + xgmi_links = xgmi_links if isinstance(xgmi_links, list) else [] + elif isinstance(xgmi_link_raw, list): + xgmi_links = self._build_amdsmi_sub_data(XgmiLinks, xgmi_link_raw) + xgmi_links = xgmi_links if isinstance(xgmi_links, list) else [] + + return xgmi_metrics or [], xgmi_links or [] + + def _get_amdsmi_data( + self, args: Optional[AmdSmiCollectorArgs] = None + ) -> Optional[AmdSmiDataModel]: """Fill in information for AmdSmi data model Returns: @@ -333,6 +458,10 @@ def _get_amdsmi_data(self) -> Optional[AmdSmiDataModel]: firmware = self.get_firmware() gpu_list = self.get_gpu_list() statics = self.get_static() + 
topology = self.get_topology() + metric = self.get_metric() + bad_pages = self.get_bad_pages() + xgmi_metric, xgmi_link = self.get_xgmi_data() cper_data, cper_afids = self.get_cper_data() except Exception as e: self._log_event( @@ -353,6 +482,11 @@ def _get_amdsmi_data(self) -> Optional[AmdSmiDataModel]: partition=partition, firmware=firmware, static=statics, + topology=topology or [], + metric=metric or [], + bad_pages=bad_pages or [], + xgmi_metric=xgmi_metric or [], + xgmi_link=xgmi_link or [], cper_data=cper_data, cper_afids=cper_afids, ) @@ -1348,7 +1482,7 @@ def collect_data( self.logger.info("amd-smi version: %s", version.version) self.logger.info("ROCm version: %s", version.rocm_version) - amd_smi_data = self._get_amdsmi_data() + amd_smi_data = self._get_amdsmi_data(args) if amd_smi_data is None: return self.result, None diff --git a/nodescraper/plugins/inband/amdsmi/amdsmidata.py b/nodescraper/plugins/inband/amdsmi/amdsmidata.py index d1278bc1..04ff545f 100644 --- a/nodescraper/plugins/inband/amdsmi/amdsmidata.py +++ b/nodescraper/plugins/inband/amdsmi/amdsmidata.py @@ -471,10 +471,19 @@ class PageData(BaseModel): value: Optional[int] +def _bad_pages_retired_list(v: object) -> list[PageData]: + """Coerce 'No bad pages found.' 
to empty list.""" + if v == "No bad pages found.": + return [] + return v # type: ignore[return-value] + + class BadPages(BaseModel): gpu: int retired: list[PageData] + _retired_validator = field_validator("retired", mode="before")(_bad_pages_retired_list) + # Metric Data class MetricUsage(BaseModel): @@ -653,6 +662,8 @@ class MetricThrottleVu(BaseModel): value: Optional[dict[str, list[Union[int, str]]]] = Field(deprecated=True, default=None) unit: str = Field(deprecated=True, default="") + _value_na = field_validator("value", mode="before")(na_to_none) + class MetricThrottle(AmdSmiBaseModel): accumulation_counter: Optional[Union[MetricThrottleVu, ValueUnit]] = None @@ -806,6 +817,7 @@ class LinkStatusTable(Enum): UP = "U" DOWN = "D" DISABLED = "X" + SELF = "SELF" class BiDirectionalTable(Enum): @@ -915,17 +927,6 @@ class Topo(BaseModel): links: list[TopoLink] -class AmdSmiTstData(BaseModel): - "Summary of amdsmitst results, with list and count of passing/skipped/failed tests" - - passed_tests: list[str] = Field(default_factory=list) - skipped_tests: list[str] = Field(default_factory=list) - failed_tests: list[str] = Field(default_factory=list) - passed_test_count: int = 0 - skipped_test_count: int = 0 - failed_test_count: int = 0 - - class AmdSmiDataModel(DataModel): """Data model for amd-smi data. 
@@ -955,7 +956,6 @@ class AmdSmiDataModel(DataModel): xgmi_link: Optional[list[XgmiLinks]] = Field(default_factory=list) cper_data: Optional[list[FileModel]] = Field(default_factory=list) cper_afids: dict[str, int] = Field(default_factory=dict) - amdsmitst_data: AmdSmiTstData = Field(default_factory=AmdSmiTstData) def get_list(self, gpu: int) -> Optional[AmdSmiListItem]: """Get the gpu list item for the given gpu id.""" diff --git a/nodescraper/plugins/inband/fabrics/fabrics_collector.py b/nodescraper/plugins/inband/fabrics/fabrics_collector.py index 26a846fe..59c884de 100644 --- a/nodescraper/plugins/inband/fabrics/fabrics_collector.py +++ b/nodescraper/plugins/inband/fabrics/fabrics_collector.py @@ -38,9 +38,7 @@ MstDevice, MstStatus, OfedInfo, - RdmaDevice, - RdmaInfo, - RdmaLink, + SlingshotData, ) @@ -54,8 +52,11 @@ class FabricsCollector(InBandDataCollector[FabricsDataModel, None]): CMD_OFED_INFO = "ofed_info -s" CMD_MST_START = "mst start" CMD_MST_STATUS = "mst status -v" - CMD_RDMA_DEV = "rdma dev" - CMD_RDMA_LINK = "rdma link" + CMD_CASSINI_PCI = "lspci | grep -i cassini" + CMD_NET_LINK = "ip link show" + CMD_LIBFABRIC_INFO = "fi_info -p cxi" + CMD_CXI_STAT = "cxi_stat" + CMD_CXI_MODULES = "lsmod | grep cxi" def _parse_ibstat(self, output: str) -> List[IbstatDevice]: """Parse 'ibstat' output into IbstatDevice objects. @@ -396,128 +397,6 @@ def _parse_mst_status(self, output: str) -> MstStatus: mst_status.devices = devices return mst_status - def _parse_rdma_dev(self, output: str) -> List[RdmaDevice]: - """Parse 'rdma dev' output into RdmaDevice objects. 
- - Args: - output: Raw output from 'rdma dev' command - - Returns: - List of RdmaDevice objects - """ - devices = [] - - for line in output.splitlines(): - line = line.strip() - if not line: - continue - - # Example InfiniBand format: 0: mlx5_0: node_type ca fw 16.28.2006 node_guid 0c42:a103:00b3:bfa0 sys_image_guid 0c42:a103:00b3:bfa0 - # Example RoCE format: 0: rocep9s0: node_type ca fw 1.117.1-a-63 node_guid 0690:81ff:fe4a:6c40 sys_image_guid 0690:81ff:fe4a:6c40 - parts = line.split() - if len(parts) < 2: - continue - - # First part might be index followed by colon - device_name = None - start_idx = 0 - - if parts[0].endswith(":"): - # Skip index (e.g., "0:") - start_idx = 1 - - if start_idx < len(parts): - device_name = parts[start_idx].rstrip(":") - start_idx += 1 - - if not device_name: - continue - - device = RdmaDevice(device=device_name) - - # Parse remaining attributes - i = start_idx - while i < len(parts): - if parts[i] == "node_type" and i + 1 < len(parts): - device.node_type = parts[i + 1] - i += 2 - elif parts[i] == "fw" and i + 1 < len(parts): - device.attributes["fw_version"] = parts[i + 1] - i += 2 - elif parts[i] == "node_guid" and i + 1 < len(parts): - device.node_guid = parts[i + 1] - i += 2 - elif parts[i] == "sys_image_guid" and i + 1 < len(parts): - device.sys_image_guid = parts[i + 1] - i += 2 - elif parts[i] == "state" and i + 1 < len(parts): - device.state = parts[i + 1] - i += 2 - else: - # Store as generic attribute - if i + 1 < len(parts) and not parts[i + 1].startswith("-"): - device.attributes[parts[i]] = parts[i + 1] - i += 2 - else: - i += 1 - - devices.append(device) - - return devices - - def _parse_rdma_link(self, output: str) -> List[RdmaLink]: - """Parse 'rdma link' output into RdmaLink objects. 
- - Args: - output: Raw output from 'rdma link' command - - Returns: - List of RdmaLink objects - """ - links = [] - - for line in output.splitlines(): - line = line.strip() - if not line: - continue - - # Example InfiniBand format: link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ib0 - # Example RoCE format: link rocep9s0/1 state DOWN physical_state POLLING netdev benic8p1 - # Example alternate format: 0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP - match = re.search(r"(\S+)/(\d+)", line) - if not match: - continue - - device_name = match.group(1) - port = int(match.group(2)) - - link = RdmaLink(device=device_name, port=port) - - # Parse remaining attributes - parts = line.split() - i = 0 - while i < len(parts): - if parts[i] == "state" and i + 1 < len(parts): - link.state = parts[i + 1] - i += 2 - elif parts[i] == "physical_state" and i + 1 < len(parts): - link.physical_state = parts[i + 1] - i += 2 - elif parts[i] == "netdev" and i + 1 < len(parts): - link.netdev = parts[i + 1] - i += 2 - else: - # Store as generic attribute if it's a key-value pair - if i + 1 < len(parts) and not parts[i + 1].startswith("-"): - link.attributes[parts[i]] = parts[i + 1] - i += 2 - else: - i += 1 - - links.append(link) - - return links - def collect_data( self, args=None, @@ -533,7 +412,7 @@ def collect_data( ibdev_netdev_mappings = [] ofed_info = None mst_status = None - rdma_info = None + slingshot_data = None # Collect ibstat information res_ibstat = self._run_sut_cmd(self.CMD_IBSTAT) @@ -650,49 +529,60 @@ def collect_data( priority=EventPriority.INFO, ) - # Collect RDMA device information - rdma_devices = [] - res_rdma_dev = self._run_sut_cmd(self.CMD_RDMA_DEV) - if res_rdma_dev.exit_code == 0: - rdma_devices = self._parse_rdma_dev(res_rdma_dev.stdout) - self._log_event( - category=EventCategory.NETWORK, - description=f"Collected {len(rdma_devices)} RDMA devices", - priority=EventPriority.INFO, - ) - else: - self._log_event( - category=EventCategory.NETWORK, - 
description="Error collecting RDMA device information", - data={"command": res_rdma_dev.command, "exit_code": res_rdma_dev.exit_code}, - priority=EventPriority.WARNING, - ) + # Slingshot fallback path: + # if no InfiniBand data was collected, probe for Cassini/CXI fabric. + ib_data_collected = bool(ibstat_devices or ibv_devices or ibdev_netdev_mappings) + if not ib_data_collected: + res_cassini = self._run_sut_cmd(self.CMD_CASSINI_PCI) + cassini_detected = res_cassini.exit_code == 0 and bool(res_cassini.stdout.strip()) - # Collect RDMA link information - rdma_links = [] - res_rdma_link = self._run_sut_cmd(self.CMD_RDMA_LINK) - if res_rdma_link.exit_code == 0: - rdma_links = self._parse_rdma_link(res_rdma_link.stdout) - self._log_event( - category=EventCategory.NETWORK, - description=f"Collected {len(rdma_links)} RDMA links", - priority=EventPriority.INFO, - ) - else: - self._log_event( - category=EventCategory.NETWORK, - description="Error collecting RDMA link information", - data={"command": res_rdma_link.command, "exit_code": res_rdma_link.exit_code}, - priority=EventPriority.WARNING, - ) + if cassini_detected: + self._log_event( + category=EventCategory.NETWORK, + description="Detected Slingshot/Cassini fabrics hardware", + priority=EventPriority.INFO, + ) - # Combine RDMA information - if rdma_devices or rdma_links: - rdma_info = RdmaInfo( - devices=rdma_devices, - links=rdma_links, - raw_output=res_rdma_dev.stdout + "\n" + res_rdma_link.stdout, - ) + res_net_link = self._run_sut_cmd(self.CMD_NET_LINK) + res_libfabric = self._run_sut_cmd(self.CMD_LIBFABRIC_INFO) + res_cxi_stat = self._run_sut_cmd(self.CMD_CXI_STAT) + res_cxi_modules = self._run_sut_cmd(self.CMD_CXI_MODULES) + + slingshot_data = SlingshotData( + cassini_pci=res_cassini.stdout, + net_link=res_net_link.stdout if res_net_link.exit_code == 0 else None, + libfabric_info=res_libfabric.stdout if res_libfabric.exit_code == 0 else None, + cxi_stat=res_cxi_stat.stdout if res_cxi_stat.exit_code == 0 else 
None, + cxi_modules=res_cxi_modules.stdout if res_cxi_modules.exit_code == 0 else None, + ) + + failed_cmds = [] + for cmd_name, cmd_res in ( + ("ip link show", res_net_link), + ("fi_info -p cxi", res_libfabric), + ("cxi_stat", res_cxi_stat), + ("lsmod | grep cxi", res_cxi_modules), + ): + if cmd_res.exit_code != 0: + failed_cmds.append(cmd_name) + + if failed_cmds: + self._log_event( + category=EventCategory.NETWORK, + description="Some Slingshot commands failed", + data={"failed_commands": failed_cmds}, + priority=EventPriority.WARNING, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description="No Slingshot/Cassini hardware detected on this system", + data={ + "command": res_cassini.command, + "exit_code": res_cassini.exit_code, + }, + priority=EventPriority.INFO, + ) # Build the data model only if we collected any data if ( @@ -701,7 +591,7 @@ def collect_data( or ibdev_netdev_mappings or ofed_info or mst_status - or rdma_info + or slingshot_data ): fabrics_data = FabricsDataModel( ibstat_devices=ibstat_devices, @@ -709,18 +599,20 @@ def collect_data( ibdev_netdev_mappings=ibdev_netdev_mappings, ofed_info=ofed_info, mst_status=mst_status, - rdma_info=rdma_info, + slingshot_data=slingshot_data, ) self.result.message = ( f"Collected fabrics data: {len(ibstat_devices)} ibstat devices, " f"{len(ibv_devices)} ibv devices, {len(ibdev_netdev_mappings)} mappings, " f"OFED: {ofed_info.version if ofed_info else 'N/A'}, " f"MST devices: {len(mst_status.devices) if mst_status else 0}, " - f"RDMA devices: {len(rdma_info.devices) if rdma_info else 0}" + f"Slingshot: {'detected' if slingshot_data else 'not detected'}" ) self.result.status = ExecutionStatus.OK return self.result, fabrics_data else: - self.result.message = "No InfiniBand/RDMA fabrics hardware detected on this system" + self.result.message = ( + "No InfiniBand/RDMA or Slingshot fabrics hardware detected on this system" + ) self.result.status = ExecutionStatus.NOT_RAN return self.result, None 
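The Slingshot fallback path above (probe for a Cassini NIC via lspci, then gather the CXI command outputs, keeping stdout only for commands that succeed) can be sketched independently of the collector plumbing. This is a minimal illustration, not the plugin's actual code: `probe_slingshot`, `SLINGSHOT_CMDS`, and the `run` callable are hypothetical names, with `run` standing in for the collector's `_run_sut_cmd` and returning any object exposing `.exit_code` and `.stdout`. The command strings are the ones used in the diff.

```python
# Optional commands probed after Cassini detection; names mirror the
# SlingshotData fields in the diff, values are the shell commands it runs.
SLINGSHOT_CMDS = {
    "net_link": "ip link show",
    "libfabric_info": "fi_info -p cxi",
    "cxi_stat": "cxi_stat",
    "cxi_modules": "lsmod | grep cxi",
}


def probe_slingshot(run):
    """Return a dict of Slingshot command outputs, or None without Cassini HW.

    `run` is any callable mapping a shell command string to an object with
    `.exit_code` and `.stdout` (a stand-in for the collector's _run_sut_cmd).
    """
    res = run("lspci | grep -i cassini")
    # No Cassini device on the PCI bus: skip the Slingshot path entirely,
    # matching the collector's "not detected" branch.
    if res.exit_code != 0 or not res.stdout.strip():
        return None
    data = {"cassini_pci": res.stdout}
    for field, cmd in SLINGSHOT_CMDS.items():
        r = run(cmd)
        # Keep stdout only on success, mirroring the collector's
        # per-field None fallback for failed commands.
        data[field] = r.stdout if r.exit_code == 0 else None
    return data
```

A failed optional command (here `fi_info -p cxi`) yields `None` for its field rather than aborting collection, which is the same "collect what you can, warn about the rest" behavior the collector's `failed_cmds` logging implements.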
diff --git a/nodescraper/plugins/inband/fabrics/fabricsdata.py b/nodescraper/plugins/inband/fabrics/fabricsdata.py index 01061b3c..c74b0dd3 100644 --- a/nodescraper/plugins/inband/fabrics/fabricsdata.py +++ b/nodescraper/plugins/inband/fabrics/fabricsdata.py @@ -96,35 +96,14 @@ class MstStatus(BaseModel): raw_output: str = "" # Raw command output -class RdmaDevice(BaseModel): - """RDMA device information from rdma command""" +class SlingshotData(BaseModel): + """Slingshot/Cassini fabrics command outputs""" - device: str # Device name (e.g., "mlx5_0") - node_type: Optional[str] = None # Node type - transport: Optional[str] = None # Transport type - node_guid: Optional[str] = None # Node GUID - sys_image_guid: Optional[str] = None # System image GUID - state: Optional[str] = None # Device state - attributes: Dict[str, str] = Field(default_factory=dict) # Additional attributes - - -class RdmaLink(BaseModel): - """RDMA link information""" - - device: str # Device name - port: int # Port number - state: Optional[str] = None # Link state - physical_state: Optional[str] = None # Physical state - netdev: Optional[str] = None # Associated network device - attributes: Dict[str, str] = Field(default_factory=dict) # Additional attributes - - -class RdmaInfo(BaseModel): - """Complete RDMA information from rdma command""" - - devices: List[RdmaDevice] = Field(default_factory=list) # RDMA devices - links: List[RdmaLink] = Field(default_factory=list) # RDMA links - raw_output: str = "" # Raw command output + cassini_pci: str = "" # Output of lspci Cassini probe + net_link: Optional[str] = None # Output of ip link show + libfabric_info: Optional[str] = None # Output of fi_info -p cxi + cxi_stat: Optional[str] = None # Output of cxi_stat + cxi_modules: Optional[str] = None # Output of lsmod | grep cxi class FabricsDataModel(DataModel): @@ -137,4 +116,4 @@ class FabricsDataModel(DataModel): ) # ibdev2netdev output ofed_info: Optional[OfedInfo] = None # OFED version info mst_status: 
Optional[MstStatus] = None # MST status - rdma_info: Optional[RdmaInfo] = None # RDMA information + slingshot_data: Optional[SlingshotData] = None # Slingshot/Cassini command outputs diff --git a/nodescraper/plugins/inband/network/network_collector.py b/nodescraper/plugins/inband/network/network_collector.py index 4a87936a..d530bd98 100644 --- a/nodescraper/plugins/inband/network/network_collector.py +++ b/nodescraper/plugins/inband/network/network_collector.py @@ -27,30 +27,17 @@ from typing import Dict, List, Optional, Tuple from nodescraper.base import InBandDataCollector +from nodescraper.connection.inband import TextFileArtifact from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus, OSFamily from nodescraper.models import TaskResult from .collector_args import NetworkCollectorArgs from .networkdata import ( - BroadcomNicDevice, - BroadcomNicQos, - BroadcomNicQosAppEntry, EthtoolInfo, IpAddress, Neighbor, NetworkDataModel, NetworkInterface, - PensandoNicCard, - PensandoNicDcqcn, - PensandoNicEnvironment, - PensandoNicPcieAts, - PensandoNicPort, - PensandoNicQos, - PensandoNicQosScheduling, - PensandoNicRdmaStatistic, - PensandoNicRdmaStatistics, - PensandoNicVersionFirmware, - PensandoNicVersionHostSoftware, Route, RoutingRule, ) @@ -65,6 +52,7 @@ class NetworkCollector(InBandDataCollector[NetworkDataModel, NetworkCollectorArg CMD_RULE = "ip rule show" CMD_NEIGHBOR = "ip neighbor show" CMD_ETHTOOL_TEMPLATE = "ethtool {interface}" + CMD_ETHTOOL_S_TEMPLATE = "ethtool -S {interface}" CMD_PING = "ping" CMD_WGET = "wget" CMD_CURL = "curl" @@ -73,21 +61,6 @@ class NetworkCollector(InBandDataCollector[NetworkDataModel, NetworkCollectorArg CMD_LLDPCLI_NEIGHBOR = "lldpcli show neighbor" CMD_LLDPCTL = "lldpctl" - # Broadcom NIC commands - CMD_NICCLI_LISTDEV = "niccli --list_devices" - CMD_NICCLI_GETQOS_TEMPLATE = "niccli --dev {device_num} qos --ets --show" - - # Pensando NIC commands - CMD_NICCTL_CARD = "nicctl show card" - CMD_NICCTL_DCQCN = 
"nicctl show dcqcn" - CMD_NICCTL_ENVIRONMENT = "nicctl show environment" - CMD_NICCTL_PCIE_ATS = "nicctl show pcie ats" - CMD_NICCTL_PORT = "nicctl show port" - CMD_NICCTL_QOS = "nicctl show qos" - CMD_NICCTL_RDMA_STATISTICS = "nicctl show rdma statistics" - CMD_NICCTL_VERSION_HOST_SOFTWARE = "nicctl show version host-software" - CMD_NICCTL_VERSION_FIRMWARE = "nicctl show version firmware" - def _parse_ip_addr(self, output: str) -> List[NetworkInterface]: """Parse 'ip addr show' output into NetworkInterface objects. @@ -468,919 +441,37 @@ def _parse_ethtool(self, interface: str, output: str) -> EthtoolInfo: return ethtool_info - def _parse_niccli_listdev(self, output: str) -> List[BroadcomNicDevice]: - """Parse 'niccli --list_devices' output into BroadcomNicDevice objects. - - Args: - output: Raw output from 'niccli --list_devices' command - - Returns: - List of BroadcomNicDevice objects - """ - devices = [] - current_device = None - - for line in output.splitlines(): - line_stripped = line.strip() - if not line_stripped: - continue - - # Check if this is a device header line - match = re.match(r"^(\d+)\s*\)\s*(.+?)(?:\s+\((.+?)\))?$", line_stripped) - if match: - device_num_str = match.group(1) - model = match.group(2).strip() if match.group(2) else None - adapter_port = match.group(3).strip() if match.group(3) else None - - try: - device_num = int(device_num_str) - except ValueError: - continue - - current_device = BroadcomNicDevice( - device_num=device_num, - model=model, - adapter_port=adapter_port, - ) - devices.append(current_device) - - # Check for Device Interface Name line - elif "Device Interface Name" in line and current_device: - parts = line_stripped.split(":") - if len(parts) >= 2: - current_device.interface_name = parts[1].strip() - - # Check for MAC Address line - elif "MAC Address" in line and current_device: - parts = line_stripped.split(":") - if len(parts) >= 2: - # MAC address has colons, so rejoin the parts after first split - mac = 
":".join(parts[1:]).strip() - current_device.mac_address = mac - - # Check for PCI Address line - elif "PCI Address" in line and current_device: - parts = line_stripped.split(":") - if len(parts) >= 2: - # PCI address also has colons, rejoin - pci = ":".join(parts[1:]).strip() - current_device.pci_address = pci - - return devices - - def _parse_nicctl_card(self, output: str) -> List[PensandoNicCard]: - """Parse 'nicctl show card' output into PensandoNicCard objects. - - Args: - output: Raw output from 'nicctl show card' command - - Returns: - List of PensandoNicCard objects - """ - cards = [] - - # Skip header lines and separator lines - in_data_section = False - - for line in output.splitlines(): - line_stripped = line.strip() - if not line_stripped: - continue - - # Skip header line (starts with "Id") - if line_stripped.startswith("Id"): - in_data_section = True - continue - - # Skip separator lines (mostly dashes) - if re.match(r"^-+$", line_stripped): - continue - - # Parse data lines after header - if in_data_section: - # Split by whitespace - parts = line_stripped.split() - - # Expected format: Id PCIe_BDF ASIC F/W_partition Serial_number - if len(parts) >= 2: - card = PensandoNicCard( - id=parts[0], - pcie_bdf=parts[1], - asic=parts[2] if len(parts) > 2 else None, - fw_partition=parts[3] if len(parts) > 3 else None, - serial_number=parts[4] if len(parts) > 4 else None, - ) - cards.append(card) - - return cards - - def _parse_nicctl_dcqcn(self, output: str) -> List[PensandoNicDcqcn]: - """Parse 'nicctl show dcqcn' output into PensandoNicDcqcn objects. 
- - Args: - output: Raw output from 'nicctl show dcqcn' command - - Returns: - List of PensandoNicDcqcn objects - """ - dcqcn_entries = [] - current_entry = None - - for line in output.splitlines(): - line_stripped = line.strip() - if not line_stripped: - continue - - # Check for NIC line - if line_stripped.startswith("NIC :"): - # Save previous entry if exists - if current_entry: - dcqcn_entries.append(current_entry) - - # Parse NIC ID and PCIe BDF - # Format: "NIC : ()" - match = re.match( - r"NIC\s*:\s*([a-f0-9\-]+)\s*\(([0-9a-f:\.]+)\)", line_stripped, re.IGNORECASE - ) - if match: - nic_id = match.group(1) - pcie_bdf = match.group(2) - current_entry = PensandoNicDcqcn( - nic_id=nic_id, - pcie_bdf=pcie_bdf, - ) - continue - - # Skip separator lines (dashes or asterisks) - if re.match(r"^[-*]+$", line_stripped): - continue - - # Parse fields within current entry - if current_entry and ":" in line_stripped: - parts = line_stripped.split(":", 1) - if len(parts) == 2: - key = parts[0].strip() - value = parts[1].strip() - - if key == "Lif id": - current_entry.lif_id = value - elif key == "ROCE device": - current_entry.roce_device = value - elif key == "DCQCN profile id": - current_entry.dcqcn_profile_id = value - elif key == "Status": - current_entry.status = value - - # Add the last entry if exists - if current_entry: - dcqcn_entries.append(current_entry) - - return dcqcn_entries - - def _parse_nicctl_environment(self, output: str) -> List[PensandoNicEnvironment]: - """Parse 'nicctl show environment' output into PensandoNicEnvironment objects. 
- - Args: - output: Raw output from 'nicctl show environment' command - - Returns: - List of PensandoNicEnvironment objects - """ - environment_entries = [] - current_entry = None - - for line in output.splitlines(): - line_stripped = line.strip() - if not line_stripped: - continue - - # Check for NIC line - if line_stripped.startswith("NIC :"): - # Save previous entry if exists - if current_entry: - environment_entries.append(current_entry) - - # Parse NIC ID and PCIe BDF - # Format: "NIC : ()" - match = re.match( - r"NIC\s*:\s*([a-f0-9\-]+)\s*\(([0-9a-f:\.]+)\)", line_stripped, re.IGNORECASE - ) - if match: - nic_id = match.group(1) - pcie_bdf = match.group(2) - current_entry = PensandoNicEnvironment( - nic_id=nic_id, - pcie_bdf=pcie_bdf, - ) - continue - - # Skip separator lines (dashes) - if re.match(r"^-+$", line_stripped): - continue - - # Skip section headers (Power(W):, Temperature(C):, etc.) - if line_stripped.endswith("):"): - continue - - # Parse fields within current entry - if current_entry and ":" in line_stripped: - parts = line_stripped.split(":", 1) - if len(parts) == 2: - key = parts[0].strip() - value_str = parts[1].strip() - - # Try to parse the value as float - try: - value = float(value_str) - except ValueError: - continue - - # Map keys to fields - if key == "Total power drawn (pin)" or key == "Total power drawn": - current_entry.total_power_drawn = value - elif key == "Core power (pout1)" or key == "Core power": - current_entry.core_power = value - elif key == "ARM power (pout2)" or key == "ARM power": - current_entry.arm_power = value - elif key == "Local board temperature": - current_entry.local_board_temperature = value - elif key == "Die temperature": - current_entry.die_temperature = value - elif key == "Input voltage": - current_entry.input_voltage = value - elif key == "Core voltage": - current_entry.core_voltage = value - elif key == "Core frequency": - current_entry.core_frequency = value - elif key == "CPU frequency": - 
current_entry.cpu_frequency = value - elif key == "P4 stage frequency": - current_entry.p4_stage_frequency = value - - # Add the last entry if exists - if current_entry: - environment_entries.append(current_entry) - - return environment_entries - - def _parse_nicctl_pcie_ats(self, output: str) -> List[PensandoNicPcieAts]: - """Parse 'nicctl show pcie ats' output into PensandoNicPcieAts objects. - - Args: - output: Raw output from 'nicctl show pcie ats' command - - Returns: - List of PensandoNicPcieAts objects - """ - pcie_ats_entries = [] - - for line in output.splitlines(): - line_stripped = line.strip() - if not line_stripped: - continue - - # Parse line format: "NIC : () : " - if line_stripped.startswith("NIC :"): - match = re.match( - r"NIC\s*:\s*([a-f0-9\-]+)\s*\(([0-9a-f:\.]+)\)\s*:\s*(\w+)", - line_stripped, - re.IGNORECASE, - ) - if match: - nic_id = match.group(1) - pcie_bdf = match.group(2) - status = match.group(3) - entry = PensandoNicPcieAts( - nic_id=nic_id, - pcie_bdf=pcie_bdf, - status=status, - ) - pcie_ats_entries.append(entry) - - return pcie_ats_entries - - def _parse_nicctl_port(self, output: str) -> List[PensandoNicPort]: - """Parse 'nicctl show port' output into PensandoNicPort objects. + def _parse_ethtool_statistics(self, output: str, interface: str) -> Dict[str, str]: + """Parse 'ethtool -S <interface>' output into a key-value dictionary. 
Args: - output: Raw output from 'nicctl show port' command + output: Raw output from 'ethtool -S <interface>' command + interface: Name of the network interface (for netdev key) Returns: - List of PensandoNicPort objects + Dictionary of statistic name -> value (string) """ - port_entries = [] - current_entry = None - current_section = None # 'spec' or 'status' - current_nic_id = None - current_pcie_bdf = None - - for line in output.splitlines(): - line_stripped = line.strip() - if not line_stripped: - continue - - # Check for NIC line - if line_stripped.startswith("NIC") and ":" in line_stripped: - # Save previous entry if exists - if current_entry: - port_entries.append(current_entry) - current_entry = None - - # Parse NIC ID and PCIe BDF - match = re.match( - r"NIC\s*:\s*([a-f0-9\-]+)\s*\(([0-9a-f:\.]+)\)", line_stripped, re.IGNORECASE - ) - if match: - current_nic_id = match.group(1) - current_pcie_bdf = match.group(2) - continue - - # Check for Port line - if ( - line_stripped.startswith("Port") - and ":" in line_stripped - and current_nic_id - and current_pcie_bdf - ): - # Save previous entry if exists - if current_entry: - port_entries.append(current_entry) - - # Parse Port ID and Port name - match = re.match( - r"Port\s*:\s*([a-f0-9\-]+)\s*\(([^\)]+)\)", line_stripped, re.IGNORECASE - ) - if match: - port_id = match.group(1) - port_name = match.group(2) - current_entry = PensandoNicPort( - nic_id=current_nic_id, - pcie_bdf=current_pcie_bdf, - port_id=port_id, - port_name=port_name, - ) - continue - - # Skip separator lines (dashes) - if re.match(r"^-+$", line_stripped): - continue - - # Check for section headers - if line_stripped.endswith(":"): - if line_stripped == "Spec:": - current_section = "spec" - elif line_stripped == "Status:": - current_section = "status" - continue - - # Parse fields within current entry and section - if current_entry and current_section and ":" in line_stripped: - parts = line_stripped.split(":", 1) - if len(parts) == 2: - key = 
parts[0].strip()
-                value = parts[1].strip()
-
-                if current_section == "spec":
-                    if key == "Ifindex":
-                        current_entry.spec_ifindex = value
-                    elif key == "Type":
-                        current_entry.spec_type = value
-                    elif key == "speed":
-                        current_entry.spec_speed = value
-                    elif key == "Admin state":
-                        current_entry.spec_admin_state = value
-                    elif key == "FEC type":
-                        current_entry.spec_fec_type = value
-                    elif key == "Pause type":
-                        current_entry.spec_pause_type = value
-                    elif key == "Number of lanes":
-                        try:
-                            current_entry.spec_num_lanes = int(value)
-                        except ValueError:
-                            pass
-                    elif key == "MTU":
-                        try:
-                            current_entry.spec_mtu = int(value)
-                        except ValueError:
-                            pass
-                    elif key == "TX pause":
-                        current_entry.spec_tx_pause = value
-                    elif key == "RX pause":
-                        current_entry.spec_rx_pause = value
-                    elif key == "Auto negotiation":
-                        current_entry.spec_auto_negotiation = value
-                elif current_section == "status":
-                    if key == "Physical port":
-                        try:
-                            current_entry.status_physical_port = int(value)
-                        except ValueError:
-                            pass
-                    elif key == "Operational status":
-                        current_entry.status_operational_status = value
-                    elif key == "Link FSM state":
-                        current_entry.status_link_fsm_state = value
-                    elif key == "FEC type":
-                        current_entry.status_fec_type = value
-                    elif key == "Cable type":
-                        current_entry.status_cable_type = value
-                    elif key == "Number of lanes":
-                        try:
-                            current_entry.status_num_lanes = int(value)
-                        except ValueError:
-                            pass
-                    elif key == "speed":
-                        current_entry.status_speed = value
-                    elif key == "Auto negotiation":
-                        current_entry.status_auto_negotiation = value
-                    elif key == "MAC ID":
-                        try:
-                            current_entry.status_mac_id = int(value)
-                        except ValueError:
-                            pass
-                    elif key == "MAC channel":
-                        try:
-                            current_entry.status_mac_channel = int(value)
-                        except ValueError:
-                            pass
-                    elif key == "MAC address":
-                        current_entry.status_mac_address = value
-                    elif key == "Transceiver type":
-                        current_entry.status_transceiver_type = value
-                    elif key == "Transceiver state":
-                        current_entry.status_transceiver_state = value
-                    elif key == "Transceiver PID":
-                        current_entry.status_transceiver_pid = value
-
-        # Add the last entry if exists
-        if current_entry:
-            port_entries.append(current_entry)
-
-        return port_entries
-
-    def _parse_nicctl_qos(self, output: str) -> List[PensandoNicQos]:
-        """Parse 'nicctl show qos' output into PensandoNicQos objects.
-
-        Args:
-            output: Raw output from 'nicctl show qos' command
-
-        Returns:
-            List of PensandoNicQos objects
-        """
-        qos_entries = []
-        current_entry = None
-        current_nic_id = None
-        current_pcie_bdf = None
-        in_scheduling_table = False
-
-        for line in output.splitlines():
-            line_stripped = line.strip()
-            if not line_stripped:
-                continue
-
-            # Check for NIC line: "NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0)"
-            if line_stripped.startswith("NIC") and ":" in line_stripped:
-                # Save previous entry if exists
-                if current_entry:
-                    qos_entries.append(current_entry)
-                    current_entry = None
-
-                # Parse NIC ID and PCIe BDF
-                match = re.match(
-                    r"NIC\s*:\s*([a-f0-9\-]+)\s*\(([0-9a-f:\.]+)\)", line_stripped, re.IGNORECASE
-                )
-                if match:
-                    current_nic_id = match.group(1)
-                    current_pcie_bdf = match.group(2)
-                    in_scheduling_table = False
-                continue
-
-            # Check for Port line: "Port : 0490814a-6c40-4242-4242-000011010000"
-            if (
-                line_stripped.startswith("Port")
-                and ":" in line_stripped
-                and current_nic_id
-                and current_pcie_bdf
-            ):
-                # Save previous entry if exists
-                if current_entry:
-                    qos_entries.append(current_entry)
-
-                # Parse Port ID
-                parts = line_stripped.split(":")
-                if len(parts) >= 2:
-                    port_id = parts[1].strip()
-                    current_entry = PensandoNicQos(
-                        nic_id=current_nic_id,
-                        pcie_bdf=current_pcie_bdf,
-                        port_id=port_id,
-                    )
-                    in_scheduling_table = False
-                continue
-
-            # Skip separator lines (dashes) but don't reset scheduling table flag
-            if re.match(r"^-+$", line_stripped):
-                continue
-
-            # Check for section headers
-            if current_entry:
-                # Classification type
-                if "Classification type" in line:
-                    parts = line_stripped.split(":")
-                    if len(parts) >= 2:
-                        current_entry.classification_type = parts[1].strip()
-
-                # DSCP bitmap
-                elif "DSCP bitmap" in line and "==>" in line:
-                    parts = line_stripped.split("==>")
-                    if len(parts) >= 2:
-                        bitmap_part = parts[0].split(":")
-                        if len(bitmap_part) >= 2:
-                            current_entry.dscp_bitmap = bitmap_part[1].strip()
-                        priority_part = parts[1].split(":")
-                        if len(priority_part) >= 2:
-                            try:
-                                current_entry.dscp_priority = int(priority_part[1].strip())
-                            except ValueError:
-                                pass
-
-                # DSCP range
-                elif line_stripped.startswith("DSCP") and "==>" in line and "bitmap" not in line:
-                    parts = line_stripped.split("==>")
-                    if len(parts) >= 2:
-                        dscp_part = parts[0].split(":")
-                        if len(dscp_part) >= 2:
-                            current_entry.dscp_range = dscp_part[1].strip()
-                        priority_part = parts[1].split(":")
-                        if len(priority_part) >= 2:
-                            try:
-                                current_entry.dscp_priority = int(priority_part[1].strip())
-                            except ValueError:
-                                pass
-
-                # PFC priority bitmap
-                elif "PFC priority bitmap" in line:
-                    parts = line_stripped.split(":")
-                    if len(parts) >= 2:
-                        current_entry.pfc_priority_bitmap = parts[1].strip()
-
-                # PFC no-drop priorities
-                elif "PFC no-drop priorities" in line:
-                    parts = line_stripped.split(":")
-                    if len(parts) >= 2:
-                        current_entry.pfc_no_drop_priorities = parts[1].strip()
-
-                # Scheduling table header
-                elif "Priority" in line and "Scheduling" in line:
-                    in_scheduling_table = True
-                    continue
-
-                # Parse scheduling table entries
-                elif in_scheduling_table and not line_stripped.startswith("---"):
-                    # Try to parse scheduling entry
-                    # Format: "0 DWRR 0 N/A"
-                    parts = line_stripped.split()
-                    if len(parts) >= 2:
-                        try:
-                            priority = int(parts[0])
-                            scheduling_type = parts[1] if len(parts) > 1 else None
-                            bandwidth = None
-                            rate_limit = None
-                            if len(parts) > 2:
-                                try:
-                                    bandwidth = int(parts[2])
-                                except ValueError:
-                                    pass
-                            if len(parts) > 3:
-                                rate_limit = parts[3]
-
-                            sched_entry = PensandoNicQosScheduling(
-                                priority=priority,
-                                scheduling_type=scheduling_type,
-                                bandwidth=bandwidth,
-                                rate_limit=rate_limit,
-                            )
-                            current_entry.scheduling.append(sched_entry)
-                        except (ValueError, IndexError):
-                            pass
-
-        # Add the last entry if exists
-        if current_entry:
-            qos_entries.append(current_entry)
-
-        return qos_entries
-
-    def _parse_nicctl_rdma_statistics(self, output: str) -> List[PensandoNicRdmaStatistics]:
-        """Parse 'nicctl show rdma statistics' output into PensandoNicRdmaStatistics objects.
-
-        Args:
-            output: Raw output from 'nicctl show rdma statistics' command
-
-        Returns:
-            List of PensandoNicRdmaStatistics objects
-        """
-        rdma_stats_entries = []
-        current_entry = None
-        in_statistics_table = False
-
+        stats_dict: Dict[str, str] = {}
         for line in output.splitlines():
-            line_stripped = line.strip()
-            if not line_stripped:
-                continue
-
-            # Check for NIC line: "NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0)"
-            if line_stripped.startswith("NIC") and ":" in line_stripped:
-                # Save previous entry if exists
-                if current_entry:
-                    rdma_stats_entries.append(current_entry)
-
-                # Parse NIC ID and PCIe BDF
-                match = re.match(
-                    r"NIC\s*:\s*([a-f0-9\-]+)\s*\(([0-9a-f:\.]+)\)", line_stripped, re.IGNORECASE
-                )
-                if match:
-                    nic_id = match.group(1)
-                    pcie_bdf = match.group(2)
-                    current_entry = PensandoNicRdmaStatistics(
-                        nic_id=nic_id,
-                        pcie_bdf=pcie_bdf,
-                    )
-                    in_statistics_table = False
-                continue
-
-            # Skip separator lines (dashes)
-            if re.match(r"^-+$", line_stripped):
-                continue
-
-            # Check for table header
-            if "Name" in line and "Count" in line:
-                in_statistics_table = True
-                continue
-
-            # Parse statistics entries
-            if current_entry and in_statistics_table:
-                # The format is: "Queue pair create 1"
-                # We need to split from the right to get the count
-                parts = line_stripped.rsplit(None, 1)  # Split from right, max 1 split
-                if len(parts) == 2:
-                    name = parts[0].strip()
-                    count_str = parts[1].strip()
-                    try:
-                        count = int(count_str)
-                        stat_entry = PensandoNicRdmaStatistic(
-                            name=name,
-                            count=count,
-                        )
-                        current_entry.statistics.append(stat_entry)
-                    except ValueError:
-                        pass
-
-        # Add the last entry if exists
-        if current_entry:
-            rdma_stats_entries.append(current_entry)
-
-        return rdma_stats_entries
-
-    def _parse_nicctl_version_host_software(
-        self, output: str
-    ) -> Optional[PensandoNicVersionHostSoftware]:
-        """Parse 'nicctl show version host-software' output into PensandoNicVersionHostSoftware object.
-
-        Args:
-            output: Raw output from 'nicctl show version host-software' command
-
-        Returns:
-            PensandoNicVersionHostSoftware object or None if no data found
-        """
-        version_info = PensandoNicVersionHostSoftware()
-        found_data = False
-
-        for line in output.splitlines():
-            line_stripped = line.strip()
-            if not line_stripped or ":" not in line_stripped:
-                continue
-
-            # Split on the first colon to get key and value
-            parts = line_stripped.split(":", 1)
-            if len(parts) != 2:
-                continue
-
-            key = parts[0].strip().lower()
-            value = parts[1].strip()
-
-            if "nicctl" in key:
-                version_info.nicctl = value
-                found_data = True
-            elif "ipc driver" in key or "ipc_driver" in key:
-                version_info.ipc_driver = value
-                found_data = True
-            elif "ionic driver" in key or "ionic_driver" in key:
-                version_info.ionic_driver = value
-                found_data = True
-
-        return version_info if found_data else None
-
-    def _parse_nicctl_version_firmware(self, output: str) -> List[PensandoNicVersionFirmware]:
-        """Parse 'nicctl show version firmware' output into PensandoNicVersionFirmware objects.
-
-        Args:
-            output: Raw output from 'nicctl show version firmware' command
-
-        Returns:
-            List of PensandoNicVersionFirmware objects
-        """
-        firmware_entries = []
-        current_entry = None
-
-        for line in output.splitlines():
-            line_stripped = line.strip()
-            if not line_stripped:
-                continue
-
-            # Skip separator lines (dashes)
-            if re.match(r"^-+$", line_stripped):
-                # Save previous entry when we hit a separator
-                if current_entry:
-                    firmware_entries.append(current_entry)
-                    current_entry = None
-                continue
-
-            # Check for NIC line
-            if line_stripped.startswith("NIC") and ":" in line_stripped:
-                # Save previous entry if exists
-                if current_entry:
-                    firmware_entries.append(current_entry)
-
-                # Parse NIC ID and PCIe BDF
-                match = re.match(
-                    r"NIC\s*:\s*([a-f0-9\-]+)\s*\(([0-9a-f:\.]+)\)", line_stripped, re.IGNORECASE
-                )
-                if match:
-                    nic_id = match.group(1)
-                    pcie_bdf = match.group(2)
-                    current_entry = PensandoNicVersionFirmware(
-                        nic_id=nic_id,
-                        pcie_bdf=pcie_bdf,
-                    )
-                continue
-
-            # Parse version fields
-            if current_entry and ":" in line_stripped:
-                parts = line_stripped.split(":", 1)
-                if len(parts) == 2:
-                    key = parts[0].strip().lower()
-                    value = parts[1].strip()
-
-                    if "cpld" in key:
-                        current_entry.cpld = value
-                    elif "boot0" in key:
-                        current_entry.boot0 = value
-                    elif "uboot-a" in key or "uboot_a" in key:
-                        current_entry.uboot_a = value
-                    elif "firmware-a" in key or "firmware_a" in key:
-                        current_entry.firmware_a = value
-                    elif (
-                        "device config-a" in key
-                        or "device_config_a" in key
-                        or "device config" in key
-                    ):
-                        current_entry.device_config_a = value
-
-        # Add the last entry if exists
-        if current_entry:
-            firmware_entries.append(current_entry)
-
-        return firmware_entries
-
-    def _parse_niccli_qos(self, device_num: int, output: str) -> BroadcomNicQos:
-        """Parse 'niccli --dev X qos --ets --show' output into BroadcomNicQos object.
-
-        Args:
-            device_num: Device number
-            output: Raw output from 'niccli --dev X qos --ets --show' command
-
-        Returns:
-            BroadcomNicQos object with parsed data
-        """
-        qos_info = BroadcomNicQos(device_num=device_num, raw_output=output)
-
-        current_app_entry = None
-
-        for line in output.splitlines():
-            line_stripped = line.strip()
-            if not line_stripped:
-                continue
-
-            # Parse PRIO_MAP: "PRIO_MAP: 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:2"
-            if "PRIO_MAP:" in line:
-                parts = line.split("PRIO_MAP:")
-                if len(parts) >= 2:
-                    prio_entries = parts[1].strip().split()
-                    for entry in prio_entries:
-                        if ":" in entry:
-                            prio, tc = entry.split(":")
-                            try:
-                                qos_info.prio_map[int(prio)] = int(tc)
-                            except ValueError:
-                                pass
-
-            # Parse TC Bandwidth: "TC Bandwidth: 50% 50% 0%"
-            elif "TC Bandwidth:" in line:
-                parts = line.split("TC Bandwidth:")
-                if len(parts) >= 2:
-                    bandwidth_entries = parts[1].strip().split()
-                    for bw in bandwidth_entries:
-                        bw_clean = bw.rstrip("%")
-                        try:
-                            qos_info.tc_bandwidth.append(int(bw_clean))
-                        except ValueError:
-                            pass
-
-            # Parse TSA_MAP: "TSA_MAP: 0:ets 1:ets 2:strict"
-            elif "TSA_MAP:" in line:
-                parts = line.split("TSA_MAP:")
-                if len(parts) >= 2:
-                    tsa_entries = parts[1].strip().split()
-                    for entry in tsa_entries:
-                        if ":" in entry:
-                            tc, tsa = entry.split(":", 1)
-                            try:
-                                qos_info.tsa_map[int(tc)] = tsa
-                            except ValueError:
-                                pass
-
-            # Parse PFC enabled: "PFC enabled: 3"
-            elif "PFC enabled:" in line:
-                parts = line.split("PFC enabled:")
-                if len(parts) >= 2:
-                    try:
-                        qos_info.pfc_enabled = int(parts[1].strip())
-                    except ValueError:
-                        pass
-
-            # Parse APP entries - detect start of new APP entry
-            elif line_stripped.startswith("APP#"):
-                # Save previous entry if exists
-                if current_app_entry:
-                    qos_info.app_entries.append(current_app_entry)
-                current_app_entry = BroadcomNicQosAppEntry()
-
-            # Parse Priority within APP entry
-            elif "Priority:" in line and current_app_entry is not None:
-                parts = line.split("Priority:")
-                if len(parts) >= 2:
-                    try:
-                        current_app_entry.priority = int(parts[1].strip())
-                    except ValueError:
-                        pass
-
-            # Parse Sel within APP entry
-            elif "Sel:" in line and current_app_entry is not None:
-                parts = line.split("Sel:")
-                if len(parts) >= 2:
-                    try:
-                        current_app_entry.sel = int(parts[1].strip())
-                    except ValueError:
-                        pass
-
-            # Parse DSCP within APP entry
-            elif "DSCP:" in line and current_app_entry is not None:
-                parts = line.split("DSCP:")
-                if len(parts) >= 2:
-                    try:
-                        current_app_entry.dscp = int(parts[1].strip())
-                    except ValueError:
-                        pass
-
-            # Parse protocol and port (e.g., "UDP or DCCP: 4791")
-            elif (
-                "UDP" in line or "TCP" in line or "DCCP" in line
-            ) and current_app_entry is not None:
-                if ":" in line:
-                    parts = line.split(":")
-                    if len(parts) >= 2:
-                        current_app_entry.protocol = parts[0].strip()
-                        try:
-                            current_app_entry.port = int(parts[1].strip())
-                        except ValueError:
-                            pass
-
-            # Parse TC Rate Limit: "TC Rate Limit: 100% 100% 100% 0% 0% 0% 0% 0%"
-            elif "TC Rate Limit:" in line:
-                parts = line.split("TC Rate Limit:")
-                if len(parts) >= 2:
-                    rate_entries = parts[1].strip().split()
-                    for rate in rate_entries:
-                        rate_clean = rate.rstrip("%")
-                        try:
-                            qos_info.tc_rate_limit.append(int(rate_clean))
-                        except ValueError:
-                            pass
-
-        # Add the last APP entry if exists
-        if current_app_entry:
-            qos_info.app_entries.append(current_app_entry)
-
-        return qos_info
+            if ":" not in line:
+                continue
+            if "NIC statistics" in line:
+                stats_dict["netdev"] = interface
+            elif "]: " in line and line.strip().startswith("["):
+                # Format: " [0]: rx_ucast_packets: 162"
+                bracket_part, rest = line.split("]: ", 1)
+                index = bracket_part.strip().lstrip("[")
+                if ": " in rest:
+                    stat_key, stat_value = rest.split(": ", 1)
+                    key = f"{index}_{stat_key.strip()}"
+                    stats_dict[key] = stat_value.strip()
+                else:
+                    key, value = line.split(":", 1)
+                    stats_dict[key.strip()] = value.strip()
+            else:
+                key, value = line.split(":", 1)
+                stats_dict[key.strip()] = value.strip()
+        return stats_dict

     def _collect_ethtool_info(self, interfaces: List[NetworkInterface]) -> Dict[str, EthtoolInfo]:
         """Collect ethtool information for all network interfaces.
@@ -1399,6 +490,19 @@ def _collect_ethtool_info(self, interfaces: List[NetworkInterface]) -> Dict[str,
                 if res_ethtool.exit_code == 0:
                     ethtool_info = self._parse_ethtool(iface.name, res_ethtool.stdout)

+                    # Collect ethtool -S (statistics) for error/health analysis
+                    cmd_s = self.CMD_ETHTOOL_S_TEMPLATE.format(interface=iface.name)
+                    res_ethtool_s = self._run_sut_cmd(cmd_s, sudo=True)
+                    if res_ethtool_s.exit_code == 0 and res_ethtool_s.stdout:
+                        ethtool_info.statistics = self._parse_ethtool_statistics(
+                            res_ethtool_s.stdout, iface.name
+                        )
+                        self.result.artifacts.append(
+                            TextFileArtifact(
+                                filename=f"{iface.name}.log",
+                                contents=res_ethtool_s.stdout,
+                            )
+                        )
                     ethtool_data[iface.name] = ethtool_info
                     self._log_event(
                         category=EventCategory.NETWORK,
@@ -1449,230 +553,6 @@ def _collect_lldp_info(self) -> None:
                 priority=EventPriority.INFO,
             )

-    def _collect_broadcom_nic_info(
-        self,
-    ) -> Tuple[List[BroadcomNicDevice], Dict[int, BroadcomNicQos]]:
-        """Collect Broadcom NIC information using niccli commands.
-
-        Returns:
-            Tuple of (list of BroadcomNicDevice, dict mapping device number to BroadcomNicQos)
-        """
-        devices = []
-        qos_data = {}
-
-        # First, list devices
-        res_listdev = self._run_sut_cmd(self.CMD_NICCLI_LISTDEV, sudo=True)
-        if res_listdev.exit_code == 0:
-            # Parse device list
-            devices = self._parse_niccli_listdev(res_listdev.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Broadcom NIC device list: {len(devices)} devices",
-                priority=EventPriority.INFO,
-            )
-
-            # Collect QoS info for each device
-            for device in devices:
-                cmd = self.CMD_NICCLI_GETQOS_TEMPLATE.format(device_num=device.device_num)
-                res_qos = self._run_sut_cmd(cmd, sudo=True)
-                if res_qos.exit_code == 0:
-                    qos_info = self._parse_niccli_qos(device.device_num, res_qos.stdout)
-                    qos_data[device.device_num] = qos_info
-                    self._log_event(
-                        category=EventCategory.NETWORK,
-                        description=f"Collected Broadcom NIC QoS info for device {device.device_num}",
-                        priority=EventPriority.INFO,
-                    )
-                else:
-                    self._log_event(
-                        category=EventCategory.NETWORK,
-                        description=f"Failed to collect QoS info for device {device.device_num}",
-                        data={"command": res_qos.command, "exit_code": res_qos.exit_code},
-                        priority=EventPriority.WARNING,
-                    )
-
-            if qos_data:
-                self._log_event(
-                    category=EventCategory.NETWORK,
-                    description=f"Collected Broadcom NIC QoS info for {len(qos_data)} devices",
-                    priority=EventPriority.INFO,
-                )
-        else:
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description="Broadcom NIC collection failed or niccli not available",
-                data={"command": res_listdev.command, "exit_code": res_listdev.exit_code},
-                priority=EventPriority.INFO,
-            )
-
-        return devices, qos_data
-
-    def _collect_pensando_nic_info(
-        self,
-    ) -> Tuple[
-        List[PensandoNicCard],
-        List[PensandoNicDcqcn],
-        List[PensandoNicEnvironment],
-        List[PensandoNicPcieAts],
-        List[PensandoNicPort],
-        List[PensandoNicQos],
-        List[PensandoNicRdmaStatistics],
-        Optional[PensandoNicVersionHostSoftware],
-        List[PensandoNicVersionFirmware],
-        List[str],
-    ]:
-        """Collect Pensando NIC information using nicctl commands.
-
-        Returns:
-            Tuple of (list of PensandoNicCard, list of PensandoNicDcqcn,
-            list of PensandoNicEnvironment, list of PensandoNicPcieAts,
-            list of PensandoNicPort, list of PensandoNicQos,
-            list of PensandoNicRdmaStatistics,
-            PensandoNicVersionHostSoftware object,
-            list of PensandoNicVersionFirmware,
-            list of uncollected command names)
-        """
-        cards = []
-        dcqcn_entries = []
-        environment_entries = []
-        pcie_ats_entries = []
-        port_entries = []
-        qos_entries = []
-        rdma_statistics_entries = []
-        version_host_software = None
-        version_firmware_entries = []
-
-        # Track which commands failed
-        uncollected_commands = []
-
-        # Parse nicctl show card output
-        res_card = self._run_sut_cmd(self.CMD_NICCTL_CARD, sudo=True)
-        if res_card.exit_code == 0:
-            cards = self._parse_nicctl_card(res_card.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC card list: {len(cards)} cards",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_CARD)
-
-        # Parse nicctl show dcqcn output
-        res_dcqcn = self._run_sut_cmd(self.CMD_NICCTL_DCQCN, sudo=True)
-        if res_dcqcn.exit_code == 0:
-            dcqcn_entries = self._parse_nicctl_dcqcn(res_dcqcn.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC DCQCN info: {len(dcqcn_entries)} entries",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_DCQCN)
-
-        # Parse nicctl show environment output
-        res_environment = self._run_sut_cmd(self.CMD_NICCTL_ENVIRONMENT, sudo=True)
-        if res_environment.exit_code == 0:
-            environment_entries = self._parse_nicctl_environment(res_environment.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC environment info: {len(environment_entries)} entries",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_ENVIRONMENT)
-
-        # Parse nicctl show pcie ats output
-        res_pcie_ats = self._run_sut_cmd(self.CMD_NICCTL_PCIE_ATS, sudo=True)
-        if res_pcie_ats.exit_code == 0:
-            pcie_ats_entries = self._parse_nicctl_pcie_ats(res_pcie_ats.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC PCIe ATS info: {len(pcie_ats_entries)} entries",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_PCIE_ATS)
-
-        # Parse nicctl show port output
-        res_port = self._run_sut_cmd(self.CMD_NICCTL_PORT, sudo=True)
-        if res_port.exit_code == 0:
-            port_entries = self._parse_nicctl_port(res_port.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC port info: {len(port_entries)} ports",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_PORT)
-
-        # Parse nicctl show qos output
-        res_qos = self._run_sut_cmd(self.CMD_NICCTL_QOS, sudo=True)
-        if res_qos.exit_code == 0:
-            qos_entries = self._parse_nicctl_qos(res_qos.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC QoS info: {len(qos_entries)} entries",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_QOS)
-
-        # Parse nicctl show rdma statistics output
-        res_rdma_stats = self._run_sut_cmd(self.CMD_NICCTL_RDMA_STATISTICS, sudo=True)
-        if res_rdma_stats.exit_code == 0:
-            rdma_statistics_entries = self._parse_nicctl_rdma_statistics(res_rdma_stats.stdout)
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC RDMA statistics: {len(rdma_statistics_entries)} entries",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_RDMA_STATISTICS)
-
-        # Parse nicctl show version host-software output
-        res_version_host = self._run_sut_cmd(self.CMD_NICCTL_VERSION_HOST_SOFTWARE, sudo=True)
-        if res_version_host.exit_code == 0:
-            version_host_software = self._parse_nicctl_version_host_software(
-                res_version_host.stdout
-            )
-            if version_host_software:
-                self._log_event(
-                    category=EventCategory.NETWORK,
-                    description="Collected Pensando NIC host software version",
-                    priority=EventPriority.INFO,
-                )
-            else:
-                uncollected_commands.append(self.CMD_NICCTL_VERSION_HOST_SOFTWARE)
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_VERSION_HOST_SOFTWARE)
-
-        # Parse nicctl show version firmware output
-        res_version_firmware = self._run_sut_cmd(self.CMD_NICCTL_VERSION_FIRMWARE, sudo=True)
-        if res_version_firmware.exit_code == 0:
-            version_firmware_entries = self._parse_nicctl_version_firmware(
-                res_version_firmware.stdout
-            )
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Collected Pensando NIC firmware versions: {len(version_firmware_entries)} entries",
-                priority=EventPriority.INFO,
-            )
-        else:
-            uncollected_commands.append(self.CMD_NICCTL_VERSION_FIRMWARE)
-
-        return (
-            cards,
-            dcqcn_entries,
-            environment_entries,
-            pcie_ats_entries,
-            port_entries,
-            qos_entries,
-            rdma_statistics_entries,
-            version_host_software,
-            version_firmware_entries,
-            uncollected_commands,
-        )
-
     def _check_network_connectivity(self, cmd: str, url: str) -> bool:
         """Check network connectivity using specified command.
@@ -1737,17 +617,6 @@ def collect_data(
         rules = []
         neighbors = []
         ethtool_data = {}
-        broadcom_devices: List[BroadcomNicDevice] = []
-        broadcom_qos_data: Dict[int, BroadcomNicQos] = {}
-        pensando_cards: List[PensandoNicCard] = []
-        pensando_dcqcn: List[PensandoNicDcqcn] = []
-        pensando_environment: List[PensandoNicEnvironment] = []
-        pensando_pcie_ats: List[PensandoNicPcieAts] = []
-        pensando_ports: List[PensandoNicPort] = []
-        pensando_qos: List[PensandoNicQos] = []
-        pensando_rdma_statistics: List[PensandoNicRdmaStatistics] = []
-        pensando_version_host_software: Optional[PensandoNicVersionHostSoftware] = None
-        pensando_version_firmware: List[PensandoNicVersionFirmware] = []
         network_accessible: Optional[bool] = None

         # Check network connectivity if URL is provided
@@ -1847,34 +716,7 @@ def collect_data(
         # Collect LLDP information
         self._collect_lldp_info()

-        # Collect Broadcom NIC information
-        broadcom_devices, broadcom_qos_data = self._collect_broadcom_nic_info()
-
-        # Collect Pensando NIC information
-        (
-            pensando_cards,
-            pensando_dcqcn,
-            pensando_environment,
-            pensando_pcie_ats,
-            pensando_ports,
-            pensando_qos,
-            pensando_rdma_statistics,
-            pensando_version_host_software,
-            pensando_version_firmware,
-            uncollected_commands,
-        ) = self._collect_pensando_nic_info()
-
-        # Log summary of uncollected commands or success
-        if uncollected_commands:
-            self.result.message = "Network data collection failed"
-            self._log_event(
-                category=EventCategory.NETWORK,
-                description=f"Failed to collect {len(uncollected_commands)} nicctl commands: {', '.join(uncollected_commands)}",
-                priority=EventPriority.WARNING,
-            )
-
-        else:
-            self.result.message = "Network data collected successfully"
+        self.result.message = "Network data collected successfully"

         network_data = NetworkDataModel(
             interfaces=interfaces,
@@ -1882,17 +724,6 @@ def collect_data(
             rules=rules,
             neighbors=neighbors,
             ethtool_info=ethtool_data,
-            broadcom_nic_devices=broadcom_devices,
-            broadcom_nic_qos=broadcom_qos_data,
-            pensando_nic_cards=pensando_cards,
-            pensando_nic_dcqcn=pensando_dcqcn,
-            pensando_nic_environment=pensando_environment,
-            pensando_nic_pcie_ats=pensando_pcie_ats,
-            pensando_nic_ports=pensando_ports,
-            pensando_nic_qos=pensando_qos,
-            pensando_nic_rdma_statistics=pensando_rdma_statistics,
-            pensando_nic_version_host_software=pensando_version_host_software,
-            pensando_nic_version_firmware=pensando_version_firmware,
             accessible=network_accessible,
         )
         self.result.status = ExecutionStatus.OK
diff --git a/nodescraper/plugins/inband/network/networkdata.py b/nodescraper/plugins/inband/network/networkdata.py
index e6817514..20caaeca 100644
--- a/nodescraper/plugins/inband/network/networkdata.py
+++ b/nodescraper/plugins/inband/network/networkdata.py
@@ -103,195 +103,8 @@ class EthtoolInfo(BaseModel):
     port: Optional[str] = None  # Port type (e.g., "Twisted Pair")
     auto_negotiation: Optional[str] = None  # Auto-negotiation status (e.g., "on", "off")
     link_detected: Optional[str] = None  # Link detection status (e.g., "yes", "no")
-
-
-class BroadcomNicDevice(BaseModel):
-    """Broadcom NIC device information from niccli --list_devices"""
-
-    device_num: int  # Device number (1, 2, 3, etc.)
-    model: Optional[str] = None  # e.g., "Broadcom BCM57608 1x400G QSFP-DD PCIe Ethernet NIC"
-    adapter_port: Optional[str] = None  # e.g., "Adp#1 Port#1"
-    interface_name: Optional[str] = None  # e.g., "benic1p1"
-    mac_address: Optional[str] = None  # e.g., "8C:84:74:37:C3:70"
-    pci_address: Optional[str] = None  # e.g., "0000:06:00.0"
-
-
-class BroadcomNicQosAppEntry(BaseModel):
-    """APP TLV entry in Broadcom NIC QoS configuration"""
-
-    priority: Optional[int] = None
-    sel: Optional[int] = None
-    dscp: Optional[int] = None
-    protocol: Optional[str] = None  # "UDP or DCCP", etc.
-    port: Optional[int] = None
-
-
-class BroadcomNicQos(BaseModel):
-    """Broadcom NIC QoS information from niccli --dev X qos --ets --show"""
-
-    device_num: int  # Device number this QoS info belongs to
-    raw_output: str  # Raw command output
-    # ETS Configuration
-    prio_map: Dict[int, int] = Field(
-        default_factory=dict
-    )  # Priority to TC mapping {0: 0, 1: 0, ...}
-    tc_bandwidth: List[int] = Field(
-        default_factory=list
-    )  # TC bandwidth percentages [50, 50, 0, ...]
-    tsa_map: Dict[int, str] = Field(
-        default_factory=dict
-    )  # TC to TSA mapping {0: "ets", 1: "ets", ...}
-    # PFC Configuration
-    pfc_enabled: Optional[int] = None  # Bitmap of PFC enabled priorities
-    # APP TLV entries
-    app_entries: List[BroadcomNicQosAppEntry] = Field(default_factory=list)
-    # TC Rate Limit
-    tc_rate_limit: List[int] = Field(default_factory=list)  # TC rate limits [100, 100, 100, ...]
-
-
-class PensandoNicCard(BaseModel):
-    """Pensando NIC card information from nicctl show card"""
-
-    id: str  # Card ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    asic: Optional[str] = None  # ASIC type (e.g., "salina")
-    fw_partition: Optional[str] = None  # Firmware partition (e.g., "A")
-    serial_number: Optional[str] = None  # Serial number (e.g., "FPL25330294")
-
-
-class PensandoNicDcqcn(BaseModel):
-    """Pensando NIC DCQCN information from nicctl show dcqcn"""
-
-    nic_id: str  # NIC ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    lif_id: Optional[str] = None  # Lif ID (UUID format)
-    roce_device: Optional[str] = None  # ROCE device name (e.g., "rocep9s0")
-    dcqcn_profile_id: Optional[str] = None  # DCQCN profile id (e.g., "1")
-    status: Optional[str] = None  # Status (e.g., "Disabled")
-
-
-class PensandoNicEnvironment(BaseModel):
-    """Pensando NIC environment information from nicctl show environment"""
-
-    nic_id: str  # NIC ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    # Power measurements in Watts
-    total_power_drawn: Optional[float] = None  # Total power drawn (pin)
-    core_power: Optional[float] = None  # Core power (pout1)
-    arm_power: Optional[float] = None  # ARM power (pout2)
-    # Temperature measurements in Celsius
-    local_board_temperature: Optional[float] = None  # Local board temperature
-    die_temperature: Optional[float] = None  # Die temperature
-    # Voltage measurements in millivolts
-    input_voltage: Optional[float] = None  # Input voltage
-    core_voltage: Optional[float] = None  # Core voltage
-    # Frequency measurements in MHz
-    core_frequency: Optional[float] = None  # Core frequency
-    cpu_frequency: Optional[float] = None  # CPU frequency
-    p4_stage_frequency: Optional[float] = None  # P4 stage frequency
-
-
-class PensandoNicPcieAts(BaseModel):
-    """Pensando NIC PCIe ATS information from nicctl show pcie ats"""
-
-    nic_id: str  # NIC ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    status: str  # Status (e.g., "Disabled", "Enabled")
-
-
-class PensandoNicPort(BaseModel):
-    """Pensando NIC port information from nicctl show port"""
-
-    nic_id: str  # NIC ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    port_id: str  # Port ID (UUID format)
-    port_name: str  # Port name (e.g., "eth1/1")
-    # Spec fields
-    spec_ifindex: Optional[str] = None
-    spec_type: Optional[str] = None
-    spec_speed: Optional[str] = None
-    spec_admin_state: Optional[str] = None
-    spec_fec_type: Optional[str] = None
-    spec_pause_type: Optional[str] = None
-    spec_num_lanes: Optional[int] = None
-    spec_mtu: Optional[int] = None
-    spec_tx_pause: Optional[str] = None
-    spec_rx_pause: Optional[str] = None
-    spec_auto_negotiation: Optional[str] = None
-    # Status fields
-    status_physical_port: Optional[int] = None
-    status_operational_status: Optional[str] = None
-    status_link_fsm_state: Optional[str] = None
-    status_fec_type: Optional[str] = None
-    status_cable_type: Optional[str] = None
-    status_num_lanes: Optional[int] = None
-    status_speed: Optional[str] = None
-    status_auto_negotiation: Optional[str] = None
-    status_mac_id: Optional[int] = None
-    status_mac_channel: Optional[int] = None
-    status_mac_address: Optional[str] = None
-    status_transceiver_type: Optional[str] = None
-    status_transceiver_state: Optional[str] = None
-    status_transceiver_pid: Optional[str] = None
-
-
-class PensandoNicQosScheduling(BaseModel):
-    """QoS Scheduling entry"""
-
-    priority: int
-    scheduling_type: Optional[str] = None  # e.g., "DWRR"
-    bandwidth: Optional[int] = None  # Bandwidth in percentage
-    rate_limit: Optional[str] = None  # Rate limit (e.g., "N/A" or value in Gbps)
-
-
-class PensandoNicQos(BaseModel):
-    """Pensando NIC QoS information from nicctl show qos"""
-
-    nic_id: str  # NIC ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    port_id: str  # Port ID (UUID format)
-    classification_type: Optional[str] = None  # e.g., "DSCP"
-    dscp_bitmap: Optional[str] = None  # DSCP bitmap
-    dscp_range: Optional[str] = None  # DSCP range (e.g., "0-63")
-    dscp_priority: Optional[int] = None  # Priority mapped from DSCP
-    pfc_priority_bitmap: Optional[str] = None  # PFC priority bitmap
-    pfc_no_drop_priorities: Optional[str] = None  # PFC no-drop priorities
-    scheduling: List[PensandoNicQosScheduling] = Field(default_factory=list)  # Scheduling entries
-
-
-class PensandoNicRdmaStatistic(BaseModel):
-    """RDMA statistic entry"""
-
-    name: str  # Statistic name
-    count: int  # Count value
-
-
-class PensandoNicRdmaStatistics(BaseModel):
-    """Pensando NIC RDMA statistics from nicctl show rdma statistics"""
-
-    nic_id: str  # NIC ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    statistics: List[PensandoNicRdmaStatistic] = Field(default_factory=list)  # Statistics entries
-
-
-class PensandoNicVersionHostSoftware(BaseModel):
-    """Pensando NIC host software version from nicctl show version host-software"""
-
-    nicctl: Optional[str] = None  # nicctl version
-    ipc_driver: Optional[str] = None  # IPC driver version
-    ionic_driver: Optional[str] = None  # ionic driver version
-
-
-class PensandoNicVersionFirmware(BaseModel):
-    """Pensando NIC firmware version from nicctl show version firmware"""
-
-    nic_id: str  # NIC ID (UUID format)
-    pcie_bdf: str  # PCIe Bus:Device.Function (e.g., "0000:06:00.0")
-    cpld: Optional[str] = None  # CPLD version
-    boot0: Optional[str] = None  # Boot0 version
-    uboot_a: Optional[str] = None  # Uboot-A version
-    firmware_a: Optional[str] = None  # Firmware-A version
-    device_config_a: Optional[str] = None  # Device config-A version
+    # ethtool -S (statistics) output: parsed key-value for error/health analysis
+    statistics: Dict[str, str] = Field(default_factory=dict)


 class NetworkDataModel(DataModel):
@@ -304,17 +117,4 @@ class NetworkDataModel(DataModel):
     ethtool_info: Dict[str, EthtoolInfo] = Field(
         default_factory=dict
     )  # Interface name -> EthtoolInfo mapping
-    broadcom_nic_devices: List[BroadcomNicDevice] = Field(default_factory=list)
-    broadcom_nic_qos: Dict[int, BroadcomNicQos] = Field(
-        default_factory=dict
-    )  # Device number -> QoS info mapping
-    pensando_nic_cards: List[PensandoNicCard] = Field(default_factory=list)
-    pensando_nic_dcqcn: List[PensandoNicDcqcn] = Field(default_factory=list)
-    pensando_nic_environment: List[PensandoNicEnvironment] = Field(default_factory=list)
-    pensando_nic_pcie_ats: List[PensandoNicPcieAts] = Field(default_factory=list)
-    pensando_nic_ports: List[PensandoNicPort] = Field(default_factory=list)
-    pensando_nic_qos: List[PensandoNicQos] = Field(default_factory=list)
-    pensando_nic_rdma_statistics: List[PensandoNicRdmaStatistics] = Field(default_factory=list)
-    pensando_nic_version_host_software: Optional[PensandoNicVersionHostSoftware] = None
-    pensando_nic_version_firmware: List[PensandoNicVersionFirmware] = Field(default_factory=list)
     accessible: Optional[bool] = None  # Network accessibility check via ping
diff --git a/nodescraper/plugins/inband/nic/__init__.py b/nodescraper/plugins/inband/nic/__init__.py
new file mode 100644
index 00000000..187e5b40
--- /dev/null
+++ b/nodescraper/plugins/inband/nic/__init__.py
@@ -0,0 +1,28 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from .nic_plugin import NicPlugin
+
+__all__ = ["NicPlugin"]
diff --git a/nodescraper/plugins/inband/nic/analyzer_args.py b/nodescraper/plugins/inband/nic/analyzer_args.py
new file mode 100644
index 00000000..65214b76
--- /dev/null
+++ b/nodescraper/plugins/inband/nic/analyzer_args.py
@@ -0,0 +1,75 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from typing import Any, Dict, List, Optional
+
+from pydantic import Field
+
+from nodescraper.models import AnalyzerArgs
+
+
+class NicAnalyzerArgs(AnalyzerArgs):
+    """Analyzer args for niccli/nicctl data, with expected_values keyed by canonical command key."""
+
+    expected_values: Optional[Dict[str, Dict[str, Any]]] = Field(
+        default=None,
+        description="Per-command expected checks keyed by canonical key (see command_to_canonical_key).",
+    )
+    performance_profile_expected: str = Field(
+        default="RoCE",
+        description="Expected Broadcom performance_profile value (case-insensitive). Default RoCE.",
+    )
+    support_rdma_disabled_values: List[str] = Field(
+        default_factory=lambda: ["0", "false", "disabled", "no", "off"],
+        description="Values that indicate RDMA is not supported (case-insensitive).",
+    )
+    pcie_relaxed_ordering_expected: str = Field(
+        default="enabled",
+        description="Expected Broadcom pcie_relaxed_ordering value (e.g. 'Relaxed ordering = enabled'); checked case-insensitively. Default enabled.",
+    )
+    # Expected QoS from niccli getqos (priorities, PFC, ETS) — applied across all adapters when set.
+    expected_qos_prio_map: Optional[Dict[Any, Any]] = Field(
+        default=None,
+        description="Expected priority-to-TC map (e.g. {0: 0, 1: 1}; keys may be int or str in config). Checked per device when set.",
+    )
+    expected_qos_pfc_enabled: Optional[int] = Field(
+        default=None,
+        description="Expected PFC enabled value (0/1 or bitmask). Checked per device when set.",
+    )
+    expected_qos_tsa_map: Optional[Dict[Any, Any]] = Field(
+        default=None,
+        description="Expected TSA map for ETS (e.g. {0: 'ets', 1: 'strict'}; keys may be int or str in config). Checked per device when set.",
+    )
+    expected_qos_tc_bandwidth: Optional[List[int]] = Field(
+        default=None,
+        description="Expected TC bandwidth percentages. Checked per device when set.",
+    )
+    require_qos_consistent_across_adapters: bool = Field(
+        default=True,
+        description="When True and no expected_qos_* are set, require all adapters to have the same prio_map, pfc_enabled, and tsa_map.",
+    )
+    nicctl_log_error_regex: Optional[List[Dict[str, Any]]] = Field(
+        default=None, description="Optional list of error patterns for nicctl show card logs."
+ ) diff --git a/nodescraper/plugins/inband/nic/collector_args.py b/nodescraper/plugins/inband/nic/collector_args.py new file mode 100644 index 00000000..8085b632 --- /dev/null +++ b/nodescraper/plugins/inband/nic/collector_args.py @@ -0,0 +1,36 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
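The `expected_qos_prio_map` and `expected_qos_tsa_map` fields above note that "keys may be int or str in config": JSON configs deliver string keys, while collected QoS data uses int keys, so the analyzer coerces expected maps before comparing. A minimal standalone sketch of that normalization, outside the plugin framework (the function names here are illustrative, not the package's API):

```python
# Standalone sketch: coerce JSON-sourced expected QoS maps to the key/value
# types used by collected data, so equality comparison is meaningful.
from typing import Any, Dict, Optional


def normalize_prio_map(d: Optional[Dict[Any, Any]]) -> Optional[Dict[int, int]]:
    """Coerce a priority-to-TC map to int keys and int values."""
    if d is None:
        return None
    return {int(k): int(v) for k, v in d.items()}


def normalize_tsa_map(d: Optional[Dict[Any, Any]]) -> Optional[Dict[int, str]]:
    """Coerce a TSA map to int keys and str values (e.g. 'ets', 'strict')."""
    if d is None:
        return None
    return {int(k): str(v) for k, v in d.items()}


# A JSON config like {"0": 0, "1": "1"} only compares equal to collected
# int-keyed data after normalization.
assert normalize_prio_map({"0": 0, "1": "1"}) == {0: 0, 1: 1}
assert normalize_tsa_map({"0": "ets", 1: "strict"}) == {0: "ets", 1: "strict"}
assert normalize_prio_map(None) is None
```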
+# +############################################################################### +from typing import List, Optional + +from nodescraper.models import CollectorArgs + + +class NicCollectorArgs(CollectorArgs): + """Collector args for niccli/nicctl collection.""" + + commands: Optional[List[str]] = None + use_sudo_niccli: bool = True + use_sudo_nicctl: bool = True diff --git a/nodescraper/plugins/inband/nic/nic_analyzer.py b/nodescraper/plugins/inband/nic/nic_analyzer.py new file mode 100644 index 00000000..30543867 --- /dev/null +++ b/nodescraper/plugins/inband/nic/nic_analyzer.py @@ -0,0 +1,322 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE.
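The analyzer's default card-log scan (`DEFAULT_NICCTL_LOG_ERROR_REGEX` below) applies word-boundary, case-insensitive patterns for error/fail/fault/critical and raises a WARNING event per match. A self-contained sketch of just the matching, without the plugin framework (names here are illustrative):

```python
import re

# Standalone sketch of the default nicctl card-log scan: word-boundary,
# case-insensitive patterns mirroring the analyzer's defaults.
DEFAULT_PATTERNS = [
    re.compile(r"\berror\b", re.IGNORECASE),
    re.compile(r"\bfail(?:ed|ure)?\b", re.IGNORECASE),
    re.compile(r"\bfault\b", re.IGNORECASE),
    re.compile(r"\bcritical\b", re.IGNORECASE),
]


def scan_log(content: str) -> list:
    """Return all matched substrings, grouped pattern-by-pattern."""
    hits = []
    for pattern in DEFAULT_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(content))
    return hits


log = "link up\nFW download FAILED\nthermal fault detected\n"
assert scan_log(log) == ["FAILED", "fault"]
```

Note that the word boundaries keep substrings like "faulty" from matching only when the whole word differs; "FAILED" still matches because the pattern itself covers the inflected forms.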
+# +############################################################################### + +import re +from typing import Any, Dict, List, Optional + +from nodescraper.base.regexanalyzer import ErrorRegex +from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus +from nodescraper.interfaces import DataAnalyzer +from nodescraper.models import TaskResult + +from .analyzer_args import NicAnalyzerArgs +from .nic_data import NicDataModel + +# Default regexes for nicctl show card logs (boot-fault, persistent, non-persistent) +DEFAULT_NICCTL_LOG_ERROR_REGEX: List[ErrorRegex] = [ + ErrorRegex( + regex=re.compile(r"\berror\b", re.IGNORECASE), + message="nicctl card log: error", + event_category=EventCategory.NETWORK, + event_priority=EventPriority.WARNING, + ), + ErrorRegex( + regex=re.compile(r"\bfail(?:ed|ure)?\b", re.IGNORECASE), + message="nicctl card log: fail/failed/failure", + event_category=EventCategory.NETWORK, + event_priority=EventPriority.WARNING, + ), + ErrorRegex( + regex=re.compile(r"\bfault\b", re.IGNORECASE), + message="nicctl card log: fault", + event_category=EventCategory.NETWORK, + event_priority=EventPriority.WARNING, + ), + ErrorRegex( + regex=re.compile(r"\bcritical\b", re.IGNORECASE), + message="nicctl card log: critical", + event_category=EventCategory.NETWORK, + event_priority=EventPriority.WARNING, + ), +] + + +def _nicctl_log_error_regex_list( + args: NicAnalyzerArgs, +) -> List[ErrorRegex]: + """Return list of ErrorRegex for nicctl card logs (from args or default).""" + if not args.nicctl_log_error_regex: + return list(DEFAULT_NICCTL_LOG_ERROR_REGEX) + out: List[ErrorRegex] = [] + for item in args.nicctl_log_error_regex: + if isinstance(item, ErrorRegex): + out.append(item) + elif isinstance(item, dict): + d = dict(item) + d["regex"] = re.compile(d["regex"]) if isinstance(d.get("regex"), str) else d["regex"] + if "event_category" in d and isinstance(d["event_category"], str): + d["event_category"] = 
EventCategory(d["event_category"]) + if "event_priority" in d: + p = d["event_priority"] + if isinstance(p, str): + d["event_priority"] = getattr(EventPriority, p.upper(), EventPriority.WARNING) + elif isinstance(p, int): + d["event_priority"] = EventPriority(p) + out.append(ErrorRegex(**d)) + return out + + +def _normalize_prio_map(d: Optional[Dict[Any, Any]]) -> Optional[Dict[int, int]]: + """Convert expected_qos_prio_map (config may have str keys) to Dict[int, int].""" + if d is None: + return None + return {int(k): int(v) for k, v in d.items()} + + +def _normalize_tsa_map(d: Optional[Dict[Any, Any]]) -> Optional[Dict[int, str]]: + """Convert expected_qos_tsa_map (config may have str keys) to Dict[int, str].""" + if d is None: + return None + return {int(k): str(v) for k, v in d.items()} + + +class NicAnalyzer(DataAnalyzer[NicDataModel, NicAnalyzerArgs]): + """Analyze niccli/nicctl data; checks Broadcom support_rdma, performance_profile (RoCE), pcie_relaxed_ordering (enabled), and getqos (expected QoS across adapters).""" + + DATA_MODEL = NicDataModel + + def analyze_data( + self, data: NicDataModel, args: Optional[NicAnalyzerArgs] = None + ) -> TaskResult: + """Run checks on the collected data (Broadcom support_rdma, performance_profile, pcie_relaxed_ordering, getqos per device).""" + if args is None: + args = NicAnalyzerArgs() + + has_broadcom = bool(data.broadcom_nic_support_rdma) + has_nicctl_logs = bool( + data.nicctl_card_logs and any((c or "").strip() for c in data.nicctl_card_logs.values()) + ) + if not has_broadcom and not has_nicctl_logs: + self.result.message = "No Broadcom support_rdma or nicctl card log data to check" + self.result.status = ExecutionStatus.OK + return self.result + + disabled_values = set(args.support_rdma_disabled_values) + any_disabled = False + for device_num, value in sorted(data.broadcom_nic_support_rdma.items()): + value_lower = (value or "").strip().lower() + if value_lower in disabled_values: + any_disabled = True + 
self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: support_rdma is disabled or off", + data={"device_num": device_num, "support_rdma_output": value}, + priority=EventPriority.WARNING, + console_log=True, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: support_rdma = {value!r}", + data={"device_num": device_num, "support_rdma_output": value}, + priority=EventPriority.INFO, + ) + + if any_disabled: + self.result.message = "One or more Broadcom devices have support_rdma disabled" + self.result.status = ExecutionStatus.WARNING + + # performance_profile expected value check (default RoCE) + expected_profile = args.performance_profile_expected.strip() + expected_profile_lower = expected_profile.lower() + any_non_roce = False + if data.broadcom_nic_performance_profile: + for device_num, value in sorted(data.broadcom_nic_performance_profile.items()): + value_normalized = (value or "").strip().lower() + if expected_profile_lower not in value_normalized: + any_non_roce = True + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: performance_profile is {value!r} (expected {expected_profile})", + data={"device_num": device_num, "performance_profile_output": value}, + priority=EventPriority.WARNING, + console_log=True, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: performance_profile = {expected_profile}", + data={"device_num": device_num, "performance_profile_output": value}, + priority=EventPriority.INFO, + ) + + # pcie_relaxed_ordering check (default: output should indicate "enabled") + expected_ro = args.pcie_relaxed_ordering_expected.strip().lower() + any_relaxed_ordering_bad = False + if data.broadcom_nic_pcie_relaxed_ordering and expected_ro: + for device_num, value in sorted(data.broadcom_nic_pcie_relaxed_ordering.items()): + value_lower = (value or 
"").strip().lower() + if expected_ro not in value_lower: + any_relaxed_ordering_bad = True + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: pcie_relaxed_ordering does not show {args.pcie_relaxed_ordering_expected!r} (got {value!r})", + data={"device_num": device_num, "pcie_relaxed_ordering_output": value}, + priority=EventPriority.WARNING, + console_log=True, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: pcie_relaxed_ordering = {args.pcie_relaxed_ordering_expected}", + data={"device_num": device_num, "pcie_relaxed_ordering_output": value}, + priority=EventPriority.INFO, + ) + + # getqos: expected QoS (priorities, PFC, ETS) across all adapters + any_qos_mismatch = False + expected_prio = _normalize_prio_map(args.expected_qos_prio_map) + expected_tsa = _normalize_tsa_map(args.expected_qos_tsa_map) + if ( + expected_prio is not None + or args.expected_qos_pfc_enabled is not None + or expected_tsa is not None + or args.expected_qos_tc_bandwidth is not None + ): + for device_num, qos in sorted(data.broadcom_nic_qos.items()): + mismatches = [] + if expected_prio is not None and qos.prio_map != expected_prio: + mismatches.append(f"prio_map {qos.prio_map!r} != expected {expected_prio!r}") + if ( + args.expected_qos_pfc_enabled is not None + and qos.pfc_enabled != args.expected_qos_pfc_enabled + ): + mismatches.append( + f"pfc_enabled {qos.pfc_enabled!r} != expected {args.expected_qos_pfc_enabled!r}" + ) + if expected_tsa is not None and qos.tsa_map != expected_tsa: + mismatches.append(f"tsa_map {qos.tsa_map!r} != expected {expected_tsa!r}") + if ( + args.expected_qos_tc_bandwidth is not None + and qos.tc_bandwidth != args.expected_qos_tc_bandwidth + ): + mismatches.append( + f"tc_bandwidth {qos.tc_bandwidth!r} != expected {args.expected_qos_tc_bandwidth!r}" + ) + if mismatches: + any_qos_mismatch = True + self._log_event( + category=EventCategory.NETWORK, + 
description=f"Broadcom device {device_num}: getqos does not match expected QoS: {'; '.join(mismatches)}", + data={ + "device_num": device_num, + "qos": qos.model_dump(), + "mismatches": mismatches, + }, + priority=EventPriority.WARNING, + console_log=True, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: getqos matches expected (priorities, PFC, ETS)", + data={"device_num": device_num}, + priority=EventPriority.INFO, + ) + elif args.require_qos_consistent_across_adapters and len(data.broadcom_nic_qos) >= 2: + qos_list = list(data.broadcom_nic_qos.values()) + first = qos_list[0] + for device_num, qos in sorted(data.broadcom_nic_qos.items()): + if ( + qos.prio_map != first.prio_map + or qos.pfc_enabled != first.pfc_enabled + or qos.tsa_map != first.tsa_map + ): + any_qos_mismatch = True + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: getqos differs from other adapters (priorities, PFC, or ETS not consistent)", + data={"device_num": device_num, "qos": qos.model_dump()}, + priority=EventPriority.WARNING, + console_log=True, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description=f"Broadcom device {device_num}: getqos consistent with other adapters", + data={"device_num": device_num}, + priority=EventPriority.INFO, + ) + + # nicctl card logs (boot-fault, persistent, non-persistent): run error regexes and log matches to user. + any_nicctl_log_errors = False + if data.nicctl_card_logs: + regex_list = _nicctl_log_error_regex_list(args) + for log_type, content in data.nicctl_card_logs.items(): + if not (content or "").strip(): + continue + for err_regex in regex_list: + for match in err_regex.regex.finditer(content): + matched_text = match.group(0).strip() or match.group(0) + if len(matched_text) > 500: + matched_text = matched_text[:497] + "..." 
+ any_nicctl_log_errors = True + self._log_event( + category=err_regex.event_category, + description=f"nicctl card log ({log_type}): {err_regex.message} — {matched_text!r}", + data={ + "log_type": log_type, + "message": err_regex.message, + "match_content": matched_text, + }, + priority=err_regex.event_priority, + console_log=True, + ) + + if ( + any_disabled + or any_non_roce + or any_relaxed_ordering_bad + or any_qos_mismatch + or any_nicctl_log_errors + ): + self.result.status = ExecutionStatus.WARNING + parts = [] + if any_disabled: + parts.append("support_rdma") + if any_non_roce: + parts.append("performance_profile") + if any_relaxed_ordering_bad: + parts.append("pcie_relaxed_ordering") + if any_qos_mismatch: + parts.append("getqos") + if any_nicctl_log_errors: + parts.append("nicctl_card_logs") + self.result.message = f"Broadcom/nic check(s) failed: {' and/or '.join(parts)}" + else: + self.result.status = ExecutionStatus.OK + self.result.message = "Broadcom support_rdma, performance_profile, pcie_relaxed_ordering, getqos, and nicctl card logs checks OK" + return self.result diff --git a/nodescraper/plugins/inband/nic/nic_collector.py b/nodescraper/plugins/inband/nic/nic_collector.py new file mode 100644 index 00000000..fd38939a --- /dev/null +++ b/nodescraper/plugins/inband/nic/nic_collector.py @@ -0,0 +1,1312 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +import json +import re +from typing import Any, Dict, List, Optional, Tuple + +from nodescraper.base import InBandDataCollector +from nodescraper.connection.inband import TextFileArtifact +from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus +from nodescraper.models import TaskResult +from nodescraper.utils import ( + command_result_event_data, + has_command_error_output, +) + +from .collector_args import NicCollectorArgs +from .nic_data import ( + NicCliDevice, + NicCliQos, + NicCliQosAppEntry, + NicCommandResult, + NicCtlCard, + NicCtlCardShow, + NicCtlDcqcn, + NicCtlEnvironment, + NicCtlLif, + NicCtlPort, + NicCtlQos, + NicCtlRdma, + NicCtlVersion, + NicDataModel, + PensandoNicCard, + PensandoNicDcqcn, + PensandoNicEnvironment, + PensandoNicLif, + PensandoNicPcieAts, + PensandoNicPort, + PensandoNicQos, + PensandoNicQosScheduling, + PensandoNicRdmaStatistic, + PensandoNicRdmaStatistics, + PensandoNicVersionFirmware, + PensandoNicVersionHostSoftware, + command_to_canonical_key, +) + +# Default commands: niccli (Broadcom) and nicctl (Pensando). Use {device_num} and {card_id} placeholders. 
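The default command templates carry `{device_num}` and `{card_id}` placeholders that the collector expands per discovered Broadcom device or Pensando card via `str.format`, passing plain commands through unchanged. A standalone sketch of that expansion (the helper name is illustrative):

```python
# Sketch of the placeholder expansion the collector performs when building
# its flat command list: one command per device for {device_num} templates,
# one per card for {card_id} templates, plain commands unchanged.
from typing import List


def expand_commands(
    templates: List[str], device_nums: List[int], card_ids: List[str]
) -> List[str]:
    out: List[str] = []
    for tpl in templates:
        if "{device_num}" in tpl:
            out.extend(tpl.format(device_num=d) for d in device_nums)
        elif "{card_id}" in tpl:
            out.extend(tpl.format(card_id=c) for c in card_ids)
        else:
            out.append(tpl)
    return out


cmds = expand_commands(
    [
        "niccli -dev {device_num} getqos",
        "nicctl show dcqcn --card {card_id} --json",
        "nicctl --version",
    ],
    [1, 2],
    ["a1"],
)
assert cmds == [
    "niccli -dev 1 getqos",
    "niccli -dev 2 getqos",
    "nicctl show dcqcn --card a1 --json",
    "nicctl --version",
]
```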
+NICCLI_VERSION_CMD = "niccli --version" +NICCLI_VERSION_LEGACY_MAX = 233 # Commands below use -dev/-getoption/getqos; for version > this use --dev/--getoption/qos --ets --show +NICCLI_LIST_CMD = "niccli --list" +NICCLI_LIST_DEVICES_CMD = "niccli --list_devices" # new (> v233) +NICCLI_LIST_DEVICES_CMD_LEGACY = "niccli --listdev" # legacy (<= v233) +NICCLI_DISCOVERY_CMDS_LEGACY = [ + NICCLI_LIST_DEVICES_CMD_LEGACY, + NICCLI_LIST_CMD, +] +NICCLI_DISCOVERY_CMDS_NEW = [ + NICCLI_LIST_DEVICES_CMD, + NICCLI_LIST_CMD, +] +# All discovery command variants (for canonical key); default list for backward compat = legacy +NICCLI_DISCOVERY_CMDS = NICCLI_DISCOVERY_CMDS_LEGACY +NICCLI_DISCOVERY_CMDS_ALL = frozenset( + [NICCLI_LIST_DEVICES_CMD_LEGACY, NICCLI_LIST_DEVICES_CMD, NICCLI_LIST_CMD] +) +# Legacy (<= v233): single-dash options and getqos +NICCLI_SUPPORT_RDMA_CMD_TEMPLATE_LEGACY = ( + "niccli -dev {device_num} nvm -getoption support_rdma -scope 0" +) +NICCLI_PERFORMANCE_PROFILE_CMD_TEMPLATE_LEGACY = ( + "niccli -dev {device_num} nvm -getoption performance_profile" +) +NICCLI_PCIE_RELAXED_ORDERING_CMD_TEMPLATE_LEGACY = ( + "niccli -dev {device_num} nvm -getoption pcie_relaxed_ordering" +) +NICCLI_QOS_CMD_TEMPLATE_LEGACY = "niccli -dev {device_num} getqos" +NICCLI_PER_DEVICE_TEMPLATES_LEGACY = [ + NICCLI_SUPPORT_RDMA_CMD_TEMPLATE_LEGACY, + NICCLI_PERFORMANCE_PROFILE_CMD_TEMPLATE_LEGACY, + NICCLI_PCIE_RELAXED_ORDERING_CMD_TEMPLATE_LEGACY, + NICCLI_QOS_CMD_TEMPLATE_LEGACY, +] +# New (> v233): double-dash options and qos --ets --show +NICCLI_SUPPORT_RDMA_CMD_TEMPLATE_NEW = "niccli --dev {device_num} nvm --getoption support_rdma" +NICCLI_PERFORMANCE_PROFILE_CMD_TEMPLATE_NEW = ( + "niccli --dev {device_num} nvm --getoption performance_profile" +) +NICCLI_PCIE_RELAXED_ORDERING_CMD_TEMPLATE_NEW = ( + "niccli --dev {device_num} nvm --getoption pcie_relaxed_ordering" +) +NICCLI_QOS_CMD_TEMPLATE_NEW = "niccli --dev {device_num} qos --ets --show" +NICCLI_PER_DEVICE_TEMPLATES_NEW = [ + 
NICCLI_SUPPORT_RDMA_CMD_TEMPLATE_NEW, + NICCLI_PERFORMANCE_PROFILE_CMD_TEMPLATE_NEW, + NICCLI_PCIE_RELAXED_ORDERING_CMD_TEMPLATE_NEW, + NICCLI_QOS_CMD_TEMPLATE_NEW, +] +# Backward compatibility: default to legacy templates (used by _default_commands and any code that imports these) +NICCLI_SUPPORT_RDMA_CMD_TEMPLATE = NICCLI_SUPPORT_RDMA_CMD_TEMPLATE_LEGACY +NICCLI_PERFORMANCE_PROFILE_CMD_TEMPLATE = NICCLI_PERFORMANCE_PROFILE_CMD_TEMPLATE_LEGACY +NICCLI_PCIE_RELAXED_ORDERING_CMD_TEMPLATE = NICCLI_PCIE_RELAXED_ORDERING_CMD_TEMPLATE_LEGACY +NICCLI_PER_DEVICE_TEMPLATES = NICCLI_PER_DEVICE_TEMPLATES_LEGACY +# Text-format command for card discovery and pensando_nic_cards (no --json). +NICCTL_CARD_TEXT_CMD = "nicctl show card" +NICCTL_GLOBAL_COMMANDS = [ + "nicctl --version", + "nicctl show card flash partition --json", + "nicctl show card interrupts --json", + "nicctl show card logs --non-persistent", + "nicctl show card logs --boot-fault", + "nicctl show card logs --persistent", + "nicctl show card profile --json", + "nicctl show card time --json", + "nicctl show card statistics packet-buffer summary --json", + "nicctl show lif statistics --json", + "nicctl show lif internal queue-to-ud-pinning", + "nicctl show pipeline internal anomalies", + "nicctl show pipeline internal rsq-ring", + "nicctl show pipeline internal statistics memory", + "nicctl show port fsm", + "nicctl show port transceiver --json", + "nicctl show port statistics --json", + "nicctl show port internal mac", + "nicctl show qos headroom --json", + "nicctl show rdma queue --json", + "nicctl show rdma queue-pair --detail --json", + "nicctl show version firmware", +] +NICCTL_PER_CARD_TEMPLATES = [ + "nicctl show dcqcn --card {card_id} --json", + "nicctl show card hardware-config --card {card_id}", +] + +# Legacy text-format commands for Pensando (no --json); parsed by _parse_nicctl_* into pensando_nic_*. 
+NICCTL_LEGACY_TEXT_COMMANDS = [ + "nicctl show card", + "nicctl show dcqcn", + "nicctl show environment", + "nicctl show lif", + "nicctl show pcie ats", + "nicctl show port", + "nicctl show qos", + "nicctl show rdma statistics", + "nicctl show version host-software", +] + +# Max lengths for fields included in the serialized datamodel (keeps nicclidatamodel.json small). +MAX_COMMAND_LENGTH_IN_DATAMODEL = 256 +MAX_STDERR_LENGTH_IN_DATAMODEL = 512 + + +def _parse_niccli_version(stdout: str) -> Optional[int]: + """Parse niccli version number from 'niccli --version' output. + Handles formats like 'niccli v233', 'v233', 'version 233', '233'. + Returns None if version cannot be parsed. + """ + if not stdout or not stdout.strip(): + return None + # Match v233, v 233, version 233, niccli 233, etc. + match = re.search(r"v?\s*(\d+)|version\s+(\d+)|\b(\d{2,})\b", stdout.strip(), re.I) + if match: + for g in match.groups(): + if g is not None: + return int(g) + return None + + +def _get_niccli_per_device_templates(version: Optional[int]) -> List[str]: + """Return the per-device command templates for the given niccli version. + For version > NICCLI_VERSION_LEGACY_MAX (233) use new syntax (--dev, --getoption, qos --ets --show). + Otherwise use legacy syntax (-dev, -getoption, getqos). If version is None, default to legacy. + """ + if version is not None and version > NICCLI_VERSION_LEGACY_MAX: + return NICCLI_PER_DEVICE_TEMPLATES_NEW.copy() + return NICCLI_PER_DEVICE_TEMPLATES_LEGACY.copy() + + +def _get_niccli_discovery_commands(version: Optional[int]) -> List[str]: + """Return the discovery commands for the given niccli version. + Legacy (<= v233) uses --listdev; new (> v233) uses --list_devices. If version is None, default to legacy. 
+ """ + if version is not None and version > NICCLI_VERSION_LEGACY_MAX: + return NICCLI_DISCOVERY_CMDS_NEW.copy() + return NICCLI_DISCOVERY_CMDS_LEGACY.copy() + + +# Commands whose output is very long; store only as file artifacts, not in data model. +def _is_artifact_only_command(cmd: str) -> bool: + c = cmd.strip() + if c.startswith("nicctl show card logs "): + return True + if "nicctl show card hardware-config --card " in c: + return True + if c == "nicctl show port fsm": + return True + if c.startswith("nicctl show pipeline internal "): + return True + if c == "nicctl show rdma queue-pair --detail --json": + return True + if c == "nicctl show lif internal queue-to-ud-pinning": + return True + if c == "nicctl show port internal mac": + return True + return False + + +def _merged_canonical_key(cmd: str) -> str: + """Return a single canonical key for commands that collect the same data.""" + if cmd in NICCLI_DISCOVERY_CMDS_ALL: + return "niccli_discovery" + return command_to_canonical_key(cmd) + + +def _default_commands() -> List[str]: + """Return the default flat list of command templates (with placeholders).""" + out: List[str] = [NICCLI_LIST_CMD] + for t in NICCLI_PER_DEVICE_TEMPLATES: + out.append(t) + for c in NICCTL_GLOBAL_COMMANDS: + out.append(c) + for t in NICCTL_PER_CARD_TEMPLATES: + out.append(t) + return out + + +def _parse_niccli_qos_app_entries(stdout: str) -> List[NicCliQosAppEntry]: + """Parse APP# blocks from niccli qos output into NicCliQosAppEntry list.""" + entries: List[NicCliQosAppEntry] = [] + current: Optional[NicCliQosAppEntry] = None + for line in stdout.splitlines(): + line = line.strip() + if re.match(r"APP#\d+", line, re.I): + if current is not None: + entries.append(current) + current = NicCliQosAppEntry() + continue + if current is None or ":" not in line: + continue + key, _, val = line.partition(":") + key, val = key.strip().lower(), val.strip() + if "priority" in key: + try: + current.priority = int(val) + except ValueError: + 
pass + elif key == "sel": + try: + current.sel = int(val) + except ValueError: + pass + elif key == "dscp": + try: + current.dscp = int(val) + except ValueError: + pass + elif key == "port": + try: + current.port = int(val) + except ValueError: + pass + elif ( + key in ("tcp", "udp", "dccp") + or "protocol" in key + or "udp" in key + or "tcp" in key + or "dccp" in key + ): + if val and not val.isdigit(): + current.protocol = val + else: + current.protocol = { + "udp or dccp": "UDP or DCCP", + "tcp": "TCP", + "udp": "UDP", + "dccp": "DCCP", + }.get(key, key.replace("_", " ").title() if val.isdigit() else val) + if val: + try: + current.port = int(val) + except ValueError: + pass + if current is not None: + entries.append(current) + return entries + + +def _parse_niccli_device_numbers(stdout: str) -> List[int]: + """Parse device numbers from niccli --list or --list_devices output. + Looks for lines like '1) Model' or '1 )' to extract device index. + """ + device_nums: List[int] = [] + for line in stdout.splitlines(): + line = line.strip() + if not line: + continue + match = re.match(r"^(\d+)\s*\)", line) + if match: + try: + device_nums.append(int(match.group(1))) + except ValueError: + continue + return sorted(set(device_nums)) + + +def _parse_nicctl_card_ids(stdout: str) -> List[str]: + """Parse card IDs from nicctl show card --json output. + Expects JSON: either a list of objects with 'id'/'card_id' or an object with a list. 
+ """ + try: + data = json.loads(stdout) + except json.JSONDecodeError: + return [] + ids: List[str] = [] + if isinstance(data, list): + for item in data: + if isinstance(item, dict): + cid = item.get("id") or item.get("card_id") or item.get("CardId") + if cid is not None: + ids.append(str(cid)) + elif isinstance(data, dict): + cards = data.get("cards") or data.get("Cards") or data.get("card") or data.get("data") + if isinstance(cards, list): + for item in cards: + if isinstance(item, dict): + cid = item.get("id") or item.get("card_id") or item.get("CardId") + if cid is not None: + ids.append(str(cid)) + cid = data.get("id") or data.get("card_id") + if cid is not None and str(cid) not in ids: + ids.append(str(cid)) + return ids + + +def _card_list_items(data: Any) -> List[Any]: + """Return list of card item dicts from parsed nicctl show card --json.""" + if data is None: + return [] + if isinstance(data, list): + return [x for x in data if isinstance(x, dict)] + if isinstance(data, dict): + cards = data.get("cards") or data.get("Cards") or data.get("card") or data.get("data") + if isinstance(cards, list): + return [x for x in cards if isinstance(x, dict)] + return [] + + +def _find_card_info(card_list: List[Any], card_id: str) -> Optional[Any]: + """Return the card item dict whose id/card_id matches card_id.""" + for item in card_list: + cid = item.get("id") or item.get("card_id") or item.get("CardId") + if cid is not None and str(cid) == str(card_id): + return item + return None + + +def _build_structured( + results: Dict[str, NicCommandResult], + parsed: Dict[str, Any], + card_ids: List[str], + card_list_override: Optional[List[Dict[str, Any]]] = None, +) -> Tuple[ + Optional[NicCtlCardShow], + List[NicCtlCard], + Optional[NicCtlPort], + Optional[NicCtlLif], + Optional[NicCtlQos], + Optional[NicCtlRdma], + Optional[NicCtlDcqcn], + Optional[NicCtlEnvironment], + Optional[NicCtlVersion], +]: + """Build structured domain objects from results and parsed dicts.""" + + 
def _p(cmd: str) -> Any: + return parsed.get(cmd) + + def _r(cmd: str) -> Optional[NicCommandResult]: + return results.get(cmd) + + def _stdout(cmd: str) -> str: + r = _r(cmd) + return (r.stdout or "") if r else "" + + card_list = ( + card_list_override + if card_list_override is not None + else _card_list_items(_p("nicctl show card --json")) + ) + cards: List[NicCtlCard] = [] + for cid in card_ids: + info = _find_card_info(card_list, cid) + hw_cmd = f"nicctl show card hardware-config --card {cid}" + dcqcn_cmd = f"nicctl show dcqcn --card {cid} --json" + cards.append( + NicCtlCard( + card_id=cid, + info=info, + hardware_config=_stdout(hw_cmd) or None, + dcqcn=_p(dcqcn_cmd), + ) + ) + + card_show = NicCtlCardShow( + flash_partition=_p("nicctl show card flash partition --json"), + interrupts=_p("nicctl show card interrupts --json"), + logs_non_persistent=_stdout("nicctl show card logs --non-persistent") or None, + logs_boot_fault=_stdout("nicctl show card logs --boot-fault") or None, + logs_persistent=_stdout("nicctl show card logs --persistent") or None, + profile=_p("nicctl show card profile --json"), + time=_p("nicctl show card time --json"), + statistics_packet_buffer_summary=_p( + "nicctl show card statistics packet-buffer summary --json" + ), + ) + + port = NicCtlPort( + port=_p("nicctl show port"), + port_fsm=_stdout("nicctl show port fsm") or None, + port_transceiver=_p("nicctl show port transceiver --json"), + port_statistics=_p("nicctl show port statistics --json"), + port_internal_mac=_stdout("nicctl show port internal mac") or None, + ) + lif = NicCtlLif( + lif=_p("nicctl show lif"), + lif_statistics=_p("nicctl show lif statistics --json"), + lif_internal_queue_to_ud_pinning=_stdout("nicctl show lif internal queue-to-ud-pinning") + or None, + ) + qos = NicCtlQos( + qos=_p("nicctl show qos"), + qos_headroom=_p("nicctl show qos headroom --json"), + ) + rdma = NicCtlRdma( + rdma_queue=_p("nicctl show rdma queue --json"), + rdma_queue_pair_detail=_p("nicctl 
show rdma queue-pair --detail --json"), + rdma_statistics=_p("nicctl show rdma statistics"), + ) + dcqcn = NicCtlDcqcn(dcqcn_global=_p("nicctl show dcqcn")) + environment = NicCtlEnvironment(environment=_p("nicctl show environment")) + version = NicCtlVersion( + version=_stdout("nicctl --version") or None, + version_firmware=_stdout("nicctl show version firmware") or None, + ) + return card_show, cards, port, lif, qos, rdma, dcqcn, environment, version + + +class NicCollector(InBandDataCollector[NicDataModel, NicCollectorArgs]): + """Collect raw output from niccli (Broadcom) and nicctl (Pensando) commands.""" + + DATA_MODEL = NicDataModel + + def collect_data( + self, + args: Optional[NicCollectorArgs] = None, + ) -> Tuple[TaskResult, Optional[NicDataModel]]: + """Run niccli/nicctl commands and store stdout/stderr/exit_code per command.""" + use_sudo_niccli = args.use_sudo_niccli if args else True + use_sudo_nicctl = args.use_sudo_nicctl if args else True + custom_commands = args.commands if args and args.commands else None + + results: dict[str, NicCommandResult] = {} + + # Detect niccli version to choose command set (legacy <= v233 vs new > v233) + niccli_version: Optional[int] = None + res_version = self._run_sut_cmd(NICCLI_VERSION_CMD, sudo=use_sudo_niccli) + if res_version.exit_code == 0 and res_version.stdout: + niccli_version = _parse_niccli_version(res_version.stdout) + results[NICCLI_VERSION_CMD] = NicCommandResult( + command=NICCLI_VERSION_CMD, + stdout=res_version.stdout or "", + stderr=res_version.stderr or "", + exit_code=res_version.exit_code, + ) + + # Discovery: device numbers from niccli + device_nums: List[int] = [] + discovery_cmds = _get_niccli_discovery_commands(niccli_version) + for list_cmd in discovery_cmds: + res = self._run_sut_cmd(list_cmd, sudo=use_sudo_niccli) + results[list_cmd] = NicCommandResult( + command=list_cmd, + stdout=res.stdout or "", + stderr=res.stderr or "", + exit_code=res.exit_code, + ) + if res.exit_code == 0 and 
res.stdout: + device_nums = _parse_niccli_device_numbers(res.stdout) + if device_nums: + break + + # Discovery: card IDs from nicctl show card (text); same output used for pensando_nic_cards + card_ids: List[str] = [] + card_list_from_text: List[Dict[str, Any]] = [] + res_card = self._run_sut_cmd(NICCTL_CARD_TEXT_CMD, sudo=use_sudo_nicctl) + results[NICCTL_CARD_TEXT_CMD] = NicCommandResult( + command=NICCTL_CARD_TEXT_CMD, + stdout=res_card.stdout or "", + stderr=res_card.stderr or "", + exit_code=res_card.exit_code, + ) + if res_card.exit_code == 0 and res_card.stdout: + legacy_cards = self._parse_nicctl_card(res_card.stdout) + card_ids = [c.id for c in legacy_cards] + card_list_from_text = [c.model_dump() for c in legacy_cards] + + # Build full command list (expand placeholders) + if custom_commands is not None: + commands_to_run: List[str] = [] + for tpl in custom_commands: + if "{device_num}" in tpl: + for d in device_nums: + commands_to_run.append(tpl.format(device_num=d)) + elif "{card_id}" in tpl: + for c in card_ids: + commands_to_run.append(tpl.format(card_id=c)) + else: + commands_to_run.append(tpl) + else: + commands_to_run = [] + # niccli list already stored + per_device_templates = _get_niccli_per_device_templates(niccli_version) + for tpl in per_device_templates: + for d in device_nums: + commands_to_run.append(tpl.format(device_num=d)) + # nicctl global (card discovery already done via NICCTL_CARD_TEXT_CMD) + for c in NICCTL_GLOBAL_COMMANDS: + commands_to_run.append(c) + for tpl in NICCTL_PER_CARD_TEMPLATES: + for cid in card_ids: + commands_to_run.append(tpl.format(card_id=cid)) + for cmd in NICCTL_LEGACY_TEXT_COMMANDS: + commands_to_run.append(cmd) + + # Run each command and store (artifact-only commands are not added to results / data model). 
+        for cmd in commands_to_run:
+            if cmd in results:
+                continue
+            is_niccli = cmd.strip().startswith("niccli")
+            sudo = use_sudo_niccli if is_niccli else use_sudo_nicctl
+            res = self._run_sut_cmd(cmd, sudo=sudo)
+            has_error_output = has_command_error_output(res.stderr or "", res.stdout or "")
+            if not _is_artifact_only_command(cmd):
+                results[cmd] = NicCommandResult(
+                    command=cmd,
+                    stdout=res.stdout or "",
+                    stderr=res.stderr or "",
+                    exit_code=res.exit_code,
+                )
+            # Log warnings for all commands, including artifact-only ones.
+            if res.exit_code != 0:
+                self._log_event(
+                    category=EventCategory.NETWORK,
+                    description=f"niccli/nicctl command failed: {cmd}",
+                    data=command_result_event_data(res),
+                    priority=EventPriority.WARNING,
+                )
+            elif has_error_output:
+                self._log_event(
+                    category=EventCategory.NETWORK,
+                    description=f"niccli/nicctl reported errors (exit 0): {cmd}",
+                    data=command_result_event_data(res),
+                    priority=EventPriority.WARNING,
+                )
+
+        # Parse JSON for building structured domain objects (artifact-only
+        # commands were never stored in results, so they are not parsed).
+        parsed: Dict[str, Any] = {}
+        for cmd, r in results.items():
+            if r.exit_code != 0 or not (r.stdout or "").strip():
+                continue
+            try:
+                parsed[cmd] = json.loads(r.stdout.strip())
+            except (ValueError, TypeError):
+                pass
+
+        # Build structured domain objects from JSON/raw output (card_show/cards from text when present).
+ ( + card_show, + cards, + port, + lif, + qos, + rdma, + dcqcn, + environment, + version, + ) = _build_structured( + results, parsed, card_ids, card_list_override=card_list_from_text or None + ) + + # card_show and cards (can be large) go to TextFileArtifacts; excluded from datamodel. + if card_show is not None: + self.result.artifacts.append( + TextFileArtifact( + filename="niccli_card_show.json", + contents=card_show.model_dump_json(indent=2), + ) + ) + if cards: + self.result.artifacts.append( + TextFileArtifact( + filename="niccli_cards.json", + contents=json.dumps([c.model_dump(mode="json") for c in cards], indent=2), + ) + ) + + # Serialized nicclidatamodel.json: no stdout in results, truncated command/stderr (keeps file small). + # Command output lives on disk from _run_sut_cmd; model keeps only command identity and status. + def _truncate(s: str, max_len: int) -> str: + if not s or len(s) <= max_len: + return s or "" + return s[: max_len - 3] + "..." + + results_for_model = { + cmd: NicCommandResult( + command=_truncate(r.command, MAX_COMMAND_LENGTH_IN_DATAMODEL), + stdout="", + stderr=_truncate(r.stderr or "", MAX_STDERR_LENGTH_IN_DATAMODEL), + exit_code=r.exit_code, + ) + for cmd, r in results.items() + } + + # Legacy text parsers: populate broadcom_nic_* and pensando_nic_* for the datamodel. 
+ ( + broadcom_devices, + broadcom_qos_data, + broadcom_support_rdma, + broadcom_performance_profile, + broadcom_pcie_relaxed_ordering, + ) = self._collect_broadcom_nic_structured(results, niccli_version=niccli_version) + ( + pensando_cards, + pensando_dcqcn, + pensando_environment, + pensando_lif, + pensando_pcie_ats, + pensando_ports, + pensando_qos, + pensando_rdma_statistics, + pensando_version_host_software, + pensando_version_firmware, + ) = self._collect_pensando_nic_structured(results) + + if not results or all(r.exit_code != 0 for r in results.values()): + self.result.status = ExecutionStatus.EXECUTION_FAILURE + self.result.message = "All niccli/nicctl commands failed or no commands were run" + else: + self.result.status = ExecutionStatus.OK + self.result.message = f"Collected {len(results)} niccli/nicctl command results" + + nicctl_card_logs = None + if card_show is not None: + nicctl_card_logs = { + "boot_fault": (card_show.logs_boot_fault or ""), + "persistent": (card_show.logs_persistent or ""), + "non_persistent": (card_show.logs_non_persistent or ""), + } + + return self.result, NicDataModel( + results=results_for_model, + card_show=None, + cards=[], + nicctl_card_logs=nicctl_card_logs, + port=port, + lif=lif, + qos=qos, + rdma=rdma, + dcqcn=dcqcn, + environment=environment, + version=version, + broadcom_nic_devices=broadcom_devices, + broadcom_nic_qos=broadcom_qos_data, + broadcom_nic_support_rdma=broadcom_support_rdma, + broadcom_nic_performance_profile=broadcom_performance_profile, + broadcom_nic_pcie_relaxed_ordering=broadcom_pcie_relaxed_ordering, + pensando_nic_cards=pensando_cards, + pensando_nic_dcqcn=pensando_dcqcn, + pensando_nic_environment=pensando_environment, + pensando_nic_lif=pensando_lif, + pensando_nic_pcie_ats=pensando_pcie_ats, + pensando_nic_ports=pensando_ports, + pensando_nic_qos=pensando_qos, + pensando_nic_rdma_statistics=pensando_rdma_statistics, + pensando_nic_version_host_software=pensando_version_host_software, + 
pensando_nic_version_firmware=pensando_version_firmware, + ) + + def _collect_broadcom_nic_structured( + self, + results: Dict[str, NicCommandResult], + niccli_version: Optional[int] = None, + ) -> Tuple[ + List[NicCliDevice], Dict[int, NicCliQos], Dict[int, str], Dict[int, str], Dict[int, str] + ]: + """Build niccli (Broadcom) structured data from results using legacy text parsers.""" + devices: List[NicCliDevice] = [] + qos_data: Dict[int, NicCliQos] = {} + support_rdma: Dict[int, str] = {} + performance_profile: Dict[int, str] = {} + pcie_relaxed_ordering: Dict[int, str] = {} + list_stdout: Optional[str] = None + discovery_cmds = _get_niccli_discovery_commands(niccli_version) + for list_cmd in discovery_cmds: + r = results.get(list_cmd) + if r and r.exit_code == 0 and (r.stdout or "").strip(): + list_stdout = r.stdout + break + if not list_stdout: + return devices, qos_data, support_rdma, performance_profile, pcie_relaxed_ordering + devices = self._parse_niccli_listdev(list_stdout) + templates = _get_niccli_per_device_templates(niccli_version) + support_rdma_tpl, perf_tpl, pcie_ro_tpl, qos_tpl = ( + templates[0], + templates[1], + templates[2], + templates[3], + ) + for device in devices: + qos_cmd = qos_tpl.format(device_num=device.device_num) + r = results.get(qos_cmd) + if r and r.exit_code == 0 and (r.stdout or "").strip(): + qos_data[device.device_num] = self._parse_niccli_qos( + device.device_num, r.stdout or "" + ) + support_rdma_cmd = support_rdma_tpl.format(device_num=device.device_num) + r_sr = results.get(support_rdma_cmd) + if r_sr and r_sr.exit_code == 0 and (r_sr.stdout or "").strip(): + support_rdma[device.device_num] = (r_sr.stdout or "").strip() + perf_cmd = perf_tpl.format(device_num=device.device_num) + r_pp = results.get(perf_cmd) + if r_pp and r_pp.exit_code == 0 and (r_pp.stdout or "").strip(): + performance_profile[device.device_num] = (r_pp.stdout or "").strip() + ro_cmd = pcie_ro_tpl.format(device_num=device.device_num) + r_ro = 
results.get(ro_cmd) + if r_ro and r_ro.exit_code == 0 and (r_ro.stdout or "").strip(): + pcie_relaxed_ordering[device.device_num] = (r_ro.stdout or "").strip() + return devices, qos_data, support_rdma, performance_profile, pcie_relaxed_ordering + + def _collect_pensando_nic_structured(self, results: Dict[str, NicCommandResult]) -> Tuple[ + List[PensandoNicCard], + List[PensandoNicDcqcn], + List[PensandoNicEnvironment], + List[PensandoNicLif], + List[PensandoNicPcieAts], + List[PensandoNicPort], + List[PensandoNicQos], + List[PensandoNicRdmaStatistics], + Optional[PensandoNicVersionHostSoftware], + List[PensandoNicVersionFirmware], + ]: + """Build Pensando NIC structured data from results using legacy text parsers.""" + + def _stdout(cmd: str) -> str: + r = results.get(cmd) + return (r.stdout or "").strip() if r and r.exit_code == 0 else "" + + cards = self._parse_nicctl_card(_stdout("nicctl show card")) + dcqcn_entries = self._parse_nicctl_dcqcn(_stdout("nicctl show dcqcn")) + environment_entries = self._parse_nicctl_environment(_stdout("nicctl show environment")) + lif_entries = self._parse_nicctl_lif(_stdout("nicctl show lif")) + pcie_ats_entries = self._parse_nicctl_pcie_ats(_stdout("nicctl show pcie ats")) + port_entries = self._parse_nicctl_port(_stdout("nicctl show port")) + qos_entries = self._parse_nicctl_qos(_stdout("nicctl show qos")) + rdma_statistics_entries = self._parse_nicctl_rdma_statistics( + _stdout("nicctl show rdma statistics") + ) + version_host_software = self._parse_nicctl_version_host_software( + _stdout("nicctl show version host-software") + ) + version_firmware_entries = self._parse_nicctl_version_firmware( + _stdout("nicctl show version firmware") + ) + + return ( + cards, + dcqcn_entries, + environment_entries, + lif_entries, + pcie_ats_entries, + port_entries, + qos_entries, + rdma_statistics_entries, + version_host_software, + version_firmware_entries, + ) + + def _parse_niccli_listdev(self, stdout: str) -> List[NicCliDevice]: + 
"""Parse niccli --list_devices output into NicCliDevice list.""" + devices: List[NicCliDevice] = [] + current_num: Optional[int] = None + model = adapter_port = interface_name = mac_address = pci_address = None + for line in stdout.splitlines(): + line = line.strip() + if not line: + continue + num_match = re.match(r"^(\d+)\s*\)\s*(.*)", line) + if num_match: + if current_num is not None and model is not None: + devices.append( + NicCliDevice( + device_num=current_num, + model=model.strip() or None, + adapter_port=adapter_port, + interface_name=interface_name, + mac_address=mac_address, + pci_address=pci_address, + ) + ) + current_num = int(num_match.group(1)) + rest = num_match.group(2).strip() + if rest and "(" in rest and ")" in rest: + model = re.sub(r"\s*\([^)]+\)\s*$", "", rest).strip() or None + port_match = re.search(r"\(([^)]+)\)\s*$", rest) + adapter_port = port_match.group(1).strip() if port_match else None + else: + model = rest or None + adapter_port = None + interface_name = mac_address = pci_address = None + continue + if current_num is None: + continue + if ":" in line: + key, _, val = line.partition(":") + key, val = key.strip().lower(), val.strip() + if "interface" in key or "device interface" in key: + interface_name = val or None + elif "mac" in key: + mac_address = val or None + elif "pci" in key: + pci_address = val or None + if current_num is not None and ( + model is not None or interface_name or mac_address or pci_address + ): + devices.append( + NicCliDevice( + device_num=current_num, + model=model, + adapter_port=adapter_port, + interface_name=interface_name, + mac_address=mac_address, + pci_address=pci_address, + ) + ) + return devices + + def _parse_niccli_qos(self, device_num: int, stdout: str) -> NicCliQos: + """Parse niccli -dev X qos --ets --show output.""" + prio_map: Dict[int, int] = {} + tc_bandwidth: List[int] = [] + tsa_map: Dict[int, str] = {} + pfc_enabled: Optional[int] = None + app_entries: List[NicCliQosAppEntry] = [] + 
tc_rate_limit: List[int] = [] + for line in stdout.splitlines(): + line = line.strip() + if "PRIO_MAP:" in line or "PRIO_MAP" in line: + for part in re.findall(r"(\d+):(\d+)", line): + prio_map[int(part[0])] = int(part[1]) + if "TC Bandwidth:" in line: + tc_bandwidth = [int(x) for x in re.findall(r"(\d+)%", line)] + if "TSA_MAP:" in line: + for i, m in enumerate(re.findall(r"\d+:(\w+)", line)): + tsa_map[i] = m + if "PFC enabled:" in line: + m = re.search(r"PFC enabled:\s*(\d+)", line, re.I) + if m: + pfc_enabled = int(m.group(1)) + if "APP#" in line and not app_entries: + app_entries = _parse_niccli_qos_app_entries(stdout) + if "TC Rate Limit:" in line: + tc_rate_limit = [int(x) for x in re.findall(r"(\d+)%", line)] + return NicCliQos( + device_num=device_num, + raw_output=stdout, + prio_map=prio_map, + tc_bandwidth=tc_bandwidth, + tsa_map=tsa_map, + pfc_enabled=pfc_enabled, + app_entries=app_entries, + tc_rate_limit=tc_rate_limit, + ) + + def _parse_nicctl_card(self, stdout: str) -> List[PensandoNicCard]: + """Parse nicctl show card (text table) into PensandoNicCard list.""" + cards: List[PensandoNicCard] = [] + for line in stdout.splitlines(): + line = line.strip() + if not line or line.startswith("-") or "PCIe BDF" in line or "Id " in line: + continue + parts = line.split() + if ( + len(parts) >= 2 + and re.match(r"^[0-9a-f-]{36}$", parts[0]) + and re.match(r"^[0-9a-f:.]{12,}$", parts[1]) + ): + card_id, pcie_bdf = parts[0], parts[1] + asic = parts[2] if len(parts) > 2 and not parts[2].startswith("0") else None + fw_partition = parts[3] if len(parts) > 3 and parts[3] in ("A", "B") else None + serial_number = parts[4] if len(parts) > 4 else None + cards.append( + PensandoNicCard( + id=card_id, + pcie_bdf=pcie_bdf, + asic=asic, + fw_partition=fw_partition, + serial_number=serial_number, + ) + ) + return cards + + def _parse_nicctl_dcqcn(self, stdout: str) -> List[PensandoNicDcqcn]: + """Parse nicctl show dcqcn (text) into PensandoNicDcqcn list.""" + entries: 
List[PensandoNicDcqcn] = [] + nic_id = pcie_bdf = None + lif_id = roce_device = dcqcn_profile_id = status = None + for line in stdout.splitlines(): + if "NIC :" in line or "NIC:" in line: + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)", line) + if m: + nic_id, pcie_bdf = m.group(1).strip(), m.group(2).strip() + lif_id = roce_device = dcqcn_profile_id = status = None + if nic_id and "Lif id" in line and ":" in line: + lif_id = line.split(":", 1)[1].strip() + if nic_id and "ROCE device" in line and ":" in line: + roce_device = line.split(":", 1)[1].strip() + if nic_id and "DCQCN profile id" in line and ":" in line: + dcqcn_profile_id = line.split(":", 1)[1].strip() + if nic_id and "Status" in line and ":" in line: + status = line.split(":", 1)[1].strip() + entries.append( + PensandoNicDcqcn( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + lif_id=lif_id, + roce_device=roce_device, + dcqcn_profile_id=dcqcn_profile_id, + status=status, + ) + ) + return entries + + def _parse_nicctl_environment(self, stdout: str) -> List[PensandoNicEnvironment]: + """Parse nicctl show environment (text) into PensandoNicEnvironment list.""" + entries: List[PensandoNicEnvironment] = [] + nic_id = pcie_bdf = None + data: Dict[str, Optional[float]] = {} + for line in stdout.splitlines(): + if "NIC :" in line or "NIC:" in line: + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)", line) + if m: + if nic_id and pcie_bdf: + entries.append( + PensandoNicEnvironment( + nic_id=nic_id, + pcie_bdf=pcie_bdf, + total_power_drawn=data.get("total_power_drawn"), + core_power=data.get("core_power"), + arm_power=data.get("arm_power"), + local_board_temperature=data.get("local_board_temperature"), + die_temperature=data.get("die_temperature"), + input_voltage=data.get("input_voltage"), + core_voltage=data.get("core_voltage"), + core_frequency=data.get("core_frequency"), + cpu_frequency=data.get("cpu_frequency"), + p4_stage_frequency=data.get("p4_stage_frequency"), + ) + ) + nic_id, pcie_bdf = 
m.group(1).strip(), m.group(2).strip() + data = {} + if nic_id and ":" in line: + key, _, val = line.partition(":") + key, val = key.strip().lower(), val.strip() + try: + v = float(val) + if "total power" in key or "pin" in key: + data["total_power_drawn"] = v + elif "core power" in key or "pout1" in key: + data["core_power"] = v + elif "arm power" in key or "pout2" in key: + data["arm_power"] = v + elif "local board" in key: + data["local_board_temperature"] = v + elif "die temperature" in key: + data["die_temperature"] = v + elif "input voltage" in key: + data["input_voltage"] = v + elif "core voltage" in key: + data["core_voltage"] = v + elif "core frequency" in key: + data["core_frequency"] = v + elif "cpu frequency" in key: + data["cpu_frequency"] = v + elif "p4 stage" in key: + data["p4_stage_frequency"] = v + except ValueError: + pass + if nic_id and pcie_bdf: + entries.append( + PensandoNicEnvironment( + nic_id=nic_id, + pcie_bdf=pcie_bdf, + total_power_drawn=data.get("total_power_drawn"), + core_power=data.get("core_power"), + arm_power=data.get("arm_power"), + local_board_temperature=data.get("local_board_temperature"), + die_temperature=data.get("die_temperature"), + input_voltage=data.get("input_voltage"), + core_voltage=data.get("core_voltage"), + core_frequency=data.get("core_frequency"), + cpu_frequency=data.get("cpu_frequency"), + p4_stage_frequency=data.get("p4_stage_frequency"), + ) + ) + return entries + + def _parse_nicctl_lif(self, stdout: str) -> List[PensandoNicLif]: + """Parse nicctl show lif (text) into PensandoNicLif list.""" + entries: List[PensandoNicLif] = [] + nic_id = pcie_bdf = None + for line in stdout.splitlines(): + if "NIC " in line and ":" in line and "(" in line: + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)", line) + if m: + nic_id, pcie_bdf = m.group(1).strip(), m.group(2).strip() + if "LIF :" in line or "Lif :" in line or "Lif:" in line: + rest = line.split(":", 1)[-1].strip() + lif_match = 
re.match(r"([0-9a-f-]{36})\s*\(([^)]*)\)", rest) + if lif_match and nic_id: + lif_id, lif_name = lif_match.group(1), lif_match.group(2).strip() + entries.append( + PensandoNicLif( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + lif_id=lif_id, + lif_name=lif_name or None, + ) + ) + elif re.match(r"^[0-9a-f-]{36}$", rest.strip()) and nic_id: + entries.append( + PensandoNicLif( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + lif_id=rest.strip(), + lif_name=None, + ) + ) + return entries + + def _parse_nicctl_pcie_ats(self, stdout: str) -> List[PensandoNicPcieAts]: + """Parse nicctl show pcie ats (text) into PensandoNicPcieAts list.""" + entries: List[PensandoNicPcieAts] = [] + for line in stdout.splitlines(): + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)\s*:\s*(\w+)", line) + if m: + entries.append( + PensandoNicPcieAts( + nic_id=m.group(1).strip(), + pcie_bdf=m.group(2).strip(), + status=m.group(3).strip(), + ) + ) + return entries + + def _parse_nicctl_port(self, stdout: str) -> List[PensandoNicPort]: + """Parse nicctl show port (text) into PensandoNicPort list.""" + entries: List[PensandoNicPort] = [] + nic_id = pcie_bdf = None + port_id = port_name = None + spec_speed = status_operational_status = None + for line in stdout.splitlines(): + if "NIC " in line and ":" in line and "(" in line: + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)", line) + if m: + nic_id, pcie_bdf = m.group(1).strip(), m.group(2).strip() + port_id = port_name = None + if "Port :" in line or "Port:" in line: + if nic_id and port_id is not None: + entries.append( + PensandoNicPort( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + port_id=port_id, + port_name=port_name or port_id, + spec_speed=spec_speed, + status_operational_status=status_operational_status, + ) + ) + rest = line.split(":", 1)[-1].strip() + port_match = re.match(r"([0-9a-f-]{36})\s*\(([^)]+)\)", rest) + if port_match: + port_id, port_name = port_match.group(1), port_match.group(2) + else: + port_id = rest if 
re.match(r"^[0-9a-f-]{36}$", rest.strip()) else None + port_name = "" + spec_speed = status_operational_status = None + if ( + nic_id + and "speed" in line + and ":" in line + and "Spec" not in line + and "Advertised" not in line + ): + spec_speed = line.split(":", 1)[1].strip() + if nic_id and "Operational status" in line and ":" in line: + status_operational_status = line.split(":", 1)[1].strip() + if nic_id and port_id is not None: + entries.append( + PensandoNicPort( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + port_id=port_id, + port_name=port_name or port_id, + spec_speed=spec_speed, + status_operational_status=status_operational_status, + ) + ) + return entries + + def _parse_nicctl_qos(self, stdout: str) -> List[PensandoNicQos]: + """Parse nicctl show qos (text) into PensandoNicQos list.""" + entries: List[PensandoNicQos] = [] + nic_id = pcie_bdf = port_id = None + classification_type = None + scheduling: List[PensandoNicQosScheduling] = [] + for line in stdout.splitlines(): + if "NIC " in line and "(" in line: + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)", line) + if m: + nic_id, pcie_bdf = m.group(1).strip(), m.group(2).strip() + port_id = None + scheduling = [] + if "Port :" in line: + port_match = re.search(r"([0-9a-f-]{36})", line) + port_id = port_match.group(1) if port_match else "" + if "Classification type" in line and ":" in line: + classification_type = line.split(":", 1)[1].strip() + if "DWRR" in line or "Scheduling" in line: + parts = line.split() + if len(parts) >= 3: + try: + prio = int(parts[0]) + sched_type = parts[1] if len(parts) > 1 else None + bw = int(parts[2]) if parts[2].isdigit() else None + rate = parts[3] if len(parts) > 3 else None + scheduling.append( + PensandoNicQosScheduling( + priority=prio, + scheduling_type=sched_type, + bandwidth=bw, + rate_limit=rate, + ) + ) + except (ValueError, IndexError): + pass + if nic_id and port_id and (classification_type is not None or scheduling): + entries.append( + PensandoNicQos( + 
nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + port_id=port_id, + classification_type=classification_type, + scheduling=scheduling, + ) + ) + return entries + + def _parse_nicctl_rdma_statistics(self, stdout: str) -> List[PensandoNicRdmaStatistics]: + """Parse nicctl show rdma statistics (text) into PensandoNicRdmaStatistics list.""" + entries: List[PensandoNicRdmaStatistics] = [] + nic_id = pcie_bdf = None + stats: List[PensandoNicRdmaStatistic] = [] + for line in stdout.splitlines(): + if "NIC :" in line or "NIC:" in line: + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)", line) + if m: + if nic_id and stats: + entries.append( + PensandoNicRdmaStatistics( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + statistics=stats, + ) + ) + nic_id, pcie_bdf = m.group(1).strip(), m.group(2).strip() + stats = [] + if nic_id and ":" in line and "NIC" not in line: + key, _, val = line.partition(":") + name, val = key.strip(), val.strip() + try: + count = int(val) + stats.append(PensandoNicRdmaStatistic(name=name, count=count)) + except ValueError: + pass + if nic_id and stats: + entries.append( + PensandoNicRdmaStatistics( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + statistics=stats, + ) + ) + return entries + + def _parse_nicctl_version_host_software( + self, stdout: str + ) -> Optional[PensandoNicVersionHostSoftware]: + """Parse nicctl show version host-software (text).""" + if not stdout or not stdout.strip(): + return None + version = ipc_driver = ionic_driver = None + for line in stdout.splitlines(): + if ":" in line: + key, _, val = line.partition(":") + key, val = key.strip().lower(), val.strip() + if "nicctl" in key: + version = val + elif "ipc" in key: + ipc_driver = val + elif "ionic" in key: + ionic_driver = val + return PensandoNicVersionHostSoftware( + version=version, + ipc_driver=ipc_driver, + ionic_driver=ionic_driver, + ) + + def _parse_nicctl_version_firmware(self, stdout: str) -> List[PensandoNicVersionFirmware]: + """Parse nicctl show version firmware (text) 
into PensandoNicVersionFirmware list.""" + entries: List[PensandoNicVersionFirmware] = [] + nic_id = pcie_bdf = None + cpld = boot0 = uboot_a = firmware_a = device_config_a = None + for line in stdout.splitlines(): + if "NIC :" in line or "NIC:" in line: + m = re.search(r"NIC\s*:\s*([^\s(]+)\s*\(([^)]+)\)", line) + if m: + if nic_id: + entries.append( + PensandoNicVersionFirmware( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + cpld=cpld, + boot0=boot0, + uboot_a=uboot_a, + firmware_a=firmware_a, + device_config_a=device_config_a, + ) + ) + nic_id, pcie_bdf = m.group(1).strip(), m.group(2).strip() + cpld = boot0 = uboot_a = firmware_a = device_config_a = None + if nic_id and ":" in line: + key, _, val = line.partition(":") + key, val = key.strip().lower(), val.strip() + if "cpld" in key: + cpld = val + elif "boot0" in key: + boot0 = val + elif "uboot-a" in key or "uboot_a" in key: + uboot_a = val + elif "firmware-a" in key or "firmware_a" in key: + firmware_a = val + elif "device config" in key or "device_config" in key: + device_config_a = val + if nic_id: + entries.append( + PensandoNicVersionFirmware( + nic_id=nic_id, + pcie_bdf=pcie_bdf or "", + cpld=cpld, + boot0=boot0, + uboot_a=uboot_a, + firmware_a=firmware_a, + device_config_a=device_config_a, + ) + ) + return entries diff --git a/nodescraper/plugins/inband/nic/nic_data.py b/nodescraper/plugins/inband/nic/nic_data.py new file mode 100644 index 00000000..40e16216 --- /dev/null +++ b/nodescraper/plugins/inband/nic/nic_data.py @@ -0,0 +1,411 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +import re +from typing import Any, Dict, List, Optional + +from pydantic import BaseModel, Field + +from nodescraper.models import DataModel + + +class NicCtlCardShow(BaseModel): + """Outputs from global 'nicctl show card *' commands (flash, interrupts, logs, profile, time, statistics).""" + + flash_partition: Optional[Any] = None + interrupts: Optional[Any] = None + logs_non_persistent: Optional[str] = None + logs_boot_fault: Optional[str] = None + logs_persistent: Optional[str] = None + profile: Optional[Any] = None + time: Optional[Any] = None + statistics_packet_buffer_summary: Optional[Any] = None + + +class NicCtlCard(BaseModel): + """Per-card data: identity from 'nicctl show card' plus per-card commands (hardware-config, dcqcn).""" + + card_id: str + info: Optional[Any] = Field( + default=None, description="Card entry from nicctl show card --json list." + ) + hardware_config: Optional[str] = Field( + default=None, description="Raw stdout from nicctl show card hardware-config --card {id}." + ) + dcqcn: Optional[Any] = Field( + default=None, description="Parsed JSON from nicctl show dcqcn --card {id} --json." + ) + + +class NicCtlPort(BaseModel): + """Outputs from 'nicctl show port *' commands.""" + + port: Optional[Any] = Field(default=None, description="Parsed from nicctl show port --json.") + port_fsm: Optional[str] = Field( + default=None, description="Raw stdout from nicctl show port fsm." + ) + port_transceiver: Optional[Any] = Field( + default=None, description="Parsed from nicctl show port transceiver --json." + ) + port_statistics: Optional[Any] = Field( + default=None, description="Parsed from nicctl show port statistics --json." + ) + port_internal_mac: Optional[str] = Field( + default=None, description="Raw stdout from nicctl show port internal mac." 
+ ) + + +class NicCtlLif(BaseModel): + """Outputs from 'nicctl show lif *' commands.""" + + lif: Optional[Any] = Field(default=None, description="Parsed from nicctl show lif --json.") + lif_statistics: Optional[Any] = Field( + default=None, description="Parsed from nicctl show lif statistics --json." + ) + lif_internal_queue_to_ud_pinning: Optional[str] = Field( + default=None, + description="Raw stdout from nicctl show lif internal queue-to-ud-pinning.", + ) + + +class NicCtlQos(BaseModel): + """Outputs from 'nicctl show qos *' commands.""" + + qos: Optional[Any] = Field(default=None, description="Parsed from nicctl show qos --json.") + qos_headroom: Optional[Any] = Field( + default=None, description="Parsed from nicctl show qos headroom --json." + ) + + +class NicCtlRdma(BaseModel): + """Outputs from 'nicctl show rdma *' commands.""" + + rdma_queue: Optional[Any] = Field( + default=None, description="Parsed from nicctl show rdma queue --json." + ) + rdma_queue_pair_detail: Optional[Any] = Field( + default=None, + description="Parsed from nicctl show rdma queue-pair --detail --json.", + ) + rdma_statistics: Optional[Any] = Field( + default=None, description="Parsed from nicctl show rdma statistics --json." + ) + + +class NicCtlDcqcn(BaseModel): + """Global DCQCN output; per-card DCQCN is in NicCtlCard.dcqcn.""" + + dcqcn_global: Optional[Any] = Field( + default=None, description="Parsed from nicctl show dcqcn --json." + ) + + +class NicCtlEnvironment(BaseModel): + """Output from 'nicctl show environment'.""" + + environment: Optional[Any] = None + + +class NicCtlVersion(BaseModel): + """Version outputs from nicctl.""" + + version: Optional[str] = Field(default=None, description="Raw stdout from nicctl --version.") + version_firmware: Optional[str] = Field( + default=None, description="Raw stdout from nicctl show version firmware." 
+ ) + + +class NicCliDevice(BaseModel): + """NIC device from niccli --list_devices (Broadcom).""" + + device_num: int + model: Optional[str] = None + adapter_port: Optional[str] = None + interface_name: Optional[str] = None + mac_address: Optional[str] = None + pci_address: Optional[str] = None + + +class NicCliQosAppEntry(BaseModel): + """APP TLV entry in niccli QoS output (Broadcom).""" + + priority: Optional[int] = None + sel: Optional[int] = None + dscp: Optional[int] = None + protocol: Optional[str] = None + port: Optional[int] = None + + +class NicCliQos(BaseModel): + """NIC QoS from niccli -dev X getqos / qos --ets --show (Broadcom).""" + + device_num: int + raw_output: str + prio_map: Dict[int, int] = Field(default_factory=dict) + tc_bandwidth: List[int] = Field(default_factory=list) + tsa_map: Dict[int, str] = Field(default_factory=dict) + pfc_enabled: Optional[int] = None + app_entries: List[NicCliQosAppEntry] = Field(default_factory=list) + tc_rate_limit: List[int] = Field(default_factory=list) + + +class PensandoNicCard(BaseModel): + """Pensando NIC card from nicctl show card (text).""" + + id: str + pcie_bdf: str + asic: Optional[str] = None + fw_partition: Optional[str] = None + serial_number: Optional[str] = None + + +class PensandoNicDcqcn(BaseModel): + """Pensando NIC DCQCN from nicctl show dcqcn (text).""" + + nic_id: str + pcie_bdf: str + lif_id: Optional[str] = None + roce_device: Optional[str] = None + dcqcn_profile_id: Optional[str] = None + status: Optional[str] = None + + +class PensandoNicEnvironment(BaseModel): + """Pensando NIC environment from nicctl show environment (text).""" + + nic_id: str + pcie_bdf: str + total_power_drawn: Optional[float] = None + core_power: Optional[float] = None + arm_power: Optional[float] = None + local_board_temperature: Optional[float] = None + die_temperature: Optional[float] = None + input_voltage: Optional[float] = None + core_voltage: Optional[float] = None + core_frequency: Optional[float] = None + 
cpu_frequency: Optional[float] = None + p4_stage_frequency: Optional[float] = None + + +class PensandoNicPcieAts(BaseModel): + """Pensando NIC PCIe ATS from nicctl show pcie ats (text).""" + + nic_id: str + pcie_bdf: str + status: str + + +class PensandoNicLif(BaseModel): + """Pensando NIC LIF from nicctl show lif (text).""" + + nic_id: str + pcie_bdf: str + lif_id: str + lif_name: Optional[str] = None + + +class PensandoNicPort(BaseModel): + """Pensando NIC port from nicctl show port (text).""" + + nic_id: str + pcie_bdf: str + port_id: str + port_name: str + spec_ifindex: Optional[str] = None + spec_type: Optional[str] = None + spec_speed: Optional[str] = None + spec_admin_state: Optional[str] = None + spec_fec_type: Optional[str] = None + spec_pause_type: Optional[str] = None + spec_num_lanes: Optional[int] = None + spec_mtu: Optional[int] = None + spec_tx_pause: Optional[str] = None + spec_rx_pause: Optional[str] = None + spec_auto_negotiation: Optional[str] = None + status_physical_port: Optional[int] = None + status_operational_status: Optional[str] = None + status_link_fsm_state: Optional[str] = None + status_fec_type: Optional[str] = None + status_cable_type: Optional[str] = None + status_num_lanes: Optional[int] = None + status_speed: Optional[str] = None + status_auto_negotiation: Optional[str] = None + status_mac_id: Optional[int] = None + status_mac_channel: Optional[int] = None + status_mac_address: Optional[str] = None + status_transceiver_type: Optional[str] = None + status_transceiver_state: Optional[str] = None + status_transceiver_pid: Optional[str] = None + + +class PensandoNicQosScheduling(BaseModel): + """QoS Scheduling entry.""" + + priority: int + scheduling_type: Optional[str] = None + bandwidth: Optional[int] = None + rate_limit: Optional[str] = None + + +class PensandoNicQos(BaseModel): + """Pensando NIC QoS from nicctl show qos (text).""" + + nic_id: str + pcie_bdf: str + port_id: str + classification_type: Optional[str] = None + 
dscp_bitmap: Optional[str] = None + dscp_range: Optional[str] = None + dscp_priority: Optional[int] = None + pfc_priority_bitmap: Optional[str] = None + pfc_no_drop_priorities: Optional[str] = None + scheduling: List[PensandoNicQosScheduling] = Field(default_factory=list) + + +class PensandoNicRdmaStatistic(BaseModel): + """RDMA statistic entry.""" + + name: str + count: int + + +class PensandoNicRdmaStatistics(BaseModel): + """Pensando NIC RDMA statistics from nicctl show rdma statistics (text).""" + + nic_id: str + pcie_bdf: str + statistics: List[PensandoNicRdmaStatistic] = Field(default_factory=list) + + +class PensandoNicVersionHostSoftware(BaseModel): + """Pensando NIC host software version from nicctl show version host-software.""" + + version: Optional[str] = None + ipc_driver: Optional[str] = None + ionic_driver: Optional[str] = None + + +class PensandoNicVersionFirmware(BaseModel): + """Pensando NIC firmware version from nicctl show version firmware (text).""" + + nic_id: str + pcie_bdf: str + cpld: Optional[str] = None + boot0: Optional[str] = None + uboot_a: Optional[str] = None + firmware_a: Optional[str] = None + device_config_a: Optional[str] = None + + +def command_to_canonical_key(command: str) -> str: + """Turn a full command string into a stable key. + + E.g. 'nicctl show card --json' -> 'nicctl_show_card_json', + 'nicctl show dcqcn --card 0 --json' -> 'nicctl_show_dcqcn_card_0_json'. 
+ """ + s = command.strip().lower() + s = re.sub(r"\s+", "_", s) + s = re.sub(r"--+", "_", s) + s = s.strip("_") + s = re.sub(r"_+", "_", s) + return s or "unknown" + + +class NicCommandResult(BaseModel): + """Result of a single niccli/nicctl command run.""" + + command: str + stdout: str = "" + stderr: str = "" + exit_code: int = 0 + + @property + def succeeded(self) -> bool: + """True if the command exited with code 0.""" + return self.exit_code == 0 + + +class NicDataModel(DataModel): + """Collected output of niccli (Broadcom) and nicctl (Pensando) commands.""" + + results: Dict[str, NicCommandResult] = Field(default_factory=dict) + + # Structured by domain (parsed from command output in collector) + card_show: Optional[NicCtlCardShow] = Field( + default=None, description="Global nicctl show card * outputs." + ) + cards: List[NicCtlCard] = Field( + default_factory=list, description="Per-card data (card list + hardware-config, dcqcn)." + ) + port: Optional[NicCtlPort] = None + lif: Optional[NicCtlLif] = None + qos: Optional[NicCtlQos] = None + rdma: Optional[NicCtlRdma] = None + dcqcn: Optional[NicCtlDcqcn] = None + environment: Optional[NicCtlEnvironment] = None + version: Optional[NicCtlVersion] = None + + broadcom_nic_devices: List[NicCliDevice] = Field(default_factory=list) + broadcom_nic_qos: Dict[int, NicCliQos] = Field(default_factory=dict) + broadcom_nic_support_rdma: Dict[int, str] = Field( + default_factory=dict, + description="Per-device output of 'niccli -dev X nvm -getoption support_rdma -scope 0' (device_num -> raw stdout).", + ) + broadcom_nic_performance_profile: Dict[int, str] = Field( + default_factory=dict, + description="Per-device output of 'niccli -dev X nvm -getoption performance_profile' (device_num -> raw stdout).", + ) + broadcom_nic_pcie_relaxed_ordering: Dict[int, str] = Field( + default_factory=dict, + description="Per-device output of 'niccli -dev X nvm -getoption pcie_relaxed_ordering' (device_num -> raw stdout).", + ) + 
pensando_nic_cards: List[PensandoNicCard] = Field(default_factory=list) + pensando_nic_dcqcn: List[PensandoNicDcqcn] = Field(default_factory=list) + pensando_nic_environment: List[PensandoNicEnvironment] = Field(default_factory=list) + pensando_nic_lif: List[PensandoNicLif] = Field(default_factory=list) + pensando_nic_pcie_ats: List[PensandoNicPcieAts] = Field(default_factory=list) + pensando_nic_ports: List[PensandoNicPort] = Field(default_factory=list) + pensando_nic_qos: List[PensandoNicQos] = Field(default_factory=list) + pensando_nic_rdma_statistics: List[PensandoNicRdmaStatistics] = Field(default_factory=list) + pensando_nic_version_host_software: Optional[PensandoNicVersionHostSoftware] = None + pensando_nic_version_firmware: List[PensandoNicVersionFirmware] = Field(default_factory=list) + + # Raw nicctl card log output for regex-based error detection + nicctl_card_logs: Optional[Dict[str, str]] = Field( + default=None, + description="Log text from 'nicctl show card logs --boot-fault', --persistent, --non-persistent (keys: boot_fault, persistent, non_persistent).", + ) + + def command_succeeded(self, command: str) -> bool: + """Return True if the command ran and exited with code 0.""" + r = self.results.get(command) + return r is not None and r.succeeded + + def get_card(self, card_id: str) -> Optional[NicCtlCard]: + """Return the per-card data for the given card id.""" + for c in self.cards: + if c.card_id == card_id: + return c + return None diff --git a/nodescraper/plugins/inband/nic/nic_plugin.py b/nodescraper/plugins/inband/nic/nic_plugin.py new file mode 100644 index 00000000..b26ac77b --- /dev/null +++ b/nodescraper/plugins/inband/nic/nic_plugin.py @@ -0,0 +1,46 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +from nodescraper.base import InBandDataPlugin + +from .analyzer_args import NicAnalyzerArgs +from .collector_args import NicCollectorArgs +from .nic_analyzer import NicAnalyzer +from .nic_collector import NicCollector +from .nic_data import NicDataModel + + +class NicPlugin(InBandDataPlugin[NicDataModel, NicCollectorArgs, NicAnalyzerArgs]): + """Plugin for collecting niccli (Broadcom) and nicctl (Pensando) command output. + + Data is parsed into structured fields (card_show, cards, port, lif, qos, etc.). + The analyzer checks Broadcom support_rdma (niccli -dev x nvm -getoption support_rdma -scope 0). 
+ """ + + DATA_MODEL = NicDataModel + COLLECTOR = NicCollector + COLLECTOR_ARGS = NicCollectorArgs + ANALYZER = NicAnalyzer + ANALYZER_ARGS = NicAnalyzerArgs diff --git a/nodescraper/plugins/inband/rdma/__init__.py b/nodescraper/plugins/inband/rdma/__init__.py new file mode 100644 index 00000000..5c7cc181 --- /dev/null +++ b/nodescraper/plugins/inband/rdma/__init__.py @@ -0,0 +1,28 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from .rdma_plugin import RdmaPlugin + +__all__ = ["RdmaPlugin"] diff --git a/nodescraper/plugins/inband/rdma/rdma_analyzer.py b/nodescraper/plugins/inband/rdma/rdma_analyzer.py new file mode 100644 index 00000000..163602b0 --- /dev/null +++ b/nodescraper/plugins/inband/rdma/rdma_analyzer.py @@ -0,0 +1,186 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from typing import Optional + +from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus +from nodescraper.interfaces import DataAnalyzer +from nodescraper.models import TaskResult + +from .rdmadata import RdmaDataModel + + +class RdmaAnalyzer(DataAnalyzer[RdmaDataModel, None]): + """Check RDMA statistics for errors (RoCE and other RDMA error counters).""" + + DATA_MODEL = RdmaDataModel + + # Error fields checked from rdma statistic output (bnxt_re, mlx5, ionic, etc.) + ERROR_FIELDS = [ + "recoverable_errors", + "tx_roce_errors", + "tx_roce_discards", + "rx_roce_errors", + "rx_roce_discards", + "local_ack_timeout_err", + "packet_seq_err", + "max_retry_exceeded", + "rnr_nak_retry_err", + "implied_nak_seq_err", + "unrecoverable_err", + "bad_resp_err", + "local_qp_op_err", + "local_protection_err", + "mem_mgmt_op_err", + "req_remote_invalid_request", + "req_remote_access_errors", + "remote_op_err", + "duplicate_request", + "res_exceed_max", + "resp_local_length_error", + "res_exceeds_wqe", + "res_opcode_err", + "res_rx_invalid_rkey", + "res_rx_domain_err", + "res_rx_no_perm", + "res_rx_range_err", + "res_tx_invalid_rkey", + "res_tx_domain_err", + "res_tx_no_perm", + "res_tx_range_err", + "res_irrq_oflow", + "res_unsup_opcode", + "res_unaligned_atomic", + "res_rem_inv_err", + "res_mem_err", + "res_srq_err", + "res_cmp_err", + "res_invalid_dup_rkey", + "res_wqe_format_err", + "res_cq_load_err", + "res_srq_load_err", + "res_tx_pci_err", + "res_rx_pci_err", + "out_of_buffer", + "out_of_sequence", + "req_cqe_error", + "req_cqe_flush_error", + "resp_cqe_error", + "resp_cqe_flush_error", + "resp_remote_access_errors", + "req_rx_pkt_seq_err", + "req_rx_rnr_retry_err", + "req_rx_rmt_acc_err", + "req_rx_rmt_req_err", + "req_rx_oper_err", + "req_rx_impl_nak_seq_err", + "req_rx_cqe_err", + "req_rx_cqe_flush", + "req_rx_dup_response", + "req_rx_inval_pkts", + 
"req_tx_loc_acc_err", + "req_tx_loc_oper_err", + "req_tx_mem_mgmt_err", + "req_tx_retry_excd_err", + "req_tx_loc_sgl_inv_err", + "resp_rx_dup_request", + "resp_rx_outof_buf", + "resp_rx_outouf_seq", + "resp_rx_cqe_err", + "resp_rx_cqe_flush", + "resp_rx_loc_len_err", + "resp_rx_inval_request", + "resp_rx_loc_oper_err", + "resp_rx_outof_atomic", + "resp_tx_pkt_seq_err", + "resp_tx_rmt_inval_req_err", + "resp_tx_rmt_acc_err", + "resp_tx_rmt_oper_err", + "resp_tx_rnr_retry_err", + "resp_tx_loc_sgl_inv_err", + "resp_rx_s0_table_err", + "resp_rx_ccl_cts_outouf_seq", + "tx_rdma_ack_timeout", + "tx_rdma_ccl_cts_ack_timeout", + "rx_rdma_mtu_discard_pkts", + ] + + CRITICAL_ERROR_FIELDS = [ + "unrecoverable_err", + "res_tx_pci_err", + "res_rx_pci_err", + "res_mem_err", + ] + + def analyze_data(self, data: RdmaDataModel, args: Optional[None] = None) -> TaskResult: + """Analyze RDMA statistics for non-zero error counters. + + Args: + data: RDMA data model with statistic_list (and optionally link_list). + args: Unused (analyzer has no configurable args). + + Returns: + TaskResult with status OK if no errors, ERROR if any error counter > 0. 
+ """ + if not data.statistic_list: + self.result.message = "No RDMA devices found" + self.result.status = ExecutionStatus.WARNING + return self.result + + error_state = False + for idx, stat in enumerate(data.statistic_list): + errors_on_interface = [] # (error_field, value, is_critical) + for error_field in self.ERROR_FIELDS: + value = getattr(stat, error_field, None) + if value is not None and value > 0: + is_critical = error_field in self.CRITICAL_ERROR_FIELDS + errors_on_interface.append((error_field, value, is_critical)) + if errors_on_interface: + error_state = True + interface_label = stat.ifname or "unknown" + error_names = [e[0] for e in errors_on_interface] + any_critical = any(e[2] for e in errors_on_interface) + priority = EventPriority.CRITICAL if any_critical else EventPriority.ERROR + errors_data = {field: value for field, value, _ in errors_on_interface} + self._log_event( + category=EventCategory.IO, + description=f"RDMA error detected on {interface_label}: [{', '.join(error_names)}]", + data={ + "interface": stat.ifname, + "port": stat.port, + "errors": errors_data, + "statistic_index": idx, + }, + priority=priority, + console_log=True, + ) + + if error_state: + self.result.message = "RDMA errors detected in statistics" + self.result.status = ExecutionStatus.ERROR + else: + self.result.message = "No RDMA errors detected in statistics" + self.result.status = ExecutionStatus.OK + return self.result diff --git a/nodescraper/plugins/inband/rdma/rdma_collector.py b/nodescraper/plugins/inband/rdma/rdma_collector.py new file mode 100644 index 00000000..b5e01b5c --- /dev/null +++ b/nodescraper/plugins/inband/rdma/rdma_collector.py @@ -0,0 +1,315 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+#
+###############################################################################
+import json
+import re
+from typing import Optional
+
+from pydantic import ValidationError
+
+from nodescraper.base import InBandDataCollector
+from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus, OSFamily
+from nodescraper.models import TaskResult
+from nodescraper.utils import get_exception_traceback
+
+from .rdmadata import RdmaDataModel, RdmaDevice, RdmaLink, RdmaLinkText, RdmaStatistics
+
+
+class RdmaCollector(InBandDataCollector[RdmaDataModel, None]):
+    """Collect RDMA link, device, and statistics data via the 'rdma' command-line tool."""
+
+    DATA_MODEL = RdmaDataModel
+    SUPPORTED_OS_FAMILY = {OSFamily.LINUX}
+
+    CMD_LINK = "rdma link -j"
+    CMD_STATISTIC = "rdma statistic -j"
+    CMD_RDMA_DEV = "rdma dev"
+    CMD_RDMA_LINK = "rdma link"
+
+    def _run_rdma_command(self, cmd: str) -> Optional[list[dict]]:
+        """Run an rdma command that produces JSON output.
+
+        Args:
+            cmd: Full command string (e.g. CMD_LINK or CMD_STATISTIC).
+
+        Returns:
+            List of dicts from JSON output, or None on failure.
+        """
+        res = self._run_sut_cmd(cmd)
+
+        if res.exit_code != 0:
+            self._log_event(
+                category=EventCategory.NETWORK,
+                description=f"Error running rdma command: {cmd}",
+                data={
+                    "command": cmd,
+                    "exit_code": res.exit_code,
+                    "stderr": res.stderr,
+                },
+                priority=EventPriority.ERROR,
+                console_log=True,
+            )
+            return None
+
+        if not res.stdout.strip():
+            return []
+
+        try:
+            return json.loads(res.stdout)
+        except json.JSONDecodeError as e:
+            self._log_event(
+                category=EventCategory.NETWORK,
+                description=f"Error parsing JSON output of rdma command: {cmd}",
+                data={
+                    "cmd": cmd,
+                    "exception": get_exception_traceback(e),
+                },
+                priority=EventPriority.ERROR,
+                console_log=True,
+            )
+            return None
+
+    def _parse_rdma_dev(self, output: str) -> list[RdmaDevice]:
+        """Parse 'rdma dev' output into RdmaDevice objects."""
+        devices = []
+        for line in output.splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split()
+            if len(parts) < 2:
+                continue
+            device_name = None
+            start_idx = 0
+            if parts[0].endswith(":"):
+                start_idx = 1
+            if start_idx < len(parts):
+                device_name = parts[start_idx].rstrip(":")
+                start_idx += 1
+            if not device_name:
+                continue
+            device = RdmaDevice(device=device_name)
+            i = start_idx
+            while i < len(parts):
+                if parts[i] == "node_type" and i + 1 < len(parts):
+                    device.node_type = parts[i + 1]
+                    i += 2
+                elif parts[i] == "fw" and i + 1 < len(parts):
+                    device.attributes["fw_version"] = parts[i + 1]
+                    i += 2
+                elif parts[i] == "node_guid" and i + 1 < len(parts):
+                    device.node_guid = parts[i + 1]
+                    i += 2
+                elif parts[i] == "sys_image_guid" and i + 1 < len(parts):
+                    device.sys_image_guid = parts[i + 1]
+                    i += 2
+                elif parts[i] == "state" and i + 1 < len(parts):
+                    device.state = parts[i + 1]
+                    i += 2
+                else:
+                    if i + 1 < len(parts) and not parts[i + 1].startswith("-"):
+                        device.attributes[parts[i]] = parts[i + 1]
+                        i += 2
+                    else:
+                        i += 1
+            devices.append(device)
+        return devices
+
+    def _parse_rdma_link_text(self, output: str) -> 
list[RdmaLinkText]: + """Parse 'rdma link' (text) output into RdmaLinkText objects.""" + links = [] + for line in output.splitlines(): + line = line.strip() + if not line: + continue + match = re.search(r"(\S+)/(\d+)", line) + if not match: + continue + device_name = match.group(1) + port = int(match.group(2)) + link = RdmaLinkText(device=device_name, port=port) + parts = line.split() + i = 0 + while i < len(parts): + if parts[i] == "state" and i + 1 < len(parts): + link.state = parts[i + 1] + i += 2 + elif parts[i] == "physical_state" and i + 1 < len(parts): + link.physical_state = parts[i + 1] + i += 2 + elif parts[i] == "netdev" and i + 1 < len(parts): + link.netdev = parts[i + 1] + i += 2 + else: + if i + 1 < len(parts) and not parts[i + 1].startswith("-"): + link.attributes[parts[i]] = parts[i + 1] + i += 2 + else: + i += 1 + links.append(link) + return links + + def _get_rdma_statistics(self) -> Optional[list[RdmaStatistics]]: + """Get RDMA statistics from 'rdma statistic -j'.""" + stat_data = self._run_rdma_command(self.CMD_STATISTIC) + if stat_data is None: + return None + if not stat_data: + return [] + + try: + statistics = [] + for stat in stat_data: + if not isinstance(stat, dict): + self._log_event( + category=EventCategory.NETWORK, + description="Invalid data type for RDMA statistic", + data={"data_type": type(stat).__name__}, + priority=EventPriority.WARNING, + ) + continue + statistics.append(RdmaStatistics(**stat)) + except ValidationError as e: + self._log_event( + category=EventCategory.NETWORK, + description="Failed to build RdmaStatistics model", + data={"exception": get_exception_traceback(e)}, + priority=EventPriority.WARNING, + ) + return statistics + + def _get_rdma_link(self) -> Optional[list[RdmaLink]]: + """Get RDMA link data from 'rdma link -j'.""" + link_data = self._run_rdma_command(self.CMD_LINK) + if link_data is None: + return None + if not link_data: + return [] + + try: + links = [] + for link in link_data: + if not 
isinstance(link, dict): + self._log_event( + category=EventCategory.NETWORK, + description="Invalid data type for RDMA link", + data={"data_type": type(link).__name__}, + priority=EventPriority.WARNING, + ) + continue + links.append(RdmaLink(**link)) + return links + except ValidationError as e: + self._log_event( + category=EventCategory.NETWORK, + description="Failed to build RdmaLink model", + data={"exception": get_exception_traceback(e)}, + priority=EventPriority.WARNING, + ) + return links + + def collect_data(self, args: None = None) -> tuple[TaskResult, Optional[RdmaDataModel]]: + """Collect RDMA statistics, link data, and device/link text output. + + Returns: + Task result and RdmaDataModel, or None if all commands failed. + """ + try: + links = self._get_rdma_link() + statistics = self._get_rdma_statistics() + + dev_list: list[RdmaDevice] = [] + res_rdma_dev = self._run_sut_cmd(self.CMD_RDMA_DEV) + if res_rdma_dev.exit_code == 0: + dev_list = self._parse_rdma_dev(res_rdma_dev.stdout) + self._log_event( + category=EventCategory.NETWORK, + description=f"Collected {len(dev_list)} RDMA devices from 'rdma dev'", + priority=EventPriority.INFO, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description="Error or no output from 'rdma dev'", + data={"command": self.CMD_RDMA_DEV, "exit_code": res_rdma_dev.exit_code}, + priority=EventPriority.WARNING, + ) + + link_list_text: list[RdmaLinkText] = [] + res_rdma_link = self._run_sut_cmd(self.CMD_RDMA_LINK) + if res_rdma_link.exit_code == 0: + link_list_text = self._parse_rdma_link_text(res_rdma_link.stdout) + self._log_event( + category=EventCategory.NETWORK, + description=f"Collected {len(link_list_text)} RDMA links from 'rdma link'", + priority=EventPriority.INFO, + ) + else: + self._log_event( + category=EventCategory.NETWORK, + description="Error or no output from 'rdma link'", + data={"command": self.CMD_RDMA_LINK, "exit_code": res_rdma_link.exit_code}, + priority=EventPriority.WARNING, + ) + + 
if statistics is None and links is None and not dev_list and not link_list_text: + self.result.status = ExecutionStatus.EXECUTION_FAILURE + self.result.message = "Failed to collect RDMA data" + return self.result, None + + rdma_data = RdmaDataModel( + statistic_list=statistics if statistics is not None else [], + link_list=links if links is not None else [], + dev_list=dev_list, + link_list_text=link_list_text, + ) + if ( + not rdma_data.statistic_list + and not rdma_data.link_list + and not rdma_data.dev_list + and not rdma_data.link_list_text + ): + self.result.status = ExecutionStatus.WARNING + self.result.message = "No RDMA devices found" + return self.result, None + + self.result.message = ( + f"Collected {len(rdma_data.statistic_list)} RDMA statistics, " + f"{len(rdma_data.link_list)} RDMA links (JSON), " + f"{len(rdma_data.dev_list)} devices, {len(rdma_data.link_list_text)} links (text)" + ) + self.result.status = ExecutionStatus.OK + return self.result, rdma_data + + except Exception as e: + self._log_event( + category=EventCategory.NETWORK, + description="Error running RDMA collector", + data={"exception": get_exception_traceback(e)}, + priority=EventPriority.ERROR, + console_log=True, + ) + self.result.status = ExecutionStatus.EXECUTION_FAILURE + return self.result, None diff --git a/nodescraper/plugins/inband/rdma/rdma_plugin.py b/nodescraper/plugins/inband/rdma/rdma_plugin.py new file mode 100644 index 00000000..fac85862 --- /dev/null +++ b/nodescraper/plugins/inband/rdma/rdma_plugin.py @@ -0,0 +1,38 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +from nodescraper.base import InBandDataPlugin + +from .rdma_analyzer import RdmaAnalyzer +from .rdma_collector import RdmaCollector +from .rdmadata import RdmaDataModel + + +class RdmaPlugin(InBandDataPlugin[RdmaDataModel, None, None]): + """Plugin for collection and analysis of RDMA statistics and link data.""" + + DATA_MODEL = RdmaDataModel + COLLECTOR = RdmaCollector + ANALYZER = RdmaAnalyzer diff --git a/nodescraper/plugins/inband/rdma/rdmadata.py b/nodescraper/plugins/inband/rdma/rdmadata.py new file mode 100644 index 00000000..cb26b5b1 --- /dev/null +++ b/nodescraper/plugins/inband/rdma/rdmadata.py @@ -0,0 +1,104 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from typing import Optional + +from pydantic import BaseModel, ConfigDict, Field, model_validator +from typing_extensions import Self + +from nodescraper.models import DataModel + + +class RdmaDevice(BaseModel): + """RDMA device from 'rdma dev' (text output).""" + + device: str + node_type: Optional[str] = None + transport: Optional[str] = None + node_guid: Optional[str] = None + sys_image_guid: Optional[str] = None + state: Optional[str] = None + attributes: dict[str, str] = Field(default_factory=dict) + + +class RdmaStatistics(BaseModel): + """RDMA statistic entry from 'rdma statistic -j'.""" + + model_config = ConfigDict(extra="allow") + + ifname: Optional[str] = None + port: Optional[int] = None + + @model_validator(mode="after") + def validate_at_least_one_field(self) -> Self: + if not self.model_fields_set: + raise ValueError("At least one field must be set in RdmaStatistics") + return self + + +class RdmaLink(BaseModel): + """RDMA link entry from 'rdma link -j' (JSON).""" + + ifindex: Optional[int] = None + ifname: Optional[str] = None + port: Optional[int] = None + state: Optional[str] = None + physical_state: Optional[str] = None + netdev: Optional[str] = None + netdev_index: Optional[int] = None + + @model_validator(mode="after") + def validate_at_least_one_field(self) -> Self: + if not self.model_fields_set: + raise ValueError("At least one field must be set in RdmaLink") + return self + + +class RdmaLinkText(BaseModel): + """RDMA link from 'rdma link' (text output).""" + + device: str + port: int + state: Optional[str] = None + physical_state: Optional[str] = None + netdev: Optional[str] = None + attributes: dict[str, str] = Field(default_factory=dict) + + +class RdmaDataModel(DataModel): + """ + Data model for RDMA (Remote Direct Memory Access) statistics and link information. + + Attributes: + statistic_list: List of RDMA statistics from 'rdma statistic -j'. 
+ link_list: List of RDMA links from 'rdma link -j' (JSON). + dev_list: List of RDMA devices from 'rdma dev' (text). + link_list_text: List of RDMA links from 'rdma link' (text). + """ + + link_list: list[RdmaLink] = Field(default_factory=list) + statistic_list: list[RdmaStatistics] = Field(default_factory=list) + dev_list: list[RdmaDevice] = Field(default_factory=list) + link_list_text: list[RdmaLinkText] = Field(default_factory=list) diff --git a/nodescraper/plugins/inband/rocm/collector_args.py b/nodescraper/plugins/inband/rocm/collector_args.py new file mode 100644 index 00000000..a3be0661 --- /dev/null +++ b/nodescraper/plugins/inband/rocm/collector_args.py @@ -0,0 +1,32 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from nodescraper.models import CollectorArgs + + +class RocmCollectorArgs(CollectorArgs): + """Collector arguments for RocmPlugin.""" + + rocm_path: str = "/opt/rocm" diff --git a/nodescraper/plugins/inband/rocm/rocm_collector.py b/nodescraper/plugins/inband/rocm/rocm_collector.py index 31ea149f..7b910a69 100644 --- a/nodescraper/plugins/inband/rocm/rocm_collector.py +++ b/nodescraper/plugins/inband/rocm/rocm_collector.py @@ -31,40 +31,48 @@ from nodescraper.models import TaskResult from nodescraper.utils import strip_ansi_codes +from .collector_args import RocmCollectorArgs from .rocmdata import RocmDataModel -class RocmCollector(InBandDataCollector[RocmDataModel, None]): +class RocmCollector(InBandDataCollector[RocmDataModel, RocmCollectorArgs]): """Collect ROCm version data""" SUPPORTED_OS_FAMILY: set[OSFamily] = {OSFamily.LINUX} DATA_MODEL = RocmDataModel - CMD_VERSION_PATHS = [ - "/opt/rocm/.info/version-rocm", - "/opt/rocm/.info/version", - ] - CMD_ROCM_SUB_VERSIONS = "grep . -r /opt/rocm/.info/*" - CMD_ROCMINFO = "{rocm_path}/bin/rocminfo" - CMD_ROCM_LATEST = "ls -v -d /opt/rocm-[3-7]* | tail -1" - CMD_ROCM_DIRS = "ls -v -d /opt/rocm*" + CMD_ROCM_SUB_VERSIONS_TMPL = "grep . 
-r {rocm_path}/.info/*" + CMD_ROCMINFO_TMPL = "{rocm_path}/bin/rocminfo" + CMD_ROCM_LATEST_TMPL = "ls -v -d {rocm_path}-[3-7]* | tail -1" + CMD_ROCM_DIRS_TMPL = "ls -v -d {rocm_path}*" CMD_LD_CONF = "grep -i -E 'rocm' /etc/ld.so.conf.d/*" CMD_ROCM_LIBS = "ldconfig -p | grep -i -E 'rocm'" CMD_ENV_VARS = "env | grep -Ei 'rocm|hsa|hip|mpi|openmp|ucx|miopen'" - CMD_CLINFO = "{rocm_path}/opencl/bin/*/clinfo" + CMD_CLINFO_TMPL = "{rocm_path}/opencl/bin/*/clinfo" CMD_KFD_PROC = "ls /sys/class/kfd/kfd/proc/" - def collect_data(self, args=None) -> tuple[TaskResult, Optional[RocmDataModel]]: + def collect_data( + self, args: Optional[RocmCollectorArgs] = None + ) -> tuple[TaskResult, Optional[RocmDataModel]]: """Collect ROCm version data from the system. Returns: tuple[TaskResult, Optional[RocmDataModel]]: tuple containing the task result and ROCm data model if available. """ + if args is None: + args = RocmCollectorArgs() + version_paths = [ + f"{args.rocm_path}/.info/version-rocm", + f"{args.rocm_path}/.info/version", + ] + rocm_data = None rocm_sub_versions = {} # First, try to collect all sub-versions - sub_versions_res = self._run_sut_cmd(self.CMD_ROCM_SUB_VERSIONS) + sub_versions_res = self._run_sut_cmd( + self.CMD_ROCM_SUB_VERSIONS_TMPL.format(rocm_path=args.rocm_path) + ) if sub_versions_res.exit_code == 0: for line in sub_versions_res.stdout.splitlines(): if ":" in line: @@ -75,7 +83,7 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[RocmDataModel]]: rocm_sub_versions[key.strip()] = value.strip() # Determine the main ROCm version - for path in self.CMD_VERSION_PATHS: + for path in version_paths: res = self._run_sut_cmd(f"grep . 
{path}") if res.exit_code == 0: try: @@ -105,7 +113,7 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[RocmDataModel]]: else: self._log_event( category=EventCategory.OS, - description=f"Unable to read ROCm version from {self.CMD_VERSION_PATHS}", + description=f"Unable to read ROCm version from {version_paths}", data={"raw_output": res.stdout}, priority=EventPriority.ERROR, ) @@ -113,12 +121,16 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[RocmDataModel]]: # Collect additional ROCm data if version was found if rocm_data: # Collect latest versioned ROCm path (rocm-[3-7]*) - versioned_path_res = self._run_sut_cmd(self.CMD_ROCM_LATEST) + versioned_path_res = self._run_sut_cmd( + self.CMD_ROCM_LATEST_TMPL.format(rocm_path=args.rocm_path) + ) if versioned_path_res.exit_code == 0: rocm_data.rocm_latest_versioned_path = versioned_path_res.stdout.strip() # Collect all ROCm paths as list - all_paths_res = self._run_sut_cmd(self.CMD_ROCM_DIRS) + all_paths_res = self._run_sut_cmd( + self.CMD_ROCM_DIRS_TMPL.format(rocm_path=args.rocm_path) + ) if all_paths_res.exit_code == 0: rocm_data.rocm_all_paths = [ path.strip() @@ -126,11 +138,8 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[RocmDataModel]]: if path.strip() ] - # Determine ROCm path for commands that need it - rocm_path = rocm_data.rocm_latest_versioned_path or "/opt/rocm" - # Collect rocminfo output as list of lines with ANSI codes stripped - rocminfo_cmd = self.CMD_ROCMINFO.format(rocm_path=rocm_path) + rocminfo_cmd = self.CMD_ROCMINFO_TMPL.format(rocm_path=args.rocm_path) rocminfo_res = self._run_sut_cmd(rocminfo_cmd) rocminfo_artifact_content = "" if rocminfo_res.exit_code == 0: @@ -167,7 +176,7 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[RocmDataModel]]: ] # Collect clinfo output - clinfo_cmd = self.CMD_CLINFO.format(rocm_path=rocm_path) + clinfo_cmd = self.CMD_CLINFO_TMPL.format(rocm_path=args.rocm_path) clinfo_res = 
self._run_sut_cmd(clinfo_cmd) # Always append clinfo section to artifact, even if empty or failed diff --git a/nodescraper/plugins/inband/rocm/rocm_plugin.py b/nodescraper/plugins/inband/rocm/rocm_plugin.py index 9a3cfa3d..b80db0cc 100644 --- a/nodescraper/plugins/inband/rocm/rocm_plugin.py +++ b/nodescraper/plugins/inband/rocm/rocm_plugin.py @@ -26,18 +26,21 @@ from nodescraper.base import InBandDataPlugin from .analyzer_args import RocmAnalyzerArgs +from .collector_args import RocmCollectorArgs from .rocm_analyzer import RocmAnalyzer from .rocm_collector import RocmCollector from .rocmdata import RocmDataModel -class RocmPlugin(InBandDataPlugin[RocmDataModel, None, RocmAnalyzerArgs]): +class RocmPlugin(InBandDataPlugin[RocmDataModel, RocmCollectorArgs, RocmAnalyzerArgs]): """Plugin for collection and analysis of rocm version data""" DATA_MODEL = RocmDataModel COLLECTOR = RocmCollector + COLLECTOR_ARGS = RocmCollectorArgs + ANALYZER = RocmAnalyzer ANALYZER_ARGS = RocmAnalyzerArgs diff --git a/nodescraper/plugins/inband/sys_settings/collector_args.py b/nodescraper/plugins/inband/sys_settings/collector_args.py index b1d2860c..f2e73ddd 100644 --- a/nodescraper/plugins/inband/sys_settings/collector_args.py +++ b/nodescraper/plugins/inband/sys_settings/collector_args.py @@ -29,7 +29,7 @@ class SysSettingsCollectorArgs(BaseModel): """Collection args for SysSettingsCollector. - paths: sysfs paths to read (cat). + paths: sysfs paths to read (cat). If a path contains '*', collect with ls -l instead (e.g. class/net/*/device). directory_paths: sysfs paths to list (ls -1); use for checks that match entry names by regex. 
""" diff --git a/nodescraper/plugins/inband/sys_settings/sys_settings_collector.py b/nodescraper/plugins/inband/sys_settings/sys_settings_collector.py index 753abbd0..af51a021 100644 --- a/nodescraper/plugins/inband/sys_settings/sys_settings_collector.py +++ b/nodescraper/plugins/inband/sys_settings/sys_settings_collector.py @@ -33,7 +33,6 @@ from .collector_args import SysSettingsCollectorArgs from .sys_settings_data import SysSettingsDataModel -# Sysfs format: "[always] madvise never" -> extract bracketed value BRACKETED_RE = re.compile(r"\[(\w+)\]") @@ -103,6 +102,7 @@ class SysSettingsCollector(InBandDataCollector[SysSettingsDataModel, SysSettings CMD = "cat /sys/{}" CMD_LS = "ls -1 /sys/{}" + CMD_LS_LONG = "ls -l /sys/{}" def collect_data( self, args: Optional[SysSettingsCollectorArgs] = None @@ -152,18 +152,32 @@ def collect_data( ) continue full_path = _sysfs_full_path(suffix) - res = self._run_sut_cmd(self.CMD.format(suffix), sudo=False) - if res.exit_code == 0 and res.stdout: - value = _parse_bracketed_setting(res.stdout) or res.stdout.strip() - readings[full_path] = value + if "*" in suffix: + cmd = self.CMD_LS_LONG.format(suffix) + res = self._run_sut_cmd(f"bash -c {cmd!r}", sudo=False) + if res.exit_code == 0: + readings[full_path] = res.stdout.strip() if res.stdout else "" + else: + self._log_event( + category=EventCategory.OS, + description=f"Failed to run ls -l for sysfs path: {full_path}", + data={"exit_code": res.exit_code}, + priority=EventPriority.WARNING, + console_log=True, + ) else: - self._log_event( - category=EventCategory.OS, - description=f"Failed to read sysfs path: {full_path}", - data={"exit_code": res.exit_code}, - priority=EventPriority.WARNING, - console_log=True, - ) + res = self._run_sut_cmd(self.CMD.format(suffix), sudo=False) + if res.exit_code == 0 and res.stdout: + value = _parse_bracketed_setting(res.stdout) or res.stdout.strip() + readings[full_path] = value + else: + self._log_event( + category=EventCategory.OS, + 
description=f"Failed to read sysfs path: {full_path}", + data={"exit_code": res.exit_code}, + priority=EventPriority.WARNING, + console_log=True, + ) for path in directory_paths: suffix = _path_under_sys(path) diff --git a/nodescraper/plugins/ooband/__init__.py b/nodescraper/plugins/ooband/__init__.py new file mode 100644 index 00000000..b75b2eb0 --- /dev/null +++ b/nodescraper/plugins/ooband/__init__.py @@ -0,0 +1,26 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +"""Out-of-band (OOB) plugins: Redfish and other BMC/remote management plugins.""" diff --git a/nodescraper/plugins/ooband/redfish_endpoint/__init__.py b/nodescraper/plugins/ooband/redfish_endpoint/__init__.py new file mode 100644 index 00000000..84293e72 --- /dev/null +++ b/nodescraper/plugins/ooband/redfish_endpoint/__init__.py @@ -0,0 +1,40 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from .analyzer_args import RedfishEndpointAnalyzerArgs +from .collector_args import RedfishEndpointCollectorArgs +from .endpoint_analyzer import RedfishEndpointAnalyzer +from .endpoint_collector import RedfishEndpointCollector +from .endpoint_data import RedfishEndpointDataModel +from .endpoint_plugin import RedfishEndpointPlugin + +__all__ = [ + "RedfishEndpointAnalyzer", + "RedfishEndpointAnalyzerArgs", + "RedfishEndpointCollector", + "RedfishEndpointCollectorArgs", + "RedfishEndpointDataModel", + "RedfishEndpointPlugin", +] diff --git a/nodescraper/plugins/ooband/redfish_endpoint/analyzer_args.py b/nodescraper/plugins/ooband/redfish_endpoint/analyzer_args.py new file mode 100644 index 00000000..9162980e --- /dev/null +++ b/nodescraper/plugins/ooband/redfish_endpoint/analyzer_args.py @@ -0,0 +1,67 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +from enum import Enum +from typing import Any, Union + +from pydantic import Field + +from nodescraper.models import AnalyzerArgs + + +class ConstraintKey(str, Enum): + """Keys used in Redfish constraint dicts (e.g. in analyzer checks config). + + Naming aligns with JSON Schema combining: anyOf = value must match any of the list (OR). + oneOf in JSON Schema means exactly one (XOR); we use anyOf for \"value in allowed list\". + """ + + EQ = "eq" + MIN = "min" + MAX = "max" + ANY_OF = "anyOf" + + +RedfishConstraint = Union[int, float, str, bool, dict[str, Any]] + + +class RedfishEndpointAnalyzerArgs(AnalyzerArgs): + """Analyzer args for config-driven Redfish checks.""" + + checks: dict[str, dict[str, RedfishConstraint]] = Field( + default_factory=dict, + description=( + "Map: URI or '*' -> { property_path: constraint }. " + "URI keys must match a key in the collected responses (exact match). " + "Use '*' as the key to apply the inner constraints to every collected response body. " + "Property paths use '/' for nesting and indices, e.g. 'Status/Health', 'PowerControl/0/PowerConsumedWatts'. " + "Constraints: " + "'eq' — value must equal the given literal (int, float, str, bool). " + "'min' — value must be numeric and >= the given number. " + "'max' — value must be numeric and <= the given number. " + "'anyOf' — value must be in the given list (OR; any match passes). " + 'Example: { "/redfish/v1/Systems/1": { "Status/Health": { "anyOf": ["OK", "Warning"] }, "PowerState": "On" }, "*": { "Status/Health": { "anyOf": ["OK"] } } }.' 
+ ), + ) diff --git a/nodescraper/plugins/ooband/redfish_endpoint/collector_args.py b/nodescraper/plugins/ooband/redfish_endpoint/collector_args.py new file mode 100644 index 00000000..c63cd8db --- /dev/null +++ b/nodescraper/plugins/ooband/redfish_endpoint/collector_args.py @@ -0,0 +1,40 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from pydantic import BaseModel, Field, field_validator + + +class RedfishEndpointCollectorArgs(BaseModel): + """Collection args: uris to GET.""" + + uris: list[str] = Field(default_factory=list) + + @field_validator("uris", mode="before") + @classmethod + def strip_uris(cls, v: list[str]) -> list[str]: + """Strip whitespace from each URI in the list.""" + if not v: + return v + return [str(uri).strip() for uri in v] diff --git a/nodescraper/plugins/ooband/redfish_endpoint/endpoint_analyzer.py b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_analyzer.py new file mode 100644 index 00000000..6323a437 --- /dev/null +++ b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_analyzer.py @@ -0,0 +1,145 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +from typing import Any, Optional + +from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus +from nodescraper.interfaces import DataAnalyzer +from nodescraper.models import TaskResult + +from .analyzer_args import ConstraintKey, RedfishConstraint, RedfishEndpointAnalyzerArgs +from .endpoint_data import RedfishEndpointDataModel + + +def _get_by_path(obj: Any, path: str) -> Any: + """Get a value from a nested dict/list.""" + if not path.strip(): + return obj + current: Any = obj + for segment in path.strip().split("/"): + if not segment: + continue + if current is None: + return None + if isinstance(current, list): + try: + idx = int(segment) + current = current[idx] if 0 <= idx < len(current) else None + except ValueError: + return None + elif isinstance(current, dict): + current = current.get(segment) + else: + return None + return current + + +def _check_constraint(actual: Any, constraint: RedfishConstraint) -> tuple[bool, str]: + """Compare actual value to constraint.""" + if isinstance(constraint, dict): + if ConstraintKey.EQ in constraint: + ok = actual == constraint[ConstraintKey.EQ] + return ok, f"expected eq {constraint[ConstraintKey.EQ]}, got {actual!r}" + if ConstraintKey.MIN in constraint or ConstraintKey.MAX in constraint: + try: + val = float(actual) if actual is not None else None + if val is None: + return False, f"expected numeric, got {actual!r}" + if ConstraintKey.MIN in constraint and val < constraint[ConstraintKey.MIN]: + return False, f"value {val} below min {constraint[ConstraintKey.MIN]}" + if ConstraintKey.MAX in constraint and val > constraint[ConstraintKey.MAX]: + return 
False, f"value {val} above max {constraint[ConstraintKey.MAX]}" + return True, "" + except (TypeError, ValueError): + return False, f"expected numeric, got {actual!r}" + if ConstraintKey.ANY_OF in constraint: + allowed = constraint[ConstraintKey.ANY_OF] + if not isinstance(allowed, list): + return False, "anyOf must be a list" + ok = actual in allowed + return ok, f"expected any of {allowed}, got {actual!r}" + ok = actual == constraint + return ok, f"expected {constraint!r}, got {actual!r}" + + +class RedfishEndpointAnalyzer(DataAnalyzer[RedfishEndpointDataModel, RedfishEndpointAnalyzerArgs]): + """Checks Redfish endpoint responses against configured thresholds and key/value rules.""" + + DATA_MODEL = RedfishEndpointDataModel + + def analyze_data( + self, + data: RedfishEndpointDataModel, + args: Optional[RedfishEndpointAnalyzerArgs] = None, + ) -> TaskResult: + """Evaluate each configured check against the collected Redfish responses.""" + if not args or not args.checks: + self.result.status = ExecutionStatus.OK + self.result.message = "No checks configured" + return self.result + + failed: list[dict[str, Any]] = [] + for uri, path_constraints in args.checks.items(): + if uri == "*": + bodies = list(data.responses.values()) + else: + body = data.responses.get(uri) + bodies = [body] if body is not None else [] + if not bodies: + if uri != "*": + failed.append( + {"uri": uri, "path": None, "reason": "URI not in collected responses"} + ) + continue + for resp in bodies: + for path, constraint in path_constraints.items(): + actual = _get_by_path(resp, path) + ok, msg = _check_constraint(actual, constraint) + if not ok: + failed.append( + { + "uri": uri, + "path": path, + "expected": constraint, + "actual": actual, + "reason": msg, + } + ) + + if failed: + details = "; ".join(f"{f['uri']} {f['path']}: {f['reason']}" for f in failed) + self._log_event( + category=EventCategory.TELEMETRY, + description=f"Redfish endpoint checks failed: {len(failed)} failure(s) — 
{details}", + data={"failures": failed}, + priority=EventPriority.WARNING, + console_log=True, + ) + self.result.status = ExecutionStatus.ERROR + self.result.message = f"{len(failed)} check(s) failed" + else: + self.result.status = ExecutionStatus.OK + self.result.message = "All Redfish endpoint checks passed" + return self.result diff --git a/nodescraper/plugins/ooband/redfish_endpoint/endpoint_collector.py b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_collector.py new file mode 100644 index 00000000..39dacf79 --- /dev/null +++ b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_collector.py @@ -0,0 +1,86 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from typing import Optional + +from nodescraper.base import RedfishDataCollector +from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus +from nodescraper.models import TaskResult + +from .collector_args import RedfishEndpointCollectorArgs +from .endpoint_data import RedfishEndpointDataModel + + +def _uris_from_args(args: Optional[RedfishEndpointCollectorArgs]) -> list[str]: + """Return list of URIs from collector args.uris.""" + if args is None: + return [] + return list(args.uris) if args.uris else [] + + +class RedfishEndpointCollector( + RedfishDataCollector[RedfishEndpointDataModel, RedfishEndpointCollectorArgs] +): + """Collects Redfish endpoint responses for URIs specified in config.""" + + DATA_MODEL = RedfishEndpointDataModel + + def collect_data( + self, args: Optional[RedfishEndpointCollectorArgs] = None + ) -> tuple[TaskResult, Optional[RedfishEndpointDataModel]]: + """GET each configured Redfish URI via _run_redfish_get() and store the JSON response.""" + uris = _uris_from_args(args) + if not uris: + self.result.message = "No Redfish URIs configured" + self.result.status = ExecutionStatus.NOT_RAN + return self.result, None + + responses: dict[str, dict] = {} + for uri in uris: + path = uri + if not path: + continue + if not path.startswith("/"): + path = "/" + path + res = self._run_redfish_get(path, log_artifact=True) + if res.success and res.data is not None: + responses[res.path] = res.data + else: + self._log_event( + category=EventCategory.RUNTIME, + description=f"Redfish GET failed for {path}: {res.error or 'unknown'}", + priority=EventPriority.WARNING, + console_log=True, + ) + + if not responses: + self.result.message = "No Redfish endpoints could be read" + self.result.status = ExecutionStatus.ERROR + return self.result, None + + data = RedfishEndpointDataModel(responses=responses) + self.result.message = f"Collected {len(responses)} 
Redfish endpoint(s)" + self.result.status = ExecutionStatus.OK + return self.result, data diff --git a/nodescraper/plugins/ooband/redfish_endpoint/endpoint_data.py b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_data.py new file mode 100644 index 00000000..19145485 --- /dev/null +++ b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_data.py @@ -0,0 +1,34 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+#
+###############################################################################
+from pydantic import Field
+
+from nodescraper.models import DataModel
+
+
+class RedfishEndpointDataModel(DataModel):
+    """Collected Redfish endpoint responses: URI -> JSON body."""
+
+    responses: dict[str, dict] = Field(default_factory=dict)
diff --git a/nodescraper/plugins/ooband/redfish_endpoint/endpoint_plugin.py b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_plugin.py
new file mode 100644
index 00000000..029bcc88
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_endpoint/endpoint_plugin.py
@@ -0,0 +1,48 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from nodescraper.base import OOBandDataPlugin
+
+from .analyzer_args import RedfishEndpointAnalyzerArgs
+from .collector_args import RedfishEndpointCollectorArgs
+from .endpoint_analyzer import RedfishEndpointAnalyzer
+from .endpoint_collector import RedfishEndpointCollector
+from .endpoint_data import RedfishEndpointDataModel
+
+
+class RedfishEndpointPlugin(
+    OOBandDataPlugin[
+        RedfishEndpointDataModel,
+        RedfishEndpointCollectorArgs,
+        RedfishEndpointAnalyzerArgs,
+    ]
+):
+    """Config-driven plugin: collect from Redfish URIs and check against thresholds/key-values."""
+
+    DATA_MODEL = RedfishEndpointDataModel
+    COLLECTOR = RedfishEndpointCollector
+    ANALYZER = RedfishEndpointAnalyzer
+    COLLECTOR_ARGS = RedfishEndpointCollectorArgs
+    ANALYZER_ARGS = RedfishEndpointAnalyzerArgs
diff --git a/nodescraper/plugins/ooband/redfish_oem_diag/__init__.py b/nodescraper/plugins/ooband/redfish_oem_diag/__init__.py
new file mode 100644
index 00000000..1a761c81
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_oem_diag/__init__.py
@@ -0,0 +1,22 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+###############################################################################
+from .analyzer_args import RedfishOemDiagAnalyzerArgs
+from .collector_args import RedfishOemDiagCollectorArgs
+from .oem_diag_analyzer import RedfishOemDiagAnalyzer
+from .oem_diag_collector import RedfishOemDiagCollector
+from .oem_diag_data import RedfishOemDiagDataModel
+from .oem_diag_plugin import RedfishOemDiagPlugin
+
+__all__ = [
+    "RedfishOemDiagAnalyzer",
+    "RedfishOemDiagAnalyzerArgs",
+    "RedfishOemDiagCollector",
+    "RedfishOemDiagCollectorArgs",
+    "RedfishOemDiagDataModel",
+    "RedfishOemDiagPlugin",
+]
diff --git a/nodescraper/plugins/ooband/redfish_oem_diag/analyzer_args.py b/nodescraper/plugins/ooband/redfish_oem_diag/analyzer_args.py
new file mode 100644
index 00000000..60759a94
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_oem_diag/analyzer_args.py
@@ -0,0 +1,37 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from pydantic import Field
+
+from nodescraper.models import AnalyzerArgs
+
+
+class RedfishOemDiagAnalyzerArgs(AnalyzerArgs):
+    """Analyzer args for Redfish OEM diagnostic log results."""
+
+    require_all_success: bool = Field(
+        default=False,
+        description="If True, analysis fails when any OEM type collection failed.",
+    )
diff --git a/nodescraper/plugins/ooband/redfish_oem_diag/collector_args.py b/nodescraper/plugins/ooband/redfish_oem_diag/collector_args.py
new file mode 100644
index 00000000..da5bd50c
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_oem_diag/collector_args.py
@@ -0,0 +1,63 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from __future__ import annotations
+
+from typing import Optional
+
+from pydantic import BaseModel, Field, model_validator
+
+DEFAULT_TASK_TIMEOUT_S = 1800
+
+
+class RedfishOemDiagCollectorArgs(BaseModel):
+    """Collector/analyzer args for Redfish OEM diagnostic log collection."""
+
+    log_service_path: str = Field(
+        default="redfish/v1/Systems/UBB/LogServices/DiagLogs",
+        description="Redfish path to the LogService (e.g. DiagLogs).",
+    )
+    oem_diagnostic_types_allowable: Optional[list[str]] = Field(
+        default=None,
+        description="Allowable OEM diagnostic types for this architecture/BMC. When set, used for validation and as default for oem_diagnostic_types when empty.",
+    )
+    oem_diagnostic_types: list[str] = Field(
+        default_factory=list,
+        description="OEM diagnostic types to collect. When empty and oem_diagnostic_types_allowable is set, defaults to that list.",
+    )
+    task_timeout_s: int = Field(
+        default=DEFAULT_TASK_TIMEOUT_S,
+        ge=1,
+        le=3600,
+        description="Max seconds to wait for each BMC task.",
+    )
+
+    @model_validator(mode="after")
+    def _default_oem_diagnostic_types(self) -> RedfishOemDiagCollectorArgs:
+        if not self.oem_diagnostic_types and self.oem_diagnostic_types_allowable:
+            return self.model_copy(
+                update={"oem_diagnostic_types": list(self.oem_diagnostic_types_allowable)}
+            )
+        return self
diff --git a/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_analyzer.py b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_analyzer.py
new file mode 100644
index 00000000..c54d9e2f
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_analyzer.py
@@ -0,0 +1,69 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from typing import Optional
+
+from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus
+from nodescraper.interfaces import DataAnalyzer
+from nodescraper.models import TaskResult
+
+from .analyzer_args import RedfishOemDiagAnalyzerArgs
+from .oem_diag_data import RedfishOemDiagDataModel
+
+
+class RedfishOemDiagAnalyzer(DataAnalyzer[RedfishOemDiagDataModel, RedfishOemDiagAnalyzerArgs]):
+    """Analyzes Redfish OEM diagnostic log collection results."""
+
+    DATA_MODEL = RedfishOemDiagDataModel
+
+    def analyze_data(
+        self,
+        data: RedfishOemDiagDataModel,
+        args: Optional[RedfishOemDiagAnalyzerArgs] = None,
+    ) -> TaskResult:
+        """Check collection results; optionally fail if any type failed."""
+        if not data.results:
+            self.result.message = "No OEM diagnostic results to analyze"
+            self.result.status = ExecutionStatus.NOT_RAN
+            return self.result
+
+        failed = [t for t, r in data.results.items() if not r.success]
+        success_count = len(data.results) - len(failed)
+
+        if args and args.require_all_success and failed:
+            self._log_event(
+                category=EventCategory.RUNTIME,
+                description=f"OEM diag types failed: {failed}",
+                data={t: data.results[t].error for t in failed},
+                priority=EventPriority.ERROR,
+                console_log=True,
+            )
+            self.result.message = f"OEM diag: {len(failed)} type(s) failed: {failed}"
+            self.result.status = ExecutionStatus.ERROR
+            return self.result
+
+        self.result.message = f"OEM diag: {success_count}/{len(data.results)} types collected"
+        self.result.status = ExecutionStatus.OK
+        return self.result
diff --git a/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_collector.py b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_collector.py
new file mode 100644
index 00000000..b406ef38
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_collector.py
@@ -0,0 +1,109 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from pathlib import Path
+from typing import Any, Optional
+
+from nodescraper.base import RedfishDataCollector
+from nodescraper.connection.redfish import collect_oem_diagnostic_data
+from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus
+from nodescraper.models import TaskResult
+from nodescraper.utils import pascal_to_snake
+
+from .collector_args import RedfishOemDiagCollectorArgs
+from .oem_diag_data import OemDiagTypeResult, RedfishOemDiagDataModel
+
+
+class RedfishOemDiagCollector(
+    RedfishDataCollector[RedfishOemDiagDataModel, RedfishOemDiagCollectorArgs]
+):
+    """Collects Redfish OEM diagnostic logs (e.g. JournalControl, AllLogs) via LogService.CollectDiagnosticData."""
+
+    DATA_MODEL = RedfishOemDiagDataModel
+
+    def __init__(self, *args: Any, **kwargs: Any) -> None:
+        self.log_path = kwargs.pop("log_path", None)
+        super().__init__(*args, **kwargs)
+
+    def collect_data(
+        self, args: Optional[RedfishOemDiagCollectorArgs] = None
+    ) -> tuple[TaskResult, Optional[RedfishOemDiagDataModel]]:
+        """Run OEM diagnostic collection for each type in args.oem_diagnostic_types."""
+        if args is None:
+            args = RedfishOemDiagCollectorArgs()
+        types_to_collect = list(args.oem_diagnostic_types) if args.oem_diagnostic_types else []
+        if not types_to_collect:
+            self.result.message = "No OEM diagnostic types configured"
+            self.result.status = ExecutionStatus.NOT_RAN
+            return self.result, None
+
+        if self.log_path:
+            output_dir = (
+                Path(self.log_path)
+                / pascal_to_snake(self.parent or "")
+                / pascal_to_snake(self.__class__.__name__)
+                / "diag_logs"
+            ).resolve()
+        else:
+            output_dir = None
+
+        if output_dir is not None:
+            output_dir.mkdir(parents=True, exist_ok=True)
+            self.logger.info(
+                "(RedfishOemDiagPlugin) Diagnostic archives (e.g. *.tar.xz) will be written to: %s",
+                output_dir,
+            )
+
+        results: dict[str, OemDiagTypeResult] = {}
+        validate = bool(args.oem_diagnostic_types_allowable)
+        gathered = []
+        for oem_type in types_to_collect:
+            log_bytes, metadata, err = collect_oem_diagnostic_data(
+                self.connection,
+                log_service_path=args.log_service_path,
+                oem_diagnostic_type=oem_type,
+                task_timeout_s=args.task_timeout_s,
+                output_dir=output_dir,
+                validate_type=validate,
+                allowed_types=args.oem_diagnostic_types_allowable,
+                logger=self.logger,
+            )
+            gathered.append((oem_type, log_bytes, metadata, err))
+        for oem_type, _log_bytes, metadata, err in gathered:
+            if err:
+                self._log_event(
+                    category=EventCategory.RUNTIME,
+                    description=f"OEM diag {oem_type!r}: {err}",
+                    priority=EventPriority.WARNING,
+                    console_log=True,
+                )
+                results[oem_type] = OemDiagTypeResult(success=False, error=err, metadata=None)
+            else:
+                results[oem_type] = OemDiagTypeResult(success=True, error=None, metadata=metadata)
+
+        success_count = sum(1 for r in results.values() if r.success)
+        self.result.message = f"OEM diag: {success_count}/{len(results)} types collected"
+        self.result.status = ExecutionStatus.OK if success_count else ExecutionStatus.ERROR
+        return self.result, RedfishOemDiagDataModel(results=results)
diff --git a/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_data.py b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_data.py
new file mode 100644
index 00000000..db3f5b59
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_data.py
@@ -0,0 +1,44 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from typing import Any, Optional
+
+from pydantic import Field
+
+from nodescraper.models import DataModel
+
+
+class OemDiagTypeResult(DataModel):
+    """Result of collecting one OEM diagnostic log type (no raw bytes; JSON-serializable)."""
+
+    success: bool = False
+    error: Optional[str] = None
+    metadata: Optional[dict[str, Any]] = None
+
+
+class RedfishOemDiagDataModel(DataModel):
+    """Collected Redfish OEM diagnostic log results: OEM type -> result (success, error, metadata)."""
+
+    results: dict[str, OemDiagTypeResult] = Field(default_factory=dict)
diff --git a/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_plugin.py b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_plugin.py
new file mode 100644
index 00000000..6fd2ee5a
--- /dev/null
+++ b/nodescraper/plugins/ooband/redfish_oem_diag/oem_diag_plugin.py
@@ -0,0 +1,52 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from nodescraper.base import OOBandDataPlugin
+
+from .analyzer_args import RedfishOemDiagAnalyzerArgs
+from .collector_args import RedfishOemDiagCollectorArgs
+from .oem_diag_analyzer import RedfishOemDiagAnalyzer
+from .oem_diag_collector import RedfishOemDiagCollector
+from .oem_diag_data import RedfishOemDiagDataModel
+
+
+class RedfishOemDiagPlugin(
+    OOBandDataPlugin[
+        RedfishOemDiagDataModel,
+        RedfishOemDiagCollectorArgs,
+        RedfishOemDiagAnalyzerArgs,
+    ]
+):
+    """Collect Redfish OEM diagnostic logs (e.g. JournalControl, AllLogs, Dmesg) via LogService.CollectDiagnosticData.
+
+    Uses RedfishConnectionManager. Configure log_service_path, oem_diagnostic_types, and task_timeout_s
+    in collection_args; use analysis_args.require_all_success to fail if any type fails.
+    """
+
+    DATA_MODEL = RedfishOemDiagDataModel
+    COLLECTOR = RedfishOemDiagCollector
+    ANALYZER = RedfishOemDiagAnalyzer
+    COLLECTOR_ARGS = RedfishOemDiagCollectorArgs
+    ANALYZER_ARGS = RedfishOemDiagAnalyzerArgs
diff --git a/nodescraper/utils.py b/nodescraper/utils.py
index 96dd093a..3b9edf34 100644
--- a/nodescraper/utils.py
+++ b/nodescraper/utils.py
@@ -28,7 +28,18 @@
 import re
 import traceback
 from enum import Enum
-from typing import Any, List, Optional, Set, Type, TypeVar, Union, get_args, get_origin
+from typing import (
+    Any,
+    Dict,
+    List,
+    Optional,
+    Set,
+    Type,
+    TypeVar,
+    Union,
+    get_args,
+    get_origin,
+)
 
 T = TypeVar("T")
 
@@ -90,6 +101,28 @@ def str_or_none(val: object) -> Optional[str]:
     return s if s else None
 
 
+MAX_STDERR_STDOUT_LENGTH_IN_EVENT = 4096
+
+
+def has_command_error_output(stderr: str, stdout: str) -> bool:
+    """True if the tool reported anything to stderr (errors are typically written to stderr)."""
+    return bool((stderr or "").strip())
+
+
+def command_result_event_data(
+    res: Any,
+    max_length: int = MAX_STDERR_STDOUT_LENGTH_IN_EVENT,
+) -> Dict[str, Any]:
+    """Build event data dict from a command result (stderr and optionally stdout)."""
+    stderr = (getattr(res, "stderr", None) or "")[:max_length]
+    exit_code = getattr(res, "exit_code", None)
+    data: Dict[str, Any] = {"exit_code": exit_code, "stderr": stderr}
+    stdout = getattr(res, "stdout", None) or ""
+    if stdout and (exit_code != 0 or (stderr or "").strip()):
+        data["stdout"] = stdout[:max_length]
+    return data
+
+
 def convert_to_bytes(value: str, si=False) -> int:
     """
     Convert human-readable memory sizes (like GB, MB) to bytes.
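For reference, the truncation and filtering behavior of the new `command_result_event_data` helper can be exercised standalone. The sketch below copies the helper body from the `nodescraper/utils.py` hunk above and feeds it toy command results via `SimpleNamespace` (the toy results are illustrative, not a real nodescraper command-result type):

```python
from types import SimpleNamespace
from typing import Any, Dict

# Mirrors MAX_STDERR_STDOUT_LENGTH_IN_EVENT from nodescraper/utils.py.
MAX_LEN = 4096


def command_result_event_data(res: Any, max_length: int = MAX_LEN) -> Dict[str, Any]:
    # Copy of the helper added in the diff, for illustration only.
    stderr = (getattr(res, "stderr", None) or "")[:max_length]
    exit_code = getattr(res, "exit_code", None)
    data: Dict[str, Any] = {"exit_code": exit_code, "stderr": stderr}
    stdout = getattr(res, "stdout", None) or ""
    # stdout is only attached when the command failed or stderr has content,
    # so healthy-but-chatty commands don't bloat event data.
    if stdout and (exit_code != 0 or (stderr or "").strip()):
        data["stdout"] = stdout[:max_length]
    return data


# Successful command with chatty stdout: stdout is omitted from the event data.
ok = SimpleNamespace(exit_code=0, stdout="lots of normal output", stderr="")
print(command_result_event_data(ok))  # {'exit_code': 0, 'stderr': ''}

# Failing command: stdout is kept (truncated to max_length) because exit_code != 0.
bad = SimpleNamespace(exit_code=1, stdout="partial output", stderr="boom")
print(command_result_event_data(bad))
```

Both stderr and stdout are sliced to `max_length`, which bounds the size of any single logged event regardless of how much output the command produced.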
diff --git a/test/functional/fixtures/nic_plugin_config_full_analyzer_args.json b/test/functional/fixtures/nic_plugin_config_full_analyzer_args.json
new file mode 100644
index 00000000..f1cc1668
--- /dev/null
+++ b/test/functional/fixtures/nic_plugin_config_full_analyzer_args.json
@@ -0,0 +1,37 @@
+{
+  "name": "NicPlugin config with all analyzer_args",
+  "desc": "NicPlugin check.",
+  "global_args": {},
+  "plugins": {
+    "NicPlugin": {
+      "collection_args": {},
+      "analysis_args": {
+        "expected_values": {
+          "niccli_list": {"require_success": true},
+          "niccli_list_devices": {"require_success": true}
+        },
+        "performance_profile_expected": "RoCE",
+        "support_rdma_disabled_values": ["0", "false", "disabled", "no", "off"],
+        "pcie_relaxed_ordering_expected": "enabled",
+        "expected_qos_prio_map": {
+          "0": 0,
+          "1": 1,
+          "2": 0,
+          "3": 1,
+          "4": 0,
+          "5": 1,
+          "6": 0,
+          "7": 1
+        },
+        "expected_qos_pfc_enabled": 255,
+        "expected_qos_tsa_map": {
+          "0": "ets",
+          "1": "ets"
+        },
+        "expected_qos_tc_bandwidth": [50, 50],
+        "require_qos_consistent_across_adapters": true
+      }
+    }
+  },
+  "result_collators": {}
+}
diff --git a/test/functional/fixtures/niccli_plugin_config.json b/test/functional/fixtures/niccli_plugin_config.json
new file mode 100644
index 00000000..b1ea0f8a
--- /dev/null
+++ b/test/functional/fixtures/niccli_plugin_config.json
@@ -0,0 +1,19 @@
+{
+  "name": "NicPlugin config",
+  "desc": "Config for NicPlugin",
+  "global_args": {},
+  "plugins": {
+    "NicPlugin": {
+      "collection_args": {},
+      "analysis_args": {
+        "performance_profile_expected": "RoCE",
+        "support_rdma_disabled_values": ["0", "false", "disabled", "no", "off"],
+        "pcie_relaxed_ordering_expected": "enabled",
+        "expected_qos_pfc_enabled": 255,
+        "expected_qos_tc_bandwidth": [50, 50],
+        "require_qos_consistent_across_adapters": true
+      }
+    }
+  },
+  "result_collators": {}
+}
diff --git a/test/functional/fixtures/rdma_plugin_config.json b/test/functional/fixtures/rdma_plugin_config.json
new file mode 100644
index 00000000..f62214b3
--- /dev/null
+++ b/test/functional/fixtures/rdma_plugin_config.json
@@ -0,0 +1,9 @@
+{
+  "global_args": {},
+  "plugins": {
+    "RdmaPlugin": {}
+  },
+  "result_collators": {},
+  "name": "RdmaPlugin config",
+  "desc": "Config for testing RdmaPlugin"
+}
diff --git a/test/functional/fixtures/redfish_connection_config.json b/test/functional/fixtures/redfish_connection_config.json
new file mode 100644
index 00000000..6a0475b9
--- /dev/null
+++ b/test/functional/fixtures/redfish_connection_config.json
@@ -0,0 +1,9 @@
+{
+  "RedfishConnectionManager": {
+    "host": "https://bmc.example.com",
+    "username": "ADMIN",
+    "password": "placeholder",
+    "verify_ssl": false,
+    "timeout_seconds": 30
+  }
+}
diff --git a/test/functional/fixtures/redfish_endpoint_plugin_config.json b/test/functional/fixtures/redfish_endpoint_plugin_config.json
new file mode 100644
index 00000000..f79cf256
--- /dev/null
+++ b/test/functional/fixtures/redfish_endpoint_plugin_config.json
@@ -0,0 +1,12 @@
+{
+  "plugins": {
+    "RedfishEndpointPlugin": {
+      "collection_args": {
+        "uris": ["/redfish/v1", "/redfish/v1/Systems"]
+      },
+      "analysis_args": {
+        "checks": {}
+      }
+    }
+  }
+}
diff --git a/test/functional/test_cli_help.py b/test/functional/test_cli_help.py
index bf815e5d..a1bd90ae 100644
--- a/test/functional/test_cli_help.py
+++ b/test/functional/test_cli_help.py
@@ -56,20 +56,6 @@ def test_help_command_long_form():
     assert "node scraper" in result.stdout.lower()
 
 
-def test_no_arguments():
-    """Test that node-scraper with no arguments runs the default config."""
-    result = subprocess.run(
-        [sys.executable, "-m", "nodescraper.cli.cli"],
-        capture_output=True,
-        text=True,
-        timeout=120,
-    )
-
-    assert len(result.stdout) > 0 or len(result.stderr) > 0
-    output = (result.stdout + result.stderr).lower()
-    assert "plugin" in output or "nodescraper" in output
-
-
 def test_help_shows_subcommands():
     """Test that help output includes available subcommands."""
     result = subprocess.run(
diff --git a/test/functional/test_nic_plugin.py b/test/functional/test_nic_plugin.py
new file mode 100644
index 00000000..ed9d28f2
--- /dev/null
+++ b/test/functional/test_nic_plugin.py
@@ -0,0 +1,137 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+
+from pathlib import Path
+
+import pytest
+
+
+@pytest.fixture
+def fixtures_dir():
+    """Return path to fixtures directory."""
+    return Path(__file__).parent / "fixtures"
+
+
+@pytest.fixture
+def nic_plugin_config_full_analyzer_args(fixtures_dir):
+    """Return path to NicPlugin config with all analyzer_args populated."""
+    return fixtures_dir / "nic_plugin_config_full_analyzer_args.json"
+
+
+@pytest.fixture
+def nic_plugin_config_minimal(fixtures_dir):
+    """Return path to minimal NicPlugin config (niccli_plugin_config.json)."""
+    return fixtures_dir / "niccli_plugin_config.json"
+
+
+def test_nic_plugin_with_full_analyzer_args_config(
+    run_cli_command, nic_plugin_config_full_analyzer_args, tmp_path
+):
+    """Test NicPlugin using config with all analyzer_args (performance_profile, getqos, etc.)."""
+    assert (
+        nic_plugin_config_full_analyzer_args.exists()
+    ), f"Config file not found: {nic_plugin_config_full_analyzer_args}"
+
+    log_path = str(tmp_path / "logs_nic_full_args")
+    result = run_cli_command(
+        [
+            "--log-path",
+            log_path,
+            "--plugin-configs",
+            str(nic_plugin_config_full_analyzer_args),
+        ],
+        check=False,
+    )
+
+    assert result.returncode in [
+        0,
+        1,
+        2,
+    ], f"Unexpected return code: {result.returncode}. stdout: {result.stdout[:500]!r}"
+    output = result.stdout + result.stderr
+    assert len(output) > 0
+    assert "NicPlugin" in output or "nic" in output.lower()
+
+
+def test_nic_plugin_with_minimal_config(run_cli_command, nic_plugin_config_minimal, tmp_path):
+    """Test NicPlugin using minimal config (default collection_args, no analysis_args)."""
+    assert nic_plugin_config_minimal.exists(), f"Config file not found: {nic_plugin_config_minimal}"
+
+    log_path = str(tmp_path / "logs_nic_minimal")
+    result = run_cli_command(
+        ["--log-path", log_path, "--plugin-configs", str(nic_plugin_config_minimal)],
+        check=False,
+    )
+
+    assert result.returncode in [
+        0,
+        1,
+        2,
+    ], f"Unexpected return code: {result.returncode}. stdout: {result.stdout[:500]!r}"
+    output = result.stdout + result.stderr
+    assert len(output) > 0
+    assert "NicPlugin" in output or "nic" in output.lower()
+
+
+def test_nic_plugin_with_run_plugins_subcommand(run_cli_command, tmp_path):
+    """Test NicPlugin via run-plugins subcommand (no config)."""
+    log_path = str(tmp_path / "logs_nic_subcommand")
+    result = run_cli_command(["--log-path", log_path, "run-plugins", "NicPlugin"], check=False)
+
+    assert result.returncode in [
+        0,
+        1,
+        2,
+    ], f"Unexpected return code: {result.returncode}. stdout: {result.stdout[:500]!r}"
+    output = result.stdout + result.stderr
+    assert len(output) > 0
+    assert "NicPlugin" in output or "nic" in output.lower()
+
+
+def test_nic_plugin_full_config_validates_analysis_args(
+    run_cli_command, nic_plugin_config_full_analyzer_args, tmp_path
+):
+    """Config with all analyzer_args loads and runs without validation error."""
+    assert nic_plugin_config_full_analyzer_args.exists()
+
+    log_path = str(tmp_path / "logs_nic_validate")
+    result = run_cli_command(
+        [
+            "--log-path",
+            log_path,
+            "--plugin-configs",
+            str(nic_plugin_config_full_analyzer_args),
+        ],
+        check=False,
+    )
+
+    assert result.returncode in [
+        0,
+        1,
+        2,
+    ], f"Unexpected return code: {result.returncode}. stdout: {result.stdout[:500]!r}"
+    output = result.stdout + result.stderr
+    assert "NicPlugin" in output
diff --git a/test/functional/test_plugin_configs.py b/test/functional/test_plugin_configs.py
index ec83223c..768e9e6e 100644
--- a/test/functional/test_plugin_configs.py
+++ b/test/functional/test_plugin_configs.py
@@ -51,10 +51,12 @@ def plugin_config_files(fixtures_dir):
         "KernelPlugin": fixtures_dir / "kernel_plugin_config.json",
         "KernelModulePlugin": fixtures_dir / "kernel_module_plugin_config.json",
         "MemoryPlugin": fixtures_dir / "memory_plugin_config.json",
+        "NicPlugin": fixtures_dir / "niccli_plugin_config.json",
         "NvmePlugin": fixtures_dir / "nvme_plugin_config.json",
         "OsPlugin": fixtures_dir / "os_plugin_config.json",
         "PackagePlugin": fixtures_dir / "package_plugin_config.json",
         "ProcessPlugin": fixtures_dir / "process_plugin_config.json",
+        "RdmaPlugin": fixtures_dir / "rdma_plugin_config.json",
         "RocmPlugin": fixtures_dir / "rocm_plugin_config.json",
         "StoragePlugin": fixtures_dir / "storage_plugin_config.json",
         "SysSettingsPlugin": fixtures_dir / "sys_settings_plugin_config.json",
@@ -118,6 +120,7 @@ def test_plugin_config_with_builtin_config(run_cli_command, tmp_path):
         "OsPlugin",
         "PackagePlugin",
         "ProcessPlugin",
+        "RdmaPlugin",
         "RocmPlugin",
         "StoragePlugin",
         "SysSettingsPlugin",
diff --git a/test/functional/test_rdma_plugin.py b/test/functional/test_rdma_plugin.py
new file mode 100644
index 00000000..862de3b8
--- /dev/null
+++ b/test/functional/test_rdma_plugin.py
@@ -0,0 +1,106 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +"""Functional tests for RdmaPlugin with --plugin-configs.""" + +from pathlib import Path + +import pytest + + +@pytest.fixture +def fixtures_dir(): + """Return path to fixtures directory.""" + return Path(__file__).parent / "fixtures" + + +@pytest.fixture +def rdma_config_file(fixtures_dir): + """Return path to RdmaPlugin config file.""" + return fixtures_dir / "rdma_plugin_config.json" + + +def test_rdma_plugin_with_basic_config(run_cli_command, rdma_config_file, tmp_path): + """Test RdmaPlugin using basic config file.""" + assert rdma_config_file.exists(), f"Config file not found: {rdma_config_file}" + + log_path = str(tmp_path / "logs_rdma_basic") + result = run_cli_command( + ["--log-path", log_path, "--plugin-configs", str(rdma_config_file)], check=False + ) + + assert result.returncode in [0, 1, 2] + output = result.stdout + result.stderr + assert len(output) > 0 + assert "rdmaplugin" in output.lower() or "rdma" in output.lower() + + +def test_rdma_plugin_with_run_plugins_subcommand(run_cli_command, tmp_path): + """Test RdmaPlugin using run-plugins subcommand.""" + log_path = str(tmp_path / "logs_rdma_subcommand") + result = run_cli_command(["--log-path", log_path, "run-plugins", "RdmaPlugin"], check=False) + + assert result.returncode in [0, 1, 2] + output = result.stdout + result.stderr + assert len(output) > 0 + + +def test_rdma_plugin_with_passive_interaction(run_cli_command, rdma_config_file, tmp_path): + """Test RdmaPlugin with PASSIVE system interaction level.""" + log_path = str(tmp_path / "logs_rdma_passive") + result = run_cli_command( + [ + "--log-path", + log_path, + "--sys-interaction-level", + "PASSIVE", + "--plugin-configs", + str(rdma_config_file), + ], + check=False, + ) + + assert result.returncode in [0, 1, 2] + output = result.stdout + result.stderr + assert len(output) > 0 + + +def test_rdma_plugin_skip_sudo(run_cli_command, rdma_config_file, tmp_path): + 
"""Test RdmaPlugin with --skip-sudo flag.""" + log_path = str(tmp_path / "logs_rdma_no_sudo") + result = run_cli_command( + [ + "--log-path", + log_path, + "--skip-sudo", + "--plugin-configs", + str(rdma_config_file), + ], + check=False, + ) + + assert result.returncode in [0, 1, 2] + output = result.stdout + result.stderr + assert len(output) > 0 diff --git a/test/functional/test_redfish_endpoint_plugin.py b/test/functional/test_redfish_endpoint_plugin.py new file mode 100644 index 00000000..2a25043e --- /dev/null +++ b/test/functional/test_redfish_endpoint_plugin.py @@ -0,0 +1,114 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from pathlib import Path + +import pytest + + +@pytest.fixture +def fixtures_dir(): + """Return path to functional test fixtures directory.""" + return Path(__file__).parent / "fixtures" + + +@pytest.fixture +def redfish_plugin_config(fixtures_dir): + """Path to RedfishEndpointPlugin config (URIs + checks).""" + return fixtures_dir / "redfish_endpoint_plugin_config.json" + + +@pytest.fixture +def redfish_connection_config(fixtures_dir): + """Path to Redfish connection config (RedfishConnectionManager).""" + return fixtures_dir / "redfish_connection_config.json" + + +def test_redfish_endpoint_plugin_with_config_and_connection( + run_cli_command, redfish_plugin_config, redfish_connection_config, tmp_path +): + assert redfish_plugin_config.exists(), f"Config not found: {redfish_plugin_config}" + assert redfish_connection_config.exists(), f"Config not found: {redfish_connection_config}" + + log_path = str(tmp_path / "logs_redfish") + result = run_cli_command( + [ + "--log-path", + log_path, + "--connection-config", + str(redfish_connection_config), + "--plugin-configs=" + str(redfish_plugin_config), + "run-plugins", + "RedfishEndpointPlugin", + ], + check=False, + ) + + output = result.stdout + result.stderr + assert "RedfishEndpointPlugin" in output or "Redfish" in output + + +def test_redfish_endpoint_plugin_plugin_config_only( + run_cli_command, redfish_plugin_config, tmp_path +): + assert redfish_plugin_config.exists() + + log_path = str(tmp_path / "logs_redfish_noconn") + result = run_cli_command( + [ + "--log-path", + log_path, + "--plugin-configs=" + str(redfish_plugin_config), + "run-plugins", + "RedfishEndpointPlugin", + ], + check=False, + ) + + output = result.stdout + result.stderr + assert "RedfishEndpointPlugin" in output or "Redfish" in output + + +def test_redfish_endpoint_plugin_default_subcommand( + run_cli_command, redfish_plugin_config, redfish_connection_config, 
tmp_path +): + assert redfish_plugin_config.exists() + assert redfish_connection_config.exists() + + log_path = str(tmp_path / "logs_redfish_default") + result = run_cli_command( + [ + "--log-path", + log_path, + "--connection-config", + str(redfish_connection_config), + "--plugin-configs=" + str(redfish_plugin_config), + "RedfishEndpointPlugin", + ], + check=False, + ) + + output = result.stdout + result.stderr + assert "RedfishEndpointPlugin" in output or "Redfish" in output diff --git a/test/functional/test_reference_config_workflow.py b/test/functional/test_reference_config_workflow.py index 44362149..7b929d38 100644 --- a/test/functional/test_reference_config_workflow.py +++ b/test/functional/test_reference_config_workflow.py @@ -124,6 +124,8 @@ def test_gen_reference_config_subset_plugins(run_cli_command, tmp_path): assert result.returncode in [0, 1, 2] + if result.returncode != 0: + pytest.skip("One or more plugins failed; reference config is not written") reference_config_path = find_reference_config(log_path) assert reference_config_path is not None, "reference_config.json was not created" assert reference_config_path.exists() @@ -148,7 +150,8 @@ def test_use_generated_reference_config(run_cli_command, tmp_path): assert gen_result.returncode in [0, 1, 2] reference_config_path = find_reference_config(gen_log_path) - assert reference_config_path is not None, "reference_config.json was not created" + if reference_config_path is None: + pytest.skip("reference_config.json was not created - one or more plugins failed") assert reference_config_path.exists() use_result = run_cli_command( @@ -238,6 +241,7 @@ def test_reference_config_with_analysis_args(run_cli_command, tmp_path): "OsPlugin", "PackagePlugin", "ProcessPlugin", + "RdmaPlugin", "RocmPlugin", "SysctlPlugin", ] @@ -268,7 +272,11 @@ def test_reference_config_with_analysis_args(run_cli_command, tmp_path): def test_reference_config_structure(run_cli_command, tmp_path): - """Test that generated reference 
config has correct structure.""" + """Test that reference config is created and has correct structure when no plugin fails. + + Uses OsPlugin only (likely to succeed in any environment). Requires returncode 0 + so we actually assert the success path: reference config is written. + """ log_path = str(tmp_path / "logs_structure") result = run_cli_command( @@ -276,7 +284,10 @@ def test_reference_config_structure(run_cli_command, tmp_path): check=False, ) - assert result.returncode in [0, 1, 2] + assert result.returncode == 0, ( + f"OsPlugin must succeed for this test (reference config only written when no plugin fails). " + f"returncode={result.returncode}, stderr={result.stderr[:500]!r}" + ) reference_config_path = find_reference_config(log_path) assert reference_config_path is not None, "reference_config.json was not created" @@ -305,6 +316,8 @@ def test_gen_reference_config_without_run_plugins(run_cli_command, tmp_path): assert result.returncode in [0, 1, 2] + if result.returncode != 0: + pytest.skip("One or more plugins failed; reference config is not written") reference_config_path = find_reference_config(log_path) assert reference_config_path is not None, "reference_config.json was not created" assert reference_config_path.exists() @@ -332,6 +345,8 @@ def test_reference_config_json_valid(run_cli_command, tmp_path): assert result.returncode in [0, 1, 2] + if result.returncode != 0: + pytest.skip("One or more plugins failed; reference config is not written") reference_config_path = find_reference_config(log_path) assert reference_config_path is not None, "reference_config.json was not created" assert reference_config_path.exists() diff --git a/test/functional/test_run_plugins.py b/test/functional/test_run_plugins.py index c7f6c662..e9381471 100644 --- a/test/functional/test_run_plugins.py +++ b/test/functional/test_run_plugins.py @@ -26,6 +26,7 @@ """Functional tests for running individual plugins.""" import csv +import json from pathlib import Path import pytest 
@@ -62,6 +63,7 @@ def test_plugin_registry_has_plugins(all_plugins): "OsPlugin", "PackagePlugin", "ProcessPlugin", + "RdmaPlugin", "RocmPlugin", "StoragePlugin", "SysctlPlugin", @@ -175,3 +177,56 @@ def test_run_plugin_with_data_file_no_collection(run_cli_command, tmp_path): f"Bug regression: DmesgPlugin status is NOT_RAN with --data file. " f"Analysis should have run on provided data. Status: {status}" ) + + +def test_rocm_plugin_with_custom_rocm_path_collection_args(run_cli_command, tmp_path): + """Run RocmPlugin with collection_args.rocm_path overriding default /opt/rocm. + + Creates a minimal ROCm-like tree under tmp_path, points the collector at it via + collection_args.rocm_path, and asserts the collected version matches. + """ + custom_version = "5.0.0-999" + rocm_root = tmp_path / "custom_rocm" + info_dir = rocm_root / ".info" + info_dir.mkdir(parents=True) + (info_dir / "version-rocm").write_text(custom_version + "\n") + (info_dir / "version").write_text(custom_version + "\n") + + config = { + "name": "RocmPlugin custom rocm_path", + "desc": "RocmPlugin with collection_args.rocm_path override", + "global_args": {}, + "plugins": { + "RocmPlugin": { + "collection_args": {"rocm_path": str(rocm_root)}, + "analysis_args": {}, + } + }, + "result_collators": {}, + } + config_file = tmp_path / "rocm_custom_path_config.json" + config_file.write_text(json.dumps(config, indent=2)) + + log_path = str(tmp_path / "rocm_custom_logs") + result = run_cli_command( + [ + "--log-path", + log_path, + "--plugin-configs=" + str(config_file), + "run-plugins", + "RocmPlugin", + ], + check=False, + ) + + output = result.stdout + result.stderr + assert "RocmPlugin" in output + log_dir = Path(log_path) + csv_files = list(log_dir.glob("**/nodescraper.csv")) + if csv_files: + with open(csv_files[0], "r", encoding="utf-8") as f: + reader = csv.DictReader(f) + rows = [r for r in reader if r.get("plugin") == "RocmPlugin"] + assert len(rows) >= 1, f"RocmPlugin should appear in CSV under 
{log_path}" + assert rows[0].get("status") != "NOT_RAN" + assert rows[0].get("message") diff --git a/test/unit/conftest.py b/test/unit/conftest.py index f44899c0..76c1fe26 100644 --- a/test/unit/conftest.py +++ b/test/unit/conftest.py @@ -50,6 +50,11 @@ def conn_mock(): return MagicMock() +@pytest.fixture +def redfish_conn_mock(): + return MagicMock() + + class DummyDataModel(DataModel): foo: int diff --git a/test/unit/connection/redfish/test_redfish_oem_diag.py b/test/unit/connection/redfish/test_redfish_oem_diag.py new file mode 100644 index 00000000..727c1b69 --- /dev/null +++ b/test/unit/connection/redfish/test_redfish_oem_diag.py @@ -0,0 +1,186 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +import logging +from unittest.mock import MagicMock + +from requests.status_codes import codes + +from nodescraper.connection.redfish import RedfishConnectionError +from nodescraper.connection.redfish.redfish_oem_diag import ( + DEFAULT_TASK_TIMEOUT_S, + RF_ANNOTATION_ALLOWABLE, + _download_log_and_save, + _get_task_monitor_uri, + _strip_port_from_url, + get_oem_diagnostic_allowable_values, +) + + +def test_rf_annotation_allowable_constant(): + assert RF_ANNOTATION_ALLOWABLE == "OEMDiagnosticDataType@Redfish.AllowableValues" + + +def test_default_task_timeout_s(): + assert DEFAULT_TASK_TIMEOUT_S == 1800 + + +class TestGetOemDiagnosticAllowableValues: + def test_returns_list_from_collect_action(self): + conn = MagicMock() + conn.get.return_value = { + "Actions": { + "LogService.CollectDiagnosticData": { + "OEMDiagnosticDataType@Redfish.AllowableValues": ["Dmesg", "AllLogs"], + } + } + } + result = get_oem_diagnostic_allowable_values( + conn, "redfish/v1/Systems/1/LogServices/DiagLogs" + ) + assert result == ["Dmesg", "AllLogs"] + + def test_returns_list_from_octothorpe_action_key(self): + conn = MagicMock() + conn.get.return_value = { + "Actions": { + "#LogService.CollectDiagnosticData": { + "OEMDiagnosticDataType@Redfish.AllowableValues": ["JournalControl"], + } + } + } + result = get_oem_diagnostic_allowable_values( + conn, "redfish/v1/Systems/UBB/LogServices/DiagLogs" + ) + assert result == ["JournalControl"] + + def test_returns_none_on_connection_error(self): + conn = MagicMock() + conn.get.side_effect = RedfishConnectionError("fail") + result = get_oem_diagnostic_allowable_values(conn, "redfish/v1/LogServices/DiagLogs") + assert result is None + + def test_returns_none_when_data_not_dict(self): + conn = MagicMock() + conn.get.return_value = [] + result = get_oem_diagnostic_allowable_values(conn, "redfish/v1/LogServices/DiagLogs") + assert result is None + + def 
test_returns_none_when_no_actions(self): + conn = MagicMock() + conn.get.return_value = {} + result = get_oem_diagnostic_allowable_values(conn, "redfish/v1/LogServices/DiagLogs") + assert result is None + + +class TestStripPortFromUrl: + def test_strips_port_443(self): + url = "https://host:443/redfish/v1/TaskService/Tasks/1" + assert _strip_port_from_url(url) == "https://host/redfish/v1/TaskService/Tasks/1" + + def test_strips_other_port(self): + url = "https://host:8443/redfish/v1" + assert _strip_port_from_url(url) == "https://host/redfish/v1" + + def test_returns_none_when_no_port(self): + url = "https://host/redfish/v1" + assert _strip_port_from_url(url) is None + + def test_returns_none_for_relative_path(self): + assert _strip_port_from_url("redfish/v1/Systems/1") is None + + +class TestGetTaskMonitorUri: + def test_returns_task_monitor_from_body(self): + conn = MagicMock() + conn.base_url = "https://host/redfish/v1" + body = {"TaskMonitor": "TaskService/Tasks/1/Monitor"} + result = _get_task_monitor_uri(body, conn) + assert result == "https://host/redfish/v1/TaskService/Tasks/1/Monitor" + + def test_returns_from_odata_id_plus_monitor(self): + conn = MagicMock() + conn.base_url = "https://host/redfish/v1" + body = {"@odata.id": "TaskService/Tasks/1"} + result = _get_task_monitor_uri(body, conn) + assert result == "https://host/redfish/v1/TaskService/Tasks/1/Monitor" + + def test_returns_none_for_empty_body(self): + conn = MagicMock() + assert _get_task_monitor_uri({}, conn) is None + + def test_prefers_task_monitor_over_odata_id(self): + conn = MagicMock() + conn.base_url = "https://host/redfish/v1" + body = { + "TaskMonitor": "TaskService/Tasks/1/Monitor", + "@odata.id": "TaskService/Tasks/2", + } + result = _get_task_monitor_uri(body, conn) + assert result == "https://host/redfish/v1/TaskService/Tasks/1/Monitor" + + +class TestDownloadLogAndSave: + def test_returns_none_when_no_additional_data_uri(self): + conn = MagicMock() + log_entry_json = {"Id": "1", 
"Name": "LogEntry"} + result = _download_log_and_save( + conn, log_entry_json, "Dmesg", None, logging.getLogger("test") + ) + assert result is None + conn.get_response.assert_not_called() + + def test_downloads_and_returns_bytes_when_additional_data_uri_present(self): + conn = MagicMock() + conn.base_url = "https://host/redfish/v1" + resp = MagicMock() + resp.status_code = codes.ok + resp.content = b"log bytes" + conn.get_response.return_value = resp + log_entry_json = {"AdditionalDataURI": "/redfish/v1/LogServices/1/Entries/1/Attachment"} + result = _download_log_and_save( + conn, log_entry_json, "Dmesg", None, logging.getLogger("test") + ) + assert result == b"log bytes" + conn.get_response.assert_called_once() + + def test_writes_archive_and_metadata_to_output_dir(self, tmp_path): + conn = MagicMock() + conn.base_url = "https://host/redfish/v1" + resp = MagicMock() + resp.status_code = codes.ok + resp.content = b"log bytes" + conn.get_response.return_value = resp + log_entry_json = { + "AdditionalDataURI": "/redfish/v1/LogServices/1/Entries/1/Attachment", + "Id": "1", + } + result = _download_log_and_save( + conn, log_entry_json, "AllLogs", tmp_path, logging.getLogger("test") + ) + assert result == b"log bytes" + assert (tmp_path / "AllLogs.tar.xz").read_bytes() == b"log bytes" + metadata = (tmp_path / "AllLogs_log_entry.json").read_text(encoding="utf-8") + assert "Id" in metadata and "1" in metadata diff --git a/test/unit/framework/test_plugin_executor.py b/test/unit/framework/test_plugin_executor.py index cdeb4e50..a5121398 100644 --- a/test/unit/framework/test_plugin_executor.py +++ b/test/unit/framework/test_plugin_executor.py @@ -87,7 +87,8 @@ def plugin_registry(): PluginConfig(plugins={"Plugin1": {"arg1": "val1", "argA": "valA"}}), PluginConfig(plugins={"Plugin1": {"arg1": "val2"}}), ], - PluginConfig(plugins={"Plugin1": {"arg1": "val2"}}), + # Deep merge: later config's keys override, existing keys preserved. 
+ PluginConfig(plugins={"Plugin1": {"arg1": "val2", "argA": "valA"}}), ), ( [ diff --git a/test/unit/plugin/fixtures/rdma_link_example_data.json b/test/unit/plugin/fixtures/rdma_link_example_data.json new file mode 100644 index 00000000..6c228a81 --- /dev/null +++ b/test/unit/plugin/fixtures/rdma_link_example_data.json @@ -0,0 +1,38 @@ +[ + { + "ifindex": 0, + "ifname": "ionic_0", + "port": 1, + "state": "ACTIVE", + "physical_state": "LINK_UP", + "netdev": "benic8p1", + "netdev_index": 3 + }, + { + "ifindex": 1, + "ifname": "ionic_1", + "port": 1, + "state": "ACTIVE", + "physical_state": "LINK_UP", + "netdev": "benic7p1", + "netdev_index": 6 + }, + { + "ifindex": 2, + "ifname": "ionic_2", + "port": 1, + "state": "ACTIVE", + "physical_state": "LINK_UP", + "netdev": "benic5p1", + "netdev_index": 8 + }, + { + "ifindex": 3, + "ifname": "ionic_3", + "port": 1, + "state": "ACTIVE", + "physical_state": "LINK_UP", + "netdev": "benic6p1", + "netdev_index": 9 + } +] diff --git a/test/unit/plugin/fixtures/rdma_statistic_example_data.json b/test/unit/plugin/fixtures/rdma_statistic_example_data.json new file mode 100644 index 00000000..e338e41a --- /dev/null +++ b/test/unit/plugin/fixtures/rdma_statistic_example_data.json @@ -0,0 +1,826 @@ +[ + { + "ifname": "bnxt_re0", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + "active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 12, + "watermark_ahs": 8, + "watermark_qps": 229, + "watermark_rc_qps": 220, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 94, + "watermark_mrs": 305, + "watermark_mws": 0, + "rx_pkts": 3504998440, + "rx_bytes": 2966950848, + "tx_pkts": 2747190987, + "tx_bytes": 912073550, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + "rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + 
"rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + "res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + "res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 3324056122, + "tx_read_resp": 3324056122, + "tx_write_req": 622240024, + "tx_send_req": 97500, + "rx_atomic_requests": 0, + "rx_read_requests": 3324056122, + "rx_read_resp": 3324056122, + "rx_write_requests": 626374468, + "rx_send_req": 97500, + "rx_good_pkts": 1401322762, + "rx_good_bytes": 2966950848, + "out_of_buffer": 0, + "np_cnp_sent": 2873487760, + "rp_cnp_handled": 2103675678, + "np_ecn_marked_roce_packets": 2873487760, + "out_of_sequence": 0, + "pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + "roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + }, + { + "ifname": "bnxt_re1", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + 
"active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 14, + "watermark_ahs": 3, + "watermark_qps": 228, + "watermark_rc_qps": 219, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 94, + "watermark_mrs": 287, + "watermark_mws": 0, + "rx_pkts": 1509751895, + "rx_bytes": 3099873130, + "tx_pkts": 692925073, + "tx_bytes": 2068663286, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + "rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + "rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + "res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + "res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 3322387232, + "tx_read_resp": 3322387232, + "tx_write_req": 620621144, + "tx_send_req": 0, + "rx_atomic_requests": 0, + "rx_read_requests": 3322387232, + "rx_read_resp": 3322387232, + "rx_write_requests": 621181433, + "rx_send_req": 0, + "rx_good_pkts": 3507768689, + "rx_good_bytes": 3099873130, + "out_of_buffer": 0, + "np_cnp_sent": 1097578610, + "rp_cnp_handled": 2296950502, + "np_ecn_marked_roce_packets": 1097578610, + "out_of_sequence": 0, + 
"pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + "roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + }, + { + "ifname": "bnxt_re2", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + "active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 13, + "watermark_ahs": 4, + "watermark_qps": 230, + "watermark_rc_qps": 221, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 95, + "watermark_mrs": 294, + "watermark_mws": 0, + "rx_pkts": 2328181128, + "rx_bytes": 79750872, + "tx_pkts": 1404869338, + "tx_bytes": 644434628, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + "rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + "rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + "res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + 
"res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 3212760135, + "tx_read_resp": 3212760135, + "tx_write_req": 1995861174, + "tx_send_req": 0, + "rx_atomic_requests": 0, + "rx_read_requests": 3212760135, + "rx_read_resp": 3212760135, + "rx_write_requests": 1995579948, + "rx_send_req": 0, + "rx_good_pkts": 4025638368, + "rx_good_bytes": 79750872, + "out_of_buffer": 0, + "np_cnp_sent": 4174752904, + "rp_cnp_handled": 2597510056, + "np_ecn_marked_roce_packets": 4174752904, + "out_of_sequence": 0, + "pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + "roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + }, + { + "ifname": "bnxt_re3", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + "active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 12, + "watermark_ahs": 7, + "watermark_qps": 229, + "watermark_rc_qps": 220, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 95, + "watermark_mrs": 292, + "watermark_mws": 0, + "rx_pkts": 3888070733, + "rx_bytes": 3748987850, + "tx_pkts": 2265082996, + "tx_bytes": 3715380316, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + "rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + "rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 
0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + "res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + "res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 3103369202, + "tx_read_resp": 3103369202, + "tx_write_req": 3370635080, + "tx_send_req": 0, + "rx_atomic_requests": 0, + "rx_read_requests": 3103369202, + "rx_read_resp": 3103369202, + "rx_write_requests": 3368547249, + "rx_send_req": 0, + "rx_good_pkts": 2688805201, + "rx_good_bytes": 3748987850, + "out_of_buffer": 0, + "np_cnp_sent": 134598312, + "rp_cnp_handled": 1199265532, + "np_ecn_marked_roce_packets": 134598312, + "out_of_sequence": 0, + "pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + "roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + }, + { + "ifname": "bnxt_re4", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + "active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 12, + "watermark_ahs": 6, + "watermark_qps": 230, + "watermark_rc_qps": 221, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 95, + 
"watermark_mrs": 302, + "watermark_mws": 0, + "rx_pkts": 986831570, + "rx_bytes": 1185181414, + "tx_pkts": 1975828812, + "tx_bytes": 2763928250, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + "rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + "rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + "res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + "res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 2993618119, + "tx_read_resp": 2993618119, + "tx_write_req": 449606302, + "tx_send_req": 37687, + "rx_atomic_requests": 0, + "rx_read_requests": 2993618119, + "rx_read_resp": 2993618119, + "rx_write_requests": 448485514, + "rx_send_req": 37687, + "rx_good_pkts": 2876478595, + "rx_good_bytes": 1185181414, + "out_of_buffer": 0, + "np_cnp_sent": 3525492995, + "rp_cnp_handled": 2405320271, + "np_ecn_marked_roce_packets": 3525492995, + "out_of_sequence": 0, + "pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + 
"roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + }, + { + "ifname": "bnxt_re5", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + "active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 13, + "watermark_ahs": 7, + "watermark_qps": 228, + "watermark_rc_qps": 219, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 94, + "watermark_mrs": 287, + "watermark_mws": 0, + "rx_pkts": 3602164391, + "rx_bytes": 515322372, + "tx_pkts": 3498885620, + "tx_bytes": 3601952844, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + "rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + "rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + "res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + "res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 2883798845, + "tx_read_resp": 2883798845, + "tx_write_req": 1822414941, + "tx_send_req": 0, + "rx_atomic_requests": 
0, + "rx_read_requests": 2883798845, + "rx_read_resp": 2883798845, + "rx_write_requests": 1819507161, + "rx_send_req": 0, + "rx_good_pkts": 1576292710, + "rx_good_bytes": 515322372, + "out_of_buffer": 0, + "np_cnp_sent": 4093842522, + "rp_cnp_handled": 2025871681, + "np_ecn_marked_roce_packets": 4093842522, + "out_of_sequence": 0, + "pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + "roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + }, + { + "ifname": "bnxt_re6", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + "active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 13, + "watermark_ahs": 7, + "watermark_qps": 230, + "watermark_rc_qps": 221, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 95, + "watermark_mrs": 294, + "watermark_mws": 0, + "rx_pkts": 2577272275, + "rx_bytes": 2249875450, + "tx_pkts": 2452138468, + "tx_bytes": 700557582, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + "rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + "rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + 
"res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + "res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 2775090592, + "tx_read_resp": 2775090592, + "tx_write_req": 3201764210, + "tx_send_req": 0, + "rx_atomic_requests": 0, + "rx_read_requests": 2775090592, + "rx_read_resp": 2775090592, + "rx_write_requests": 3201655162, + "rx_send_req": 0, + "rx_good_pkts": 1197866395, + "rx_good_bytes": 2249875450, + "out_of_buffer": 0, + "np_cnp_sent": 2401103251, + "rp_cnp_handled": 1379405880, + "np_ecn_marked_roce_packets": 2401103251, + "out_of_sequence": 0, + "pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + "roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + }, + { + "ifname": "bnxt_re7", + "port": 1, + "active_pds": 1, + "active_ahs": 0, + "active_qps": 1, + "active_rc_qps": 0, + "active_ud_qps": 0, + "active_srqs": 0, + "active_cqs": 1, + "active_mrs": 0, + "active_mws": 0, + "watermark_pds": 13, + "watermark_ahs": 6, + "watermark_qps": 228, + "watermark_rc_qps": 219, + "watermark_ud_qps": 8, + "watermark_srqs": 8, + "watermark_cqs": 94, + "watermark_mrs": 287, + "watermark_mws": 0, + "rx_pkts": 1606921676, + "rx_bytes": 4007942950, + "tx_pkts": 1249198409, + "tx_bytes": 25134278, + "recoverable_errors": 0, + "tx_roce_errors": 0, + "tx_roce_discards": 0, + "rx_roce_errors": 0, + 
"rx_roce_discards": 0, + "local_ack_timeout_err": 0, + "packet_seq_err": 0, + "max_retry_exceeded": 0, + "rnr_nak_retry_err": 0, + "implied_nak_seq_err": 0, + "unrecoverable_err": 0, + "bad_resp_err": 0, + "local_qp_op_err": 0, + "local_protection_err": 0, + "mem_mgmt_op_err": 0, + "req_remote_invalid_request": 0, + "req_remote_access_errors": 0, + "remote_op_err": 0, + "duplicate_request": 0, + "res_exceed_max": 0, + "resp_local_length_error": 0, + "res_exceeds_wqe": 0, + "res_opcode_err": 0, + "res_rx_invalid_rkey": 0, + "res_rx_domain_err": 0, + "res_rx_no_perm": 0, + "res_rx_range_err": 0, + "res_tx_invalid_rkey": 0, + "res_tx_domain_err": 0, + "res_tx_no_perm": 0, + "res_tx_range_err": 0, + "res_irrq_oflow": 0, + "res_unsup_opcode": 0, + "res_unaligned_atomic": 0, + "res_rem_inv_err": 0, + "res_mem_err": 0, + "res_srq_err": 0, + "res_cmp_err": 0, + "res_invalid_dup_rkey": 0, + "res_wqe_format_err": 0, + "res_cq_load_err": 0, + "res_srq_load_err": 0, + "res_tx_pci_err": 0, + "res_rx_pci_err": 0, + "tx_atomic_req": 0, + "tx_read_req": 2665758274, + "tx_read_resp": 2665758274, + "tx_write_req": 284646587, + "tx_send_req": 0, + "rx_atomic_requests": 0, + "rx_read_requests": 2665758274, + "rx_read_resp": 2665758274, + "rx_write_requests": 284542358, + "rx_send_req": 0, + "rx_good_pkts": 253070639, + "rx_good_bytes": 4007942950, + "out_of_buffer": 0, + "np_cnp_sent": 2670842510, + "rp_cnp_handled": 1353851037, + "np_ecn_marked_roce_packets": 2670842510, + "out_of_sequence": 0, + "pacing_reschedule": 0, + "pacing_complete": 0, + "pacing_alerts": 0, + "db_fifo_register": 2147450881, + "req_cqe_error": 0, + "req_cqe_flush_error": 0, + "resp_cqe_error": 0, + "resp_cqe_flush_error": 0, + "resp_remote_access_errors": 0, + "roce_adp_retrans": 0, + "roce_adp_retrans_to": 0, + "roce_slow_restart": 0, + "roce_slow_restart_cnps": 0, + "roce_slow_restart_trans": 0, + "rp_cnp_ignored": 0, + "rx_icrc_encapsulated": 0 + } +] diff --git a/test/unit/plugin/test_amdsmi_analyzer.py 
b/test/unit/plugin/test_amdsmi_analyzer.py index af7ab0f9..6bc40330 100644 --- a/test/unit/plugin/test_amdsmi_analyzer.py +++ b/test/unit/plugin/test_amdsmi_analyzer.py @@ -24,18 +24,22 @@ # ############################################################################### +from typing import Optional + import pytest from nodescraper.enums import EventPriority from nodescraper.plugins.inband.amdsmi.amdsmi_analyzer import AmdSmiAnalyzer from nodescraper.plugins.inband.amdsmi.amdsmidata import ( AmdSmiDataModel, + AmdSmiMetric, AmdSmiStatic, - AmdSmiTstData, AmdSmiVersion, EccState, Fw, FwListItem, + MetricEccTotals, + MetricPcie, Partition, PartitionCompute, PartitionMemory, @@ -677,43 +681,6 @@ def test_check_expected_xgmi_link_speed_missing_bit_rate(mock_analyzer): assert "XGMI link speed is not available" in analyzer.result.events[0].description -def test_check_amdsmitst_success(mock_analyzer): - """Test check_amdsmitst passes when no tests failed.""" - analyzer = mock_analyzer - - tst_data = AmdSmiTstData( - passed_tests=["test1", "test2", "test3"], - skipped_tests=[], - failed_tests=[], - failed_test_count=0, - ) - - analyzer.check_amdsmitst(tst_data) - - assert len(analyzer.result.events) == 0 - - -def test_check_amdsmitst_failures(mock_analyzer): - """Test check_amdsmitst logs error when tests failed.""" - analyzer = mock_analyzer - - tst_data = AmdSmiTstData( - passed_tests=["test1", "test2"], - skipped_tests=["test3"], - failed_tests=["test4", "test5"], - failed_test_count=2, - ) - - analyzer.check_amdsmitst(tst_data) - - assert len(analyzer.result.events) == 1 - assert analyzer.result.events[0].category == "APPLICATION" - assert analyzer.result.events[0].priority == EventPriority.ERROR - assert "2 failed tests running amdsmitst" in analyzer.result.events[0].description - assert analyzer.result.events[0].data["failed_test_count"] == 2 - assert analyzer.result.events[0].data["failed_tests"] == ["test4", "test5"] - - def 
test_analyze_data_full_workflow(mock_analyzer): """Test full analyze_data workflow with various checks.""" analyzer = mock_analyzer @@ -764,12 +731,6 @@ def test_analyze_data_full_workflow(mock_analyzer): ), ), ], - amdsmitst_data=AmdSmiTstData( - passed_tests=["test1", "test2"], - skipped_tests=[], - failed_tests=[], - failed_test_count=0, - ), ) args = AmdSmiAnalyzerArgs( @@ -801,3 +762,246 @@ def test_analyze_data_no_static_data(mock_analyzer): assert len(result.events) >= 1 assert any("No AMD SMI static data available" in event.description for event in result.events) + + +# All required keys for MetricPcie / MetricEccTotals (no defaults in model -> must be present). +_PCIE_KEYS = [ + "width", + "speed", + "bandwidth", + "replay_count", + "l0_to_recovery_count", + "replay_roll_over_count", + "nak_sent_count", + "nak_received_count", + "current_bandwidth_sent", + "current_bandwidth_received", + "max_packet_size", + "lc_perf_other_end_recovery", +] +_ECC_TOTALS_KEYS = [ + "total_correctable_count", + "total_uncorrectable_count", + "total_deferred_count", + "cache_correctable_count", + "cache_uncorrectable_count", +] + + +def _pcie_dict(**overrides): + """Full PCIe dict for model_validate; overrides merged on top.""" + base = {k: None for k in _PCIE_KEYS} + for k, v in overrides.items(): + if v is not None and isinstance(v, ValueUnit): + base[k] = {"value": v.value, "unit": v.unit} + else: + base[k] = v + return MetricPcie.model_validate(base) + + +def _ecc_totals_dict(**overrides): + """Full ECC totals dict for model_validate; overrides merged on top.""" + base = {k: None for k in _ECC_TOTALS_KEYS} + base.update(overrides) + return MetricEccTotals.model_validate(base) + + +def _minimal_amdsmi_metric( + gpu: int = 0, + pcie: Optional[MetricPcie] = None, + ecc: Optional[MetricEccTotals] = None, + ecc_blocks: Optional[dict] = None, +) -> AmdSmiMetric: + """Build minimal AmdSmiMetric for PCIe/ECC tests with all required fields present.""" + pcie_dict = 
pcie.model_dump() if pcie is not None else {k: None for k in _PCIE_KEYS} + ecc_dict = ecc.model_dump() if ecc is not None else {k: None for k in _ECC_TOTALS_KEYS} + return AmdSmiMetric.model_validate( + { + "gpu": gpu, + "usage": { + "gfx_activity": None, + "umc_activity": None, + "mm_activity": None, + "vcn_activity": [], + "jpeg_activity": [], + "gfx_busy_inst": None, + "jpeg_busy": None, + "vcn_busy": None, + }, + "power": { + "socket_power": None, + "gfx_voltage": None, + "soc_voltage": None, + "mem_voltage": None, + "throttle_status": None, + "power_management": None, + }, + "clock": {}, + "temperature": {"edge": None, "hotspot": None, "mem": None}, + "pcie": pcie_dict, + "ecc": ecc_dict, + "ecc_blocks": ecc_blocks if ecc_blocks is not None else {}, + "fan": {"speed": None, "max": None, "rpm": None, "usage": None}, + "voltage_curve": None, + "perf_level": None, + "xgmi_err": None, + "energy": None, + "mem_usage": { + "total_vram": None, + "used_vram": None, + "free_vram": None, + "total_visible_vram": None, + "used_visible_vram": None, + "free_visible_vram": None, + "total_gtt": None, + "used_gtt": None, + "free_gtt": None, + }, + "throttle": {}, + } + ) + + +def test_check_amdsmi_metric_pcie_width_fail(mock_analyzer): + """PCIe width not x16 generates error.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=8) + metrics = [_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 2) + assert len(analyzer.result.events) == 1 + assert "PCIe width is not x16" in analyzer.result.events[0].description + assert analyzer.result.events[0].data.get("pcie_width") == 8 + assert analyzer.result.events[0].priority == EventPriority.ERROR + assert analyzer.result.events[0].category == "IO" + + +def test_check_amdsmi_metric_pcie_speed_fail(mock_analyzer): + """PCIe speed not Gen5 (32 GT/s) generates error.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=16, speed=ValueUnit(value=16, unit="GT/s")) + metrics = 
[_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 2) + assert len(analyzer.result.events) == 1 + assert "PCIe link speed is not Gen5" in analyzer.result.events[0].description + assert analyzer.result.events[0].data.get("pcie_speed") == 16 + assert analyzer.result.events[0].priority == EventPriority.ERROR + + +def test_check_amdsmi_metric_pcie_l0_warning(mock_analyzer): + """L0 recovery count above warning threshold generates warning.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=16, l0_to_recovery_count=2) + metrics = [_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 1) + assert len(analyzer.result.events) == 1 + assert "L0 recoveries" in analyzer.result.events[0].description + assert analyzer.result.events[0].data.get("l0_to_recovery_count") == 2 + assert analyzer.result.events[0].priority == EventPriority.WARNING + + +def test_check_amdsmi_metric_pcie_l0_error(mock_analyzer): + """L0 recovery count above error threshold generates error.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=16, l0_to_recovery_count=10) + metrics = [_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 2) + assert len(analyzer.result.events) == 1 + assert "L0 recoveries" in analyzer.result.events[0].description + assert analyzer.result.events[0].data.get("l0_to_recovery_count") == 10 + assert analyzer.result.events[0].priority == EventPriority.ERROR + + +def test_check_amdsmi_metric_pcie_replay_count_warning(mock_analyzer): + """PCIe replay count > 0 generates warning.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=16, replay_count=10) + metrics = [_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 2) + assert len(analyzer.result.events) == 1 + assert "replay count" in analyzer.result.events[0].description + assert analyzer.result.events[0].data.get("replay_count") == 10 + assert analyzer.result.events[0].priority == 
EventPriority.WARNING + + +def test_check_amdsmi_metric_pcie_nak_sent_warning(mock_analyzer): + """PCIe NAK sent count > 0 generates warning.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=16, nak_sent_count=1) + metrics = [_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 2) + assert len(analyzer.result.events) == 1 + assert ( + "NAKs" in analyzer.result.events[0].description + and "sent" in analyzer.result.events[0].description + ) + assert analyzer.result.events[0].data.get("nak_sent_count") == 1 + assert analyzer.result.events[0].priority == EventPriority.WARNING + + +def test_check_amdsmi_metric_pcie_nak_received_warning(mock_analyzer): + """PCIe NAK received count > 0 generates warning.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=16, nak_received_count=1) + metrics = [_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 2) + assert len(analyzer.result.events) == 1 + assert ( + "NAKs" in analyzer.result.events[0].description + and "received" in analyzer.result.events[0].description + ) + assert analyzer.result.events[0].priority == EventPriority.WARNING + + +def test_check_amdsmi_metric_pcie_pass(mock_analyzer): + """PCIe metrics all OK generates no events.""" + analyzer = mock_analyzer + pcie = _pcie_dict(width=16, speed=ValueUnit(value=32, unit="GT/s")) + metrics = [_minimal_amdsmi_metric(0, pcie=pcie)] + analyzer.check_amdsmi_metric_pcie(metrics, 4, 2) + assert len(analyzer.result.events) == 0 + + +def test_check_amdsmi_metric_ecc_totals(mock_analyzer): + """ECC totals generate expected events.""" + analyzer = mock_analyzer + metrics = [ + _minimal_amdsmi_metric( + 0, ecc=_ecc_totals_dict(total_correctable_count=1, total_uncorrectable_count=0) + ), + _minimal_amdsmi_metric(1, ecc=_ecc_totals_dict(total_uncorrectable_count=1)), + _minimal_amdsmi_metric(2, ecc=_ecc_totals_dict(total_deferred_count=1)), + _minimal_amdsmi_metric(3, 
ecc=_ecc_totals_dict(cache_correctable_count=1)), + _minimal_amdsmi_metric(4, ecc=_ecc_totals_dict(cache_uncorrectable_count=1)), + ] + analyzer.check_amdsmi_metric_ecc_totals(metrics) + assert len(analyzer.result.events) == 5 + # Analyzer uses generic description "GPU ECC error count detected"; type is in data["error_type"] + for e in analyzer.result.events: + assert e.description == "GPU ECC error count detected" + assert "error_type" in e.data and "error_count" in e.data + error_types = [e.data["error_type"] for e in analyzer.result.events] + assert "Total correctable ECC errors" in error_types + assert "Total uncorrectable ECC errors" in error_types + assert "Total deferred ECC errors" in error_types + assert "Cache correctable ECC errors" in error_types + assert "Cache uncorrectable ECC errors" in error_types + + +def test_check_amdsmi_metric_ecc_blocks(mock_analyzer): + """ECC block-level correctable/uncorrectable/deferred generate events.""" + analyzer = mock_analyzer + ecc_blocks = { + "SDMA": {"correctable_count": 0, "uncorrectable_count": 3, "deferred_count": 0}, + "GFX": {"correctable_count": 2, "uncorrectable_count": 0, "deferred_count": 0}, + "MMHUB": {"correctable_count": 0, "uncorrectable_count": 10, "deferred_count": 1}, + "HDP": {"correctable_count": 8, "uncorrectable_count": 5, "deferred_count": 0}, + } + metrics = [_minimal_amdsmi_metric(0, ecc_blocks=ecc_blocks)] + analyzer.check_amdsmi_metric_ecc(metrics) + events = analyzer.result.events + assert len(events) >= 6 + desc = [e.description for e in events] + assert any("SDMA" in d and "uncorrectable" in d for d in desc) + assert any("GFX" in d and "correctable" in d for d in desc) + assert any("MMHUB" in d for d in desc) + assert any("HDP" in d for d in desc) diff --git a/test/unit/plugin/test_amdsmi_collector.py b/test/unit/plugin/test_amdsmi_collector.py index d6583bae..74b56266 100644 --- a/test/unit/plugin/test_amdsmi_collector.py +++ b/test/unit/plugin/test_amdsmi_collector.py @@ -33,7 +33,9 
@@ from nodescraper.enums.systeminteraction import SystemInteractionLevel from nodescraper.plugins.inband.amdsmi.amdsmi_collector import AmdSmiCollector -from nodescraper.plugins.inband.amdsmi.amdsmidata import AmdSmiDataModel +from nodescraper.plugins.inband.amdsmi.amdsmidata import ( + AmdSmiDataModel, +) from nodescraper.plugins.inband.amdsmi.collector_args import AmdSmiCollectorArgs @@ -410,6 +412,85 @@ def test_cache_properties_parsing(collector): assert {"PropertyA", "PropertyB", "PropertyC"}.issubset(set(item.cache_properties)) +def test_static_data_without_vbios_defaults_to_none(conn_mock, system_info, monkeypatch): + """When static JSON has no vbios block, get_static() yields AmdSmiStatic with vbios=None""" + + static_payload = { + "gpu_data": [ + { + "gpu": 0, + "asic": { + "market_name": "SomeGPU", + "vendor_id": "1002", + "vendor_name": "AMD", + "subvendor_id": "1ABC", + "device_id": "0x1234", + "subsystem_id": "0x5678", + "rev_id": "A1", + "asic_serial": "ASERIAL", + "oam_id": 0, + "num_compute_units": 224, + "target_graphics_version": "GFX940", + }, + "board": { + "model_number": "Board-42", + "product_serial": "SN0001", + "fru_id": "FRU-1", + "product_name": "ExampleBoard", + "manufacturer_name": "ACME", + }, + "bus": { + "bdf": "0000:0b:00.0", + "max_pcie_width": 16, + "max_pcie_speed": 16.0, + "pcie_interface_version": "PCIe 5.0", + "slot_type": "PCIe", + }, + "driver": {"driver_name": "amdgpu", "driver_version": "6.1.0"}, + "numa": {"node": 3, "affinity": 0}, + "vram": { + "vram_type": "HBM3", + "vram_vendor": "Micron", + "vram_bit_width": 4096, + "vram_size_mb": 65536, + }, + "cache": { + "cache": [ + { + "cache_level": 1, + "max_num_cu_shared": 8, + "num_cache_instance": 32, + "cache_size": 262144, + "cache_properties": "PropertyA; PropertyB; PropertyC", + } + ] + }, + "clock": {"frequency": [500, 1500, 2000], "current": 1}, + "soc_pstate": {}, + "xgmi_plpd": {}, + } + ] + } + + def mock_run_sut_cmd(cmd: str) -> MagicMock: + if "which amd-smi" 
in cmd: + return make_cmd_result("/usr/bin/amd-smi") + if "static -g all --json" in cmd: + return make_cmd_result(make_json_response(static_payload)) + return make_cmd_result("") + + c = AmdSmiCollector( + system_info=system_info, + system_interaction_level=SystemInteractionLevel.PASSIVE, + connection=conn_mock, + ) + monkeypatch.setattr(c, "_run_sut_cmd", mock_run_sut_cmd) + + stat = c.get_static() + assert stat is not None and len(stat) == 1 + assert stat[0].vbios is None + + def test_json_parse_error(conn_mock, system_info, monkeypatch): """Test handling of malformed JSON""" diff --git a/test/unit/plugin/test_fabrics_collector.py b/test/unit/plugin/test_fabrics_collector.py index 884a7a88..add8bb50 100644 --- a/test/unit/plugin/test_fabrics_collector.py +++ b/test/unit/plugin/test_fabrics_collector.py @@ -24,8 +24,11 @@ # ############################################################################### +from unittest.mock import MagicMock + import pytest +from nodescraper.enums.executionstatus import ExecutionStatus from nodescraper.enums.systeminteraction import SystemInteractionLevel from nodescraper.plugins.inband.fabrics.fabrics_collector import FabricsCollector from nodescraper.plugins.inband.fabrics.fabricsdata import ( @@ -36,9 +39,6 @@ MstDevice, MstStatus, OfedInfo, - RdmaDevice, - RdmaInfo, - RdmaLink, ) @@ -122,19 +122,6 @@ def collector(system_info, conn_mock): MST_STATUS_EMPTY = "" -# rdma dev output - RoCE devices -RDMA_DEV_OUTPUT = """0: abcdef25s0: node_type ca fw 1.117.1-a-63 node_guid 1234:56ff:890f:1111 sys_image_guid 1234:56ff:890f:1111 -1: abcdef105s0: node_type ca fw 1.117.1-a-63 node_guid 2222:81ff:3333:b450 sys_image_guid 2222:81ff:3333:b450""" - -RDMA_DEV_EMPTY = "" - -# rdma link output - RoCE devices -RDMA_LINK_OUTPUT = """link rocep9s0/1 state DOWN physical_state POLLING netdev benic8p1 -link abcdef25s0/1 state DOWN physical_state POLLING netdev mock7p1 -""" - -RDMA_LINK_EMPTY = "" - def test_parse_ibstat_basic(collector): """Test 
parsing basic ibstat output""" @@ -264,60 +251,6 @@ def test_parse_mst_status_empty(collector): assert len(mst_status.devices) == 0 -def test_parse_rdma_dev_roce(collector): - """Test parsing rdma dev output with RoCE devices""" - devices = collector._parse_rdma_dev(RDMA_DEV_OUTPUT) - - assert len(devices) == 2 - - # Check first device - device1 = devices[0] - assert device1.device == "abcdef25s0" - assert device1.node_type == "ca" - assert device1.attributes["fw_version"] == "1.117.1-a-63" - assert device1.node_guid == "1234:56ff:890f:1111" - assert device1.sys_image_guid == "1234:56ff:890f:1111" - - # Check second device - device2 = devices[1] - assert device2.device == "abcdef105s0" - assert device2.node_type == "ca" - assert device2.node_guid == "2222:81ff:3333:b450" - assert device2.sys_image_guid == "2222:81ff:3333:b450" - - -def test_parse_rdma_dev_empty(collector): - """Test parsing empty rdma dev output""" - devices = collector._parse_rdma_dev(RDMA_DEV_EMPTY) - assert len(devices) == 0 - - -def test_parse_rdma_link_roce(collector): - """Test parsing rdma link output with RoCE devices""" - links = collector._parse_rdma_link(RDMA_LINK_OUTPUT) - - assert len(links) == 2 - - # Check first link - link1 = next((link for link in links if link.device == "rocep9s0"), None) - assert link1 is not None - assert link1.port == 1 - assert link1.state == "DOWN" - assert link1.physical_state == "POLLING" - assert link1.netdev == "benic8p1" - - # Check second link - link2 = next((link for link in links if link.device == "abcdef25s0"), None) - assert link2 is not None - assert link2.netdev == "mock7p1" - - -def test_parse_rdma_link_empty(collector): - """Test parsing empty rdma link output""" - links = collector._parse_rdma_link(RDMA_LINK_EMPTY) - assert len(links) == 0 - - def test_fabrics_data_model_creation(collector): """Test creating FabricsDataModel with all components""" ibstat_device = IbstatDevice( @@ -352,30 +285,12 @@ def 
test_fabrics_data_model_creation(collector): ) mst_status = MstStatus(mst_started=True, devices=[mst_device], raw_output=MST_STATUS_OUTPUT) - rdma_device = RdmaDevice( - device="abcdef25s0", - node_type="ca", - node_guid="1234:56ff:890f:1111", - attributes={"fw_version": "1.117.1-a-63"}, - ) - - rdma_link = RdmaLink( - device="abcdef25s0", - port=1, - state="DOWN", - physical_state="POLLING", - netdev="mock7p1", - ) - - rdma_info = RdmaInfo(devices=[rdma_device], links=[rdma_link], raw_output=RDMA_DEV_OUTPUT) - data = FabricsDataModel( ibstat_devices=[ibstat_device], ibv_devices=[ibv_device], ibdev_netdev_mappings=[mapping], ofed_info=ofed_info, mst_status=mst_status, - rdma_info=rdma_info, ) assert len(data.ibstat_devices) == 1 @@ -383,8 +298,6 @@ def test_fabrics_data_model_creation(collector): assert len(data.ibdev_netdev_mappings) == 1 assert data.ofed_info.version == "OFED-internal-25.11-1.2.3" assert len(data.mst_status.devices) == 1 - assert len(data.rdma_info.devices) == 1 - assert len(data.rdma_info.links) == 1 def test_fabrics_data_model_empty(collector): @@ -395,7 +308,6 @@ def test_fabrics_data_model_empty(collector): ibdev_netdev_mappings=[], ofed_info=None, mst_status=None, - rdma_info=None, ) assert len(data.ibstat_devices) == 0 @@ -403,4 +315,54 @@ def test_fabrics_data_model_empty(collector): assert len(data.ibdev_netdev_mappings) == 0 assert data.ofed_info is None assert data.mst_status is None - assert data.rdma_info is None + + +def test_collect_data_detects_slingshot_when_no_ib(collector): + """When IB is absent but Cassini is present, collect Slingshot command outputs.""" + + def run_sut_cmd_side_effect(cmd, *args, **kwargs): + responses = { + "ibstat": MagicMock(exit_code=1, stdout="", command=cmd), + "ibv_devinfo": MagicMock(exit_code=1, stdout="", command=cmd), + "ls -l /sys/class/infiniband/*/device/net": MagicMock( + exit_code=1, stdout="", command=cmd + ), + "ofed_info -s": MagicMock(exit_code=1, stdout="", command=cmd), + "mst start": 
MagicMock(exit_code=1, stdout="", command=cmd), + "mst status -v": MagicMock(exit_code=1, stdout="", command=cmd), + "lspci | grep -i cassini": MagicMock( + exit_code=0, + stdout="03:00.0 Processing accelerators: Vendor Cassini", + command=cmd, + ), + "ip link show": MagicMock(exit_code=0, stdout="1: lo: ", command=cmd), + "fi_info -p cxi": MagicMock(exit_code=0, stdout="provider: cxi", command=cmd), + "cxi_stat": MagicMock(exit_code=0, stdout="cxi stats output", command=cmd), + "lsmod | grep cxi": MagicMock(exit_code=0, stdout="cxi_core 123 0", command=cmd), + } + return responses[cmd] + + collector._run_sut_cmd = MagicMock(side_effect=run_sut_cmd_side_effect) + + result, data = collector.collect_data() + + assert result.status == ExecutionStatus.OK + assert data is not None + assert data.slingshot_data is not None + assert "Cassini" in data.slingshot_data.cassini_pci + assert data.slingshot_data.libfabric_info == "provider: cxi" + assert data.slingshot_data.cxi_stat == "cxi stats output" + + +def test_collect_data_not_ran_when_no_ib_and_no_slingshot(collector): + """Return NOT_RAN when neither IB nor Slingshot hardware is present.""" + + def run_sut_cmd_side_effect(cmd, *args, **kwargs): + return MagicMock(exit_code=1, stdout="", command=cmd) + + collector._run_sut_cmd = MagicMock(side_effect=run_sut_cmd_side_effect) + + result, data = collector.collect_data() + + assert result.status == ExecutionStatus.NOT_RAN + assert data is None diff --git a/test/unit/plugin/test_network_collector.py b/test/unit/plugin/test_network_collector.py index 2de1374d..6382adeb 100644 --- a/test/unit/plugin/test_network_collector.py +++ b/test/unit/plugin/test_network_collector.py @@ -32,20 +32,11 @@ from nodescraper.models.systeminfo import OSFamily from nodescraper.plugins.inband.network.network_collector import NetworkCollector from nodescraper.plugins.inband.network.networkdata import ( - BroadcomNicDevice, - BroadcomNicQos, EthtoolInfo, IpAddress, Neighbor, NetworkDataModel, 
     NetworkInterface,
-    PensandoNicCard,
-    PensandoNicDcqcn,
-    PensandoNicEnvironment,
-    PensandoNicPcieAts,
-    PensandoNicPort,
-    PensandoNicQos,
-    PensandoNicQosScheduling,
     Route,
     RoutingRule,
 )
@@ -317,12 +308,6 @@ def run_sut_cmd_side_effect(cmd, **kwargs):
         elif "lldpcli" in cmd or "lldpctl" in cmd:
             # LLDP commands fail (not available)
             return MagicMock(exit_code=1, stdout="", command=cmd)
-        elif "niccli" in cmd:
-            # Broadcom NIC commands fail (not available)
-            return MagicMock(exit_code=1, stdout="", command=cmd)
-        elif "nicctl" in cmd:
-            # Pensando NIC commands fail (not available)
-            return MagicMock(exit_code=1, stdout="", command=cmd)
         return MagicMock(exit_code=1, stdout="", command=cmd)

     collector._run_sut_cmd = MagicMock(side_effect=run_sut_cmd_side_effect)
@@ -336,8 +321,8 @@
     assert len(data.routes) == 3
     assert len(data.rules) == 3
     assert len(data.neighbors) == 2
-    # Since nicctl commands fail in this test, we expect the failure message
-    assert "Network data collection failed" in result.message
+    # Ethtool/LLDP are mocked to fail; collector still reports success
+    assert "Network data collected successfully" in result.message


 def test_collect_data_addr_failure(collector, conn_mock):
@@ -359,12 +344,6 @@ def run_sut_cmd_side_effect(cmd, **kwargs):
         elif "lldpcli" in cmd or "lldpctl" in cmd:
             # LLDP commands fail (not available)
             return MagicMock(exit_code=1, command=cmd)
-        elif "niccli" in cmd:
-            # Broadcom NIC commands fail (not available)
-            return MagicMock(exit_code=1, command=cmd)
-        elif "nicctl" in cmd:
-            # Pensando NIC commands fail (not available)
-            return MagicMock(exit_code=1, command=cmd)
         return MagicMock(exit_code=1, command=cmd)

     collector._run_sut_cmd = MagicMock(side_effect=run_sut_cmd_side_effect)
@@ -530,6 +509,23 @@ def test_parse_ethtool_empty_output(collector):
     assert len(ethtool_info.advertised_link_modes) == 0


+def test_parse_ethtool_statistics(collector):
+    """Test parsing ethtool -S output (statistics) for error/health analysis."""
+    output = """NIC statistics:
+     [0]: rx_ucast_packets: 162692536538787551
+     [0]: rx_errors: 0
+     [1]: rx_ucast_packets: 79657418409137764
+     rx_total_l4_csum_errors: 0
+     rx_total_buf_errors: 0"""
+    stats = collector._parse_ethtool_statistics(output, "abc1p1")
+    assert stats.get("netdev") == "abc1p1"
+    assert stats.get("0_rx_ucast_packets") == "162692536538787551"
+    assert stats.get("0_rx_errors") == "0"
+    assert stats.get("1_rx_ucast_packets") == "79657418409137764"
+    assert stats.get("rx_total_l4_csum_errors") == "0"
+    assert stats.get("rx_total_buf_errors") == "0"
+
+
 def test_network_data_model_creation(collector):
     """Test creating NetworkDataModel with all components"""
     interface = NetworkInterface(
@@ -572,1295 +568,6 @@
     assert data.ethtool_info["ethmock123"].speed == "1000mockMb/s"


-# Sample Broadcom NIC command outputs for testing
-NICCLI_LISTDEV_OUTPUT = """
-1 ) Broadcom BCM57608 1x400G QSFP-DD PCIe Ethernet NIC (Adp#1 Port#1)
-    Device Interface Name : abcd1p1
-    MAC Address : 81:82:83:84:85:88
-    PCI Address : 0000:22:00.0
-"""
-
-NICCLI_QOS_OUTPUT = """
-IEEE 8021QAZ ETS Configuration TLV:
-    PRIO_MAP: 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:2
-    TC Bandwidth: 50% 50% 0%
-    TSA_MAP: 0:ets 1:ets 2:strict
-IEEE 8021QAZ PFC TLV:
-    PFC enabled: 3
-IEEE 8021QAZ APP TLV:
-    APP#0:
-        Priority: 7
-        Sel: 5
-        DSCP: 48
-
-    APP#1:
-        Priority: 3
-        Sel: 5
-        DSCP: 26
-
-    APP#2:
-        Priority: 3
-        Sel: 3
-        UDP or DCCP: 4791
-
-TC Rate Limit: 100% 100% 100% 0% 0% 0% 0% 0%
-"""
-
-NICCLI_QOS_MINIMAL_OUTPUT = """IEEE 8021QAZ ETS Configuration TLV:
-    PRIO_MAP: 0:0 1:1
-    TC Bandwidth: 50% 50%
-    TSA_MAP: 0:ets 1:strict
-IEEE 8021QAZ PFC TLV:
-    PFC enabled: 1
-TC Rate Limit: 100% 100%
-"""
-
-# Sample Pensando NIC command outputs for testing
-NICCTL_SHOW_CARD_OUTPUT = """
----------------------------------------------------------------------------------------------
-Id                                    PCIe BDF      ASIC    F/W partition   Serial number
----------------------------------------------------------------------------------------------
-1111111-4c32-3533-3330-12345000000    0000:06:00.0  test1   A               ABC1234
-2222222-4c32-3533-3731-78901500000    0000:16:00.0  test2   A               DEF5678
-"""
-
-NICCTL_SHOW_DCQCN_OUTPUT = """
-NIC : 1111111-4c32-3533-3330-12345000000 (0000:06:00.0)
-------------------------------------------------------------------------------------------
-
-Lif id : 1111111-4c32-3533-3330-12345000000
-ROCE device : sample
-    DCQCN profile id : 1
-    Status : Disabled
-******************************************************************************************
-"""
-
-NICCTL_SHOW_ENVIRONMENT_OUTPUT = """
-NIC : 1111111-4c32-3533-3330-12345000000 (0000:06:00.0)
-
-    Power(W):
-        Total power drawn (pin) : 29.437
-        Core power (pout1) : 12.375
-        ARM power (pout2) : 0.788
-    Temperature(C):
-        Local board temperature : 44.12
-        Die temperature : 45.59
-    Voltage(mV):
-        Input voltage : 12078
-        Core voltage : 725
-    Frequency(MHz):
-        Core frequency : 1100
-        CPU frequency : 1500
-        P4 stage frequency : 1500
--------------------------------------------------------------------------------------
-"""
-
-NICCTL_SHOW_PCIE_ATS_OUTPUT = """
-NIC : 1111111-4c32-3533-3330-12345000000 (0000:06:00.0) : Disabled
-"""
-
-NICCTL_SHOW_PORT_OUTPUT = """
-NIC : 1111111-4c32-3533-3330-12345000000 (0000:06:00.0)
-
-Port : 555555a-6c40-4242-4242-000011010000 (eth1/1)
-    Spec:
-        Ifindex : 0x11010000
-        Type : ETH
-        speed : 400G
-        Admin state : UP
-        FEC type : RS
-        Pause type : PFC
-        Number of lanes : 4
-        MTU : 9216
-        TX pause : enabled
-        RX pause : enabled
-        Auto negotiation : disabled
-    Status:
-        Physical port : 1
-        Operational status : DOWN
-        Link FSM state : SIGNAL_DETECT
-        FEC type : RS
-        Cable type : Copper
-        Number of lanes : 4
-        speed : 400G
-        Auto negotiation : disabled
-        MAC ID : 0
-        MAC channel : 0
-        MAC address : 04:90:81:4a:6c:40
-        Transceiver type : QSFP_CMIS
-        Transceiver state : SPROM-READ
-        Transceiver PID : QSFP-400G-CR4
--------------------------------------------------------------------------------------
-"""
-
-NICCTL_SHOW_QOS_OUTPUT = """
-NIC : 1111111-4c32-3533-3330-12345000000 (0000:06:00.0)
-
-Port : 0490814a-6c40-4242-4242-000011010000
-
-    Classification type : DSCP
-
-    DSCP-to-priority :
-        DSCP bitmap : 0xffffffffffffffff ==> priority : 0
-        DSCP : 0-63 ==> priority : 0
-
-
-    PFC :
-        PFC priority bitmap : 0x0
-        PFC no-drop priorities :
-
-    Scheduling :
-        --------------------------------------------
-        Priority   Scheduling   Bandwidth   Rate-limit
-                   Type         (in %age)   (in Gbps)
-        --------------------------------------------
-        0          DWRR         0           N/A
-"""
-
-
-def test_parse_niccli_listdev_device(collector):
-    """Test parsing Broadcom NIC device from niccli --list_devices output"""
-    devices = collector._parse_niccli_listdev(NICCLI_LISTDEV_OUTPUT)
-
-    assert len(devices) == 1
-
-    # Check device
-    device1 = devices[0]
-    assert device1.device_num == 1
-    assert device1.model == "Broadcom BCM57608 1x400G QSFP-DD PCIe Ethernet NIC"
-    assert device1.adapter_port == "Adp#1 Port#1"
-    assert device1.interface_name == "abcd1p1"
-    assert device1.mac_address == "81:82:83:84:85:88"
-    assert device1.pci_address == "0000:22:00.0"
-
-
-def test_parse_niccli_listdev_empty_output(collector):
-    """Test parsing empty niccli --list_devices output"""
-    devices = collector._parse_niccli_listdev("")
-
-    assert len(devices) == 0
-
-
-def test_parse_niccli_listdev_malformed_output(collector):
-    """Test parsing malformed niccli --list_devices output gracefully"""
-    malformed = """some random text
-not a valid device line
-123 invalid format
-"""
-
-    devices = collector._parse_niccli_listdev(malformed)
-
-    # Should handle gracefully, return empty list or skip invalid lines
-    assert isinstance(devices, list)
-
-
-def test_parse_niccli_qos_complete(collector):
-    """Test parsing complete Broadcom NIC QoS output with all fields"""
-    qos = collector._parse_niccli_qos(1, NICCLI_QOS_OUTPUT)
-
-    assert qos.device_num == 1
-    assert qos.raw_output == NICCLI_QOS_OUTPUT
-
-    # Check PRIO_MAP
-    assert len(qos.prio_map) == 8
-    assert qos.prio_map[0] == 0
-    assert qos.prio_map[1] == 0
-    assert qos.prio_map[3] == 1
-    assert qos.prio_map[7] == 2
-
-    # Check TC Bandwidth
-    assert len(qos.tc_bandwidth) == 3
-    assert qos.tc_bandwidth[0] == 50
-    assert qos.tc_bandwidth[1] == 50
-    assert qos.tc_bandwidth[2] == 0
-
-    # Check TSA_MAP
-    assert len(qos.tsa_map) == 3
-    assert qos.tsa_map[0] == "ets"
-    assert qos.tsa_map[1] == "ets"
-    assert qos.tsa_map[2] == "strict"
-
-    # Check PFC enabled
-    assert qos.pfc_enabled == 3
-
-    # Check APP entries
-    assert len(qos.app_entries) == 3
-
-    # Check APP#0
-    app0 = qos.app_entries[0]
-    assert app0.priority == 7
-    assert app0.sel == 5
-    assert app0.dscp == 48
-    assert app0.protocol is None
-    assert app0.port is None
-
-    # Check APP#1
-    app1 = qos.app_entries[1]
-    assert app1.priority == 3
-    assert app1.sel == 5
-    assert app1.dscp == 26
-
-    # Check APP#2 (with protocol and port)
-    app2 = qos.app_entries[2]
-    assert app2.priority == 3
-    assert app2.sel == 3
-    assert app2.dscp is None
-    assert app2.protocol == "UDP or DCCP"
-    assert app2.port == 4791
-
-    # Check TC Rate Limit
-    assert len(qos.tc_rate_limit) == 8
-    assert qos.tc_rate_limit[0] == 100
-    assert qos.tc_rate_limit[1] == 100
-    assert qos.tc_rate_limit[2] == 100
-    assert qos.tc_rate_limit[3] == 0
-    assert qos.tc_rate_limit[7] == 0
-
-
-def test_parse_niccli_qos_empty_output(collector):
-    """Test parsing empty QoS output"""
-    qos = collector._parse_niccli_qos(1, "")
-
-    assert qos.device_num == 1
-    assert qos.raw_output == ""
-    assert len(qos.prio_map) == 0
-    assert len(qos.tc_bandwidth) == 0
-    assert len(qos.tsa_map) == 0
-    assert qos.pfc_enabled is None
-    assert len(qos.app_entries) == 0
-    assert len(qos.tc_rate_limit) == 0
-
-
-def test_parse_niccli_qos_multiple_app_protocols(collector):
-    """Test parsing QoS with APP entries having different protocols"""
-    qos_multi_protocol = """IEEE 8021QAZ ETS Configuration TLV:
-    PRIO_MAP: 0:0
-    TC Bandwidth: 100%
-    TSA_MAP: 0:ets
-IEEE 8021QAZ PFC TLV:
-    PFC enabled: 0
-IEEE 8021QAZ APP TLV:
-    APP#0:
-        Priority: 5
-        Sel: 3
-        TCP: 8080
-
-    APP#1:
-        Priority: 6
-        Sel: 3
-        UDP: 9000
-
-TC Rate Limit: 100%
-"""
-
-    qos = collector._parse_niccli_qos(3, qos_multi_protocol)
-
-    assert len(qos.app_entries) == 2
-
-    # Check TCP entry
-    app0 = qos.app_entries[0]
-    assert app0.priority == 5
-    assert app0.sel == 3
-    assert app0.protocol == "TCP"
-    assert app0.port == 8080
-
-    # Check UDP entry
-    app1 = qos.app_entries[1]
-    assert app1.priority == 6
-    assert app1.sel == 3
-    assert app1.protocol == "UDP"
-    assert app1.port == 9000
-
-
-def test_parse_niccli_qos_malformed_values(collector):
-    """Test parsing QoS output with malformed values gracefully"""
-    malformed = """IEEE 8021QAZ ETS Configuration TLV:
-    PRIO_MAP: 0:invalid 1:1 bad:data
-    TC Bandwidth: 50% invalid 50%
-    TSA_MAP: 0:ets bad:value 1:strict
-IEEE 8021QAZ PFC TLV:
-    PFC enabled: not_a_number
-TC Rate Limit: 100% bad% 100%
-"""
-
-    qos = collector._parse_niccli_qos(1, malformed)
-
-    # Should skip invalid entries but parse valid ones
-    assert qos.device_num == 1
-    # Should have parsed valid prio_map entry (1:1)
-    assert 1 in qos.prio_map
-    assert qos.prio_map[1] == 1
-    # Should have parsed valid bandwidth entries
-    assert 50 in qos.tc_bandwidth
-    # Should have parsed valid tsa_map entries
-    assert qos.tsa_map.get(0) == "ets"
-    assert qos.tsa_map.get(1) == "strict"
-    # PFC should be None due to invalid number
-    assert qos.pfc_enabled is None
-
-
-def test_network_data_model_with_broadcom_nic(collector):
-    """Test creating NetworkDataModel with Broadcom NIC data"""
-    device = BroadcomNicDevice(
-        device_num=1,
-        model="Broadcom BCM57608 1x400G QSFP-DD PCIe Ethernet NIC",
-        adapter_port="Adp#1 Port#1",
-        interface_name="benic1p1",
-        mac_address="8C:84:74:37:C3:70",
-        pci_address="0000:06:00.0",
-    )
-
-    qos = BroadcomNicQos(
-        device_num=1,
-        raw_output="test output",
-        prio_map={0: 0, 1: 1},
-        tc_bandwidth=[50, 50],
-        tsa_map={0: "ets", 1: "strict"},
-        pfc_enabled=3,
-        tc_rate_limit=[100, 100],
-    )
-
-    data = NetworkDataModel(
-        interfaces=[],
-        routes=[],
-        rules=[],
-        neighbors=[],
-        ethtool_info={},
-        broadcom_nic_devices=[device],
-        broadcom_nic_qos={1: qos},
-    )
-
-    assert len(data.broadcom_nic_devices) == 1
-    assert len(data.broadcom_nic_qos) == 1
-    assert data.broadcom_nic_devices[0].device_num == 1
-    assert data.broadcom_nic_devices[0].interface_name == "benic1p1"
-    assert data.broadcom_nic_qos[1].device_num == 1
-    assert data.broadcom_nic_qos[1].pfc_enabled == 3
-
-
-def test_parse_nicctl_show_card_multiple_cards(collector):
-    """Test parsing multiple Pensando NIC cards from nicctl show card output"""
-    cards = collector._parse_nicctl_card(NICCTL_SHOW_CARD_OUTPUT)
-
-    assert len(cards) == 2
-
-    # Check first card
-    card1 = cards[0]
-    assert card1.id == "1111111-4c32-3533-3330-12345000000"
-    assert card1.pcie_bdf == "0000:06:00.0"
-    assert card1.asic == "test1"
-    assert card1.fw_partition == "A"
-    assert card1.serial_number == "ABC1234"
-
-    # Check second card
-    card2 = cards[1]
-    assert card2.id == "2222222-4c32-3533-3731-78901500000"
-    assert card2.pcie_bdf == "0000:16:00.0"
-    assert card2.asic == "test2"
-    assert card2.fw_partition == "A"
-    assert card2.serial_number == "DEF5678"
-
-
-def test_parse_nicctl_show_card_empty_output(collector):
-    """Test parsing empty nicctl show card output"""
-    cards = collector._parse_nicctl_card("")
-
-    assert len(cards) == 0
-
-
-def test_parse_nicctl_show_card_partial_fields(collector):
-    """Test parsing nicctl show card output with partial fields"""
-    partial_output = """
----------------------------------------------------------------------------------------------
-Id                                      PCIe BDF        ASIC    F/W partition   Serial number
----------------------------------------------------------------------------------------------
-42424650-4c32-3533-3330-323934000000    0000:06:00.0
-42424650-4c32-3533-3731-304535000000    0000:16:00.0    salina
-"""
-
-    cards = collector._parse_nicctl_card(partial_output)
-
-    assert len(cards) == 2
-
-    # First card with only ID and PCIe BDF
-    card1 = cards[0]
-    assert card1.id == "42424650-4c32-3533-3330-323934000000"
-    assert card1.pcie_bdf == "0000:06:00.0"
-    assert card1.asic is None
-    assert card1.fw_partition is None
-    assert card1.serial_number is None
-
-    # Second card with ID, PCIe BDF, and ASIC
-    card2 = cards[1]
-    assert card2.id == "42424650-4c32-3533-3731-304535000000"
-    assert card2.pcie_bdf == "0000:16:00.0"
-    assert card2.asic == "salina"
-    assert card2.fw_partition is None
-    assert card2.serial_number is None
-
-
-def test_parse_nicctl_show_card_malformed_output(collector):
-    """Test parsing malformed nicctl show card output gracefully"""
-    malformed = """some random text
-not a valid card line
-123 invalid format
-"""
-
-    cards = collector._parse_nicctl_card(malformed)
-
-    # Should handle gracefully, return empty list or skip invalid lines
-    assert isinstance(cards, list)
-    # May parse some invalid entries, but should not crash
-
-
-def test_network_data_model_with_pensando_nic(collector):
-    """Test creating NetworkDataModel with Pensando NIC data"""
-    card1 = PensandoNicCard(
-        id="42424650-4c32-3533-3330-323934000000",
-        pcie_bdf="0000:06:00.0",
-        asic="salina",
-        fw_partition="A",
-        serial_number="FPL25330294",
-    )
-
-    card2 = PensandoNicCard(
-        id="42424650-4c32-3533-3731-304535000000",
-        pcie_bdf="0000:16:00.0",
-        asic="salina",
-        fw_partition="A",
-        serial_number="FPL253710E5",
-    )
-
-    data = NetworkDataModel(
-        interfaces=[],
-        routes=[],
-        rules=[],
-        neighbors=[],
-        ethtool_info={},
-        pensando_nic_cards=[card1, card2],
-    )
-
-    assert len(data.pensando_nic_cards) == 2
-    assert data.pensando_nic_cards[0].id == "42424650-4c32-3533-3330-323934000000"
-    assert data.pensando_nic_cards[0].pcie_bdf == "0000:06:00.0"
-    assert data.pensando_nic_cards[0].asic == "salina"
-    assert data.pensando_nic_cards[1].serial_number == "FPL253710E5"
-
-
-def test_collect_pensando_nic_success(collector, conn_mock):
-    """Test successful collection of Pensando NIC data"""
-    collector.system_info.os_family = OSFamily.LINUX
-
-    # Mock successful nicctl command execution
-    def run_sut_cmd_side_effect(cmd, **kwargs):
-        if "nicctl show card" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_CARD_OUTPUT, command=cmd)
-        elif "nicctl show dcqcn" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_DCQCN_OUTPUT, command=cmd)
-        elif "nicctl show environment" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_ENVIRONMENT_OUTPUT, command=cmd)
-        elif "nicctl show pcie ats" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_PCIE_ATS_OUTPUT, command=cmd)
-        elif "nicctl show port" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_PORT_OUTPUT, command=cmd)
-        elif "nicctl show qos" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_QOS_OUTPUT, command=cmd)
-        elif "nicctl show rdma statistics" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_RDMA_STATISTICS_OUTPUT, command=cmd)
-        elif "nicctl show version host-software" in cmd:
-            return MagicMock(
-                exit_code=0, stdout=NICCTL_SHOW_VERSION_HOST_SOFTWARE_OUTPUT, command=cmd
-            )
-        elif "nicctl show version firmware" in cmd:
-            return MagicMock(exit_code=0, stdout=NICCTL_SHOW_VERSION_FIRMWARE_OUTPUT, command=cmd)
-        elif "nicctl" in cmd:
-            # Other nicctl commands succeed but return empty
-            return MagicMock(exit_code=0, stdout="", command=cmd)
-        return MagicMock(exit_code=1, stdout="", command=cmd)
-
-    collector._run_sut_cmd = MagicMock(side_effect=run_sut_cmd_side_effect)
-
-    (
-        cards,
-        dcqcn_entries,
-        environment_entries,
-        pcie_ats_entries,
-        port_entries,
-        qos_entries,
-        rdma_statistics_entries,
-        version_host_software,
-        version_firmware_entries,
-        uncollected_commands,
-    ) = collector._collect_pensando_nic_info()
-
-    # All commands succeeded, so uncollected_commands should be empty
-    assert len(uncollected_commands) == 0
-
-    assert len(cards) == 2
-    assert cards[0].id == "1111111-4c32-3533-3330-12345000000"
-    assert cards[0].pcie_bdf == "0000:06:00.0"
-    assert cards[0].asic == "test1"
-    assert cards[0].serial_number == "ABC1234"
-
-    assert len(dcqcn_entries) == 1
-    assert dcqcn_entries[0].nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert dcqcn_entries[0].pcie_bdf == "0000:06:00.0"
-
-    assert len(environment_entries) == 1
-    assert environment_entries[0].nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert environment_entries[0].pcie_bdf == "0000:06:00.0"
-
-    assert len(pcie_ats_entries) == 1
-    assert pcie_ats_entries[0].nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert pcie_ats_entries[0].pcie_bdf == "0000:06:00.0"
-    assert pcie_ats_entries[0].status == "Disabled"
-
-    assert len(port_entries) == 1
-    assert port_entries[0].nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert port_entries[0].pcie_bdf == "0000:06:00.0"
-    assert port_entries[0].port_name == "eth1/1"
-
-    assert len(qos_entries) == 1
-    assert qos_entries[0].nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert qos_entries[0].pcie_bdf == "0000:06:00.0"
-    assert qos_entries[0].port_id == "0490814a-6c40-4242-4242-000011010000"
-
-    assert len(rdma_statistics_entries) == 2
-    assert rdma_statistics_entries[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert rdma_statistics_entries[0].pcie_bdf == "0000:06:00.0"
-    assert len(rdma_statistics_entries[0].statistics) == 2
-
-    assert version_host_software is not None
-    assert version_host_software.nicctl == "1.117.1-a-63"
-    assert version_host_software.ipc_driver == "1.117.1.a.63"
-    assert version_host_software.ionic_driver == "25.08.4.004"
-
-    assert len(version_firmware_entries) == 2
-    assert version_firmware_entries[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert version_firmware_entries[0].pcie_bdf == "0000:06:00.0"
-    assert version_firmware_entries[0].cpld == "3.16 (primary)"
-
-
-def test_parse_nicctl_show_dcqcn_multiple_entries(collector):
-    """Test parsing Pensando NIC DCQCN entry from nicctl show dcqcn output"""
-    dcqcn_entries = collector._parse_nicctl_dcqcn(NICCTL_SHOW_DCQCN_OUTPUT)
-
-    assert len(dcqcn_entries) == 1
-
-    # Check entry
-    entry1 = dcqcn_entries[0]
-    assert entry1.nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.lif_id == "1111111-4c32-3533-3330-12345000000"
-    assert entry1.roce_device == "sample"
-    assert entry1.dcqcn_profile_id == "1"
-    assert entry1.status == "Disabled"
-
-
-def test_parse_nicctl_show_dcqcn_empty_output(collector):
-    """Test parsing empty nicctl show dcqcn output"""
-    dcqcn_entries = collector._parse_nicctl_dcqcn("")
-
-    assert len(dcqcn_entries) == 0
-
-
-def test_parse_nicctl_show_dcqcn_partial_fields(collector):
-    """Test parsing nicctl show dcqcn output with partial fields"""
-    partial_output = """
-NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0)
-------------------------------------------------------------------------------------------
-
-Lif id : 43000070-0100-0000-4242-0490814a6c40
-******************************************************************************************
-"""
-
-    dcqcn_entries = collector._parse_nicctl_dcqcn(partial_output)
-
-    assert len(dcqcn_entries) == 1
-
-    # Entry with only NIC ID, PCIe BDF, and Lif ID
-    entry1 = dcqcn_entries[0]
-    assert entry1.nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.lif_id == "43000070-0100-0000-4242-0490814a6c40"
-    assert entry1.roce_device is None
-    assert entry1.dcqcn_profile_id is None
-    assert entry1.status is None
-
-
-def test_parse_nicctl_show_dcqcn_malformed_output(collector):
-    """Test parsing malformed nicctl show dcqcn output gracefully"""
-    malformed = """some random text
-not a valid dcqcn line
-123 invalid format
-"""
-
-    dcqcn_entries = collector._parse_nicctl_dcqcn(malformed)
-
-    # Should handle gracefully, return empty list
-    assert isinstance(dcqcn_entries, list)
-    assert len(dcqcn_entries) == 0
-
-
-def test_network_data_model_with_pensando_nic_dcqcn(collector):
-    """Test creating NetworkDataModel with Pensando NIC DCQCN data"""
-    dcqcn1 = PensandoNicDcqcn(
-        nic_id="42424650-4c32-3533-3330-323934000000",
-        pcie_bdf="0000:06:00.0",
-        lif_id="43000070-0100-0000-4242-0490814a6c40",
-        roce_device="rocep9s0",
-        dcqcn_profile_id="1",
-        status="Disabled",
-    )
-
-    dcqcn2 = PensandoNicDcqcn(
-        nic_id="42424650-4c32-3533-3731-304535000000",
-        pcie_bdf="0000:16:00.0",
-        lif_id="43000070-0100-0000-4242-0490815cce50",
-        roce_device="rocep25s0",
-        dcqcn_profile_id="1",
-        status="Disabled",
-    )
-
-    data = NetworkDataModel(
-        interfaces=[],
-        routes=[],
-        rules=[],
-        neighbors=[],
-        ethtool_info={},
-        pensando_nic_dcqcn=[dcqcn1, dcqcn2],
-    )
-
-    assert len(data.pensando_nic_dcqcn) == 2
-    assert data.pensando_nic_dcqcn[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert data.pensando_nic_dcqcn[0].pcie_bdf == "0000:06:00.0"
-    assert data.pensando_nic_dcqcn[0].roce_device == "rocep9s0"
-    assert data.pensando_nic_dcqcn[1].lif_id == "43000070-0100-0000-4242-0490815cce50"
-
-
-def test_parse_nicctl_show_environment_multiple_entries(collector):
-    """Test parsing Pensando NIC environment entry from nicctl show environment output"""
-    environment_entries = collector._parse_nicctl_environment(NICCTL_SHOW_ENVIRONMENT_OUTPUT)
-
-    assert len(environment_entries) == 1
-
-    # Check entry
-    entry1 = environment_entries[0]
-    assert entry1.nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.total_power_drawn == 29.437
-    assert entry1.core_power == 12.375
-    assert entry1.arm_power == 0.788
-    assert entry1.local_board_temperature == 44.12
-    assert entry1.die_temperature == 45.59
-    assert entry1.input_voltage == 12078
-    assert entry1.core_voltage == 725
-    assert entry1.core_frequency == 1100
-    assert entry1.cpu_frequency == 1500
-    assert entry1.p4_stage_frequency == 1500
-
-
-def test_parse_nicctl_show_environment_empty_output(collector):
-    """Test parsing empty nicctl show environment output"""
-    environment_entries = collector._parse_nicctl_environment("")
-
-    assert len(environment_entries) == 0
-
-
-def test_parse_nicctl_show_environment_partial_fields(collector):
-    """Test parsing nicctl show environment output with partial fields"""
-    partial_output = """
-NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0)
-
-    Power(W):
-        Total power drawn (pin) : 29.437
-    Temperature(C):
-        Local board temperature : 44.12
--------------------------------------------------------------------------------------
-"""
-
-    environment_entries = collector._parse_nicctl_environment(partial_output)
-
-    assert len(environment_entries) == 1
-
-    # Entry with only some fields
-    entry1 = environment_entries[0]
-    assert entry1.nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.total_power_drawn == 29.437
-    assert entry1.local_board_temperature == 44.12
-    assert entry1.core_power is None
-    assert entry1.die_temperature is None
-    assert entry1.input_voltage is None
-
-
-def test_parse_nicctl_show_environment_malformed_output(collector):
-    """Test parsing malformed nicctl show environment output gracefully"""
-    malformed = """some random text
-not a valid environment line
-123 invalid format
-"""
-
-    environment_entries = collector._parse_nicctl_environment(malformed)
-
-    # Should handle gracefully, return empty list
-    assert isinstance(environment_entries, list)
-    assert len(environment_entries) == 0
-
-
-def test_network_data_model_with_pensando_nic_environment(collector):
-    """Test creating NetworkDataModel with Pensando NIC environment data"""
-    env1 = PensandoNicEnvironment(
-        nic_id="42424650-4c32-3533-3330-323934000000",
-        pcie_bdf="0000:06:00.0",
-        total_power_drawn=29.437,
-        core_power=12.375,
-        arm_power=0.788,
-        local_board_temperature=44.12,
-        die_temperature=45.59,
-        input_voltage=12078,
-        core_voltage=725,
-        core_frequency=1100,
-        cpu_frequency=1500,
-        p4_stage_frequency=1500,
-    )
-
-    env2 = PensandoNicEnvironment(
-        nic_id="42424650-4c32-3533-3731-304535000000",
-        pcie_bdf="0000:16:00.0",
-        total_power_drawn=28.968,
-        core_power=12.031,
-        arm_power=0.292,
-        local_board_temperature=42.62,
-        die_temperature=42.28,
-        input_voltage=12078,
-        core_voltage=725,
-        core_frequency=1100,
-        cpu_frequency=1500,
-        p4_stage_frequency=1500,
-    )
-
-    data = NetworkDataModel(
-        interfaces=[],
-        routes=[],
-        rules=[],
-        neighbors=[],
-        ethtool_info={},
-        pensando_nic_environment=[env1, env2],
-    )
-
-    assert len(data.pensando_nic_environment) == 2
-    assert data.pensando_nic_environment[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert data.pensando_nic_environment[0].pcie_bdf == "0000:06:00.0"
-    assert data.pensando_nic_environment[0].total_power_drawn == 29.437
-    assert data.pensando_nic_environment[0].die_temperature == 45.59
-    assert data.pensando_nic_environment[1].core_frequency == 1100
-
-
-def test_parse_nicctl_show_pcie_ats_multiple_entries(collector):
-    """Test parsing Pensando NIC PCIe ATS entry from nicctl show pcie ats output"""
-    pcie_ats_entries = collector._parse_nicctl_pcie_ats(NICCTL_SHOW_PCIE_ATS_OUTPUT)
-
-    assert len(pcie_ats_entries) == 1
-
-    # Check entry
-    entry1 = pcie_ats_entries[0]
-    assert entry1.nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.status == "Disabled"
-
-
-def test_parse_nicctl_show_pcie_ats_empty_output(collector):
-    """Test parsing empty nicctl show pcie ats output"""
-    pcie_ats_entries = collector._parse_nicctl_pcie_ats("")
-
-    assert len(pcie_ats_entries) == 0
-
-
-def test_parse_nicctl_show_pcie_ats_enabled(collector):
-    """Test parsing nicctl show pcie ats output with Enabled status"""
-    enabled_output = """
-NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0) : Enabled
-NIC : 42424650-4c32-3533-3731-304535000000 (0000:16:00.0) : Disabled
-"""
-
-    pcie_ats_entries = collector._parse_nicctl_pcie_ats(enabled_output)
-
-    assert len(pcie_ats_entries) == 2
-    assert pcie_ats_entries[0].status == "Enabled"
-    assert pcie_ats_entries[1].status == "Disabled"
-
-
-def test_parse_nicctl_show_pcie_ats_malformed_output(collector):
-    """Test parsing malformed nicctl show pcie ats output gracefully"""
-    malformed = """some random text
-not a valid pcie ats line
-123 invalid format
-"""
-
-    pcie_ats_entries = collector._parse_nicctl_pcie_ats(malformed)
-
-    # Should handle gracefully, return empty list
-    assert isinstance(pcie_ats_entries, list)
-    assert len(pcie_ats_entries) == 0
-
-
-def test_network_data_model_with_pensando_nic_pcie_ats(collector):
-    """Test creating NetworkDataModel with Pensando NIC PCIe ATS data"""
-    ats1 = PensandoNicPcieAts(
-        nic_id="42424650-4c32-3533-3330-323934000000",
-        pcie_bdf="0000:06:00.0",
-        status="Disabled",
-    )
-
-    ats2 = PensandoNicPcieAts(
-        nic_id="42424650-4c32-3533-3731-304535000000",
-        pcie_bdf="0000:16:00.0",
-        status="Enabled",
-    )
-
-    data = NetworkDataModel(
-        interfaces=[],
-        routes=[],
-        rules=[],
-        neighbors=[],
-        ethtool_info={},
-        pensando_nic_pcie_ats=[ats1, ats2],
-    )
-
-    assert len(data.pensando_nic_pcie_ats) == 2
-    assert data.pensando_nic_pcie_ats[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert data.pensando_nic_pcie_ats[0].pcie_bdf == "0000:06:00.0"
-    assert data.pensando_nic_pcie_ats[0].status == "Disabled"
-    assert data.pensando_nic_pcie_ats[1].status == "Enabled"
-
-
-def test_parse_nicctl_show_port_multiple_entries(collector):
-    """Test parsing Pensando NIC port entry from nicctl show port output"""
-    port_entries = collector._parse_nicctl_port(NICCTL_SHOW_PORT_OUTPUT)
-
-    assert len(port_entries) == 1
-
-    # Check entry
-    entry1 = port_entries[0]
-    assert entry1.nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.port_id == "555555a-6c40-4242-4242-000011010000"
-    assert entry1.port_name == "eth1/1"
-    # Spec fields
-    assert entry1.spec_ifindex == "0x11010000"
-    assert entry1.spec_type == "ETH"
-    assert entry1.spec_speed == "400G"
-    assert entry1.spec_admin_state == "UP"
-    assert entry1.spec_fec_type == "RS"
-    assert entry1.spec_pause_type == "PFC"
-    assert entry1.spec_num_lanes == 4
-    assert entry1.spec_mtu == 9216
-    assert entry1.spec_tx_pause == "enabled"
-    assert entry1.spec_rx_pause == "enabled"
-    assert entry1.spec_auto_negotiation == "disabled"
-    # Status fields
-    assert entry1.status_physical_port == 1
-    assert entry1.status_operational_status == "DOWN"
-    assert entry1.status_link_fsm_state == "SIGNAL_DETECT"
-    assert entry1.status_fec_type == "RS"
-    assert entry1.status_cable_type == "Copper"
-    assert entry1.status_num_lanes == 4
-    assert entry1.status_speed == "400G"
-    assert entry1.status_auto_negotiation == "disabled"
-    assert entry1.status_mac_id == 0
-    assert entry1.status_mac_channel == 0
-    assert entry1.status_mac_address == "04:90:81:4a:6c:40"
-    assert entry1.status_transceiver_type == "QSFP_CMIS"
-    assert entry1.status_transceiver_state == "SPROM-READ"
-    assert entry1.status_transceiver_pid == "QSFP-400G-CR4"
-
-
-def test_parse_nicctl_show_port_empty_output(collector):
-    """Test parsing empty nicctl show port output"""
-    port_entries = collector._parse_nicctl_port("")
-
-    assert len(port_entries) == 0
-
-
-def test_parse_nicctl_show_port_partial_fields(collector):
-    """Test parsing nicctl show port output with partial fields"""
-    partial_output = """
-NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0)
-
-Port : 0490814a-6c40-4242-4242-000011010000 (eth1/1)
-    Spec:
-        speed : 400G
-        Admin state : UP
-    Status:
-        Operational status : DOWN
--------------------------------------------------------------------------------------
-"""
-
-    port_entries = collector._parse_nicctl_port(partial_output)
-
-    assert len(port_entries) == 1
-
-    # Entry with only some fields
-    entry1 = port_entries[0]
-    assert entry1.nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.port_name == "eth1/1"
-    assert entry1.spec_speed == "400G"
-    assert entry1.spec_admin_state == "UP"
-    assert entry1.status_operational_status == "DOWN"
-    assert entry1.spec_mtu is None
-    assert entry1.status_mac_address is None
-
-
-def test_parse_nicctl_show_port_malformed_output(collector):
-    """Test parsing malformed nicctl show port output gracefully"""
-    malformed = """some random text
-not a valid port line
-123 invalid format
-"""
-
-    port_entries = collector._parse_nicctl_port(malformed)
-
-    # Should handle gracefully, return empty list
-    assert isinstance(port_entries, list)
-    assert len(port_entries) == 0
-
-
-def test_network_data_model_with_pensando_nic_port(collector):
-    """Test creating NetworkDataModel with Pensando NIC port data"""
-    port1 = PensandoNicPort(
-        nic_id="42424650-4c32-3533-3330-323934000000",
-        pcie_bdf="0000:06:00.0",
-        port_id="0490814a-6c40-4242-4242-000011010000",
-        port_name="eth1/1",
-        spec_speed="400G",
-        spec_admin_state="UP",
-        spec_mtu=9216,
-        status_operational_status="DOWN",
-        status_mac_address="04:90:81:4a:6c:40",
-    )
-
-    port2 = PensandoNicPort(
-        nic_id="42424650-4c32-3533-3731-304535000000",
-        pcie_bdf="0000:16:00.0",
-        port_id="0490815c-ce50-4242-4242-000011010000",
-        port_name="eth1/1",
-        spec_speed="400G",
-        spec_admin_state="UP",
-        spec_mtu=9216,
-        status_operational_status="UP",
-        status_mac_address="04:90:81:5c:ce:50",
-    )
-
-    data = NetworkDataModel(
-        interfaces=[],
-        routes=[],
-        rules=[],
-        neighbors=[],
-        ethtool_info={},
-        pensando_nic_ports=[port1, port2],
-    )
-
-    assert len(data.pensando_nic_ports) == 2
-    assert data.pensando_nic_ports[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert data.pensando_nic_ports[0].port_name == "eth1/1"
-    assert data.pensando_nic_ports[0].spec_speed == "400G"
-    assert data.pensando_nic_ports[0].status_mac_address == "04:90:81:4a:6c:40"
-    assert data.pensando_nic_ports[1].status_operational_status == "UP"
-
-
-def test_parse_nicctl_show_qos_multiple_entries(collector):
-    """Test parsing Pensando NIC QoS entry from nicctl show qos output"""
-    qos_entries = collector._parse_nicctl_qos(NICCTL_SHOW_QOS_OUTPUT)
-
-    assert len(qos_entries) == 1
-
-    # Check entry
-    entry1 = qos_entries[0]
-    assert entry1.nic_id == "1111111-4c32-3533-3330-12345000000"
-    assert entry1.pcie_bdf == "0000:06:00.0"
-    assert entry1.port_id == "0490814a-6c40-4242-4242-000011010000"
-    assert entry1.classification_type == "DSCP"
-    assert entry1.dscp_bitmap == "0xffffffffffffffff"
-    assert entry1.dscp_range == "0-63"
-    assert entry1.dscp_priority == 0
-    assert entry1.pfc_priority_bitmap == "0x0"
-    assert entry1.pfc_no_drop_priorities == ""
-    assert len(entry1.scheduling) == 1
-    assert entry1.scheduling[0].priority == 0
-    assert entry1.scheduling[0].scheduling_type == "DWRR"
-    assert entry1.scheduling[0].bandwidth == 0
-    assert entry1.scheduling[0].rate_limit == "N/A"
-
-
-def test_parse_nicctl_show_qos_empty_output(collector):
-    """Test parsing empty nicctl show qos output"""
-    qos_entries = collector._parse_nicctl_qos("")
-
-    assert len(qos_entries) == 0
-
-
-def test_parse_nicctl_show_qos_malformed_output(collector):
-    """Test parsing malformed nicctl show qos output gracefully"""
-    malformed = """some random text
-not a valid qos line
-123 invalid format
-"""
-
-    qos_entries = collector._parse_nicctl_qos(malformed)
-
-    # Should handle gracefully, return empty list
-    assert isinstance(qos_entries, list)
-    assert len(qos_entries) == 0
-
-
-def test_network_data_model_with_pensando_nic_qos(collector):
-    """Test creating NetworkDataModel with Pensando NIC QoS data"""
-    sched1 = PensandoNicQosScheduling(
-        priority=0,
-        scheduling_type="DWRR",
-        bandwidth=0,
-        rate_limit="N/A",
-    )
-
-    qos1 = PensandoNicQos(
-        nic_id="42424650-4c32-3533-3330-323934000000",
-        pcie_bdf="0000:06:00.0",
-        port_id="0490814a-6c40-4242-4242-000011010000",
-        classification_type="DSCP",
-        dscp_bitmap="0xffffffffffffffff",
-        dscp_range="0-63",
-        dscp_priority=0,
-        pfc_priority_bitmap="0x0",
-        pfc_no_drop_priorities="",
-        scheduling=[sched1],
-    )
-
-    qos2 = PensandoNicQos(
-        nic_id="42424650-4c32-3533-3731-304535000000",
-        pcie_bdf="0000:16:00.0",
-        port_id="0490815c-ce50-4242-4242-000011010000",
-        classification_type="DSCP",
-    )
-
-    data = NetworkDataModel(
-        interfaces=[],
-        routes=[],
-        rules=[],
-        neighbors=[],
-        ethtool_info={},
-        pensando_nic_qos=[qos1, qos2],
-    )
-
-    assert len(data.pensando_nic_qos) == 2
-    assert data.pensando_nic_qos[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert data.pensando_nic_qos[0].port_id == "0490814a-6c40-4242-4242-000011010000"
-    assert data.pensando_nic_qos[0].classification_type == "DSCP"
-    assert len(data.pensando_nic_qos[0].scheduling) == 1
-    assert data.pensando_nic_qos[1].nic_id == "42424650-4c32-3533-3731-304535000000"
-
-
-# Mock output for 'nicctl show rdma statistics'
-NICCTL_SHOW_RDMA_STATISTICS_OUTPUT = """NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0)
-
-------------------------------------------------------------
-Name                      Count
-------------------------------------------------------------
-Queue pair create         1
-Completion queue create   2
-
-NIC : 42424650-4c32-3533-3731-304535000000 (0000:16:00.0)
-
-------------------------------------------------------------
-Name                      Count
-------------------------------------------------------------
-Queue pair create         1
-Completion queue create   2
-"""
-
-
-def test_parse_nicctl_show_rdma_statistics_multiple_entries(collector):
-    """Test parsing multiple NIC RDMA statistics entries."""
-    entries = collector._parse_nicctl_rdma_statistics(NICCTL_SHOW_RDMA_STATISTICS_OUTPUT)
-
-    assert len(entries) == 2
-
-    # Check first entry
-    assert entries[0].nic_id == "42424650-4c32-3533-3330-323934000000"
-    assert entries[0].pcie_bdf == "0000:06:00.0"
-    assert len(entries[0].statistics)
== 2 - assert entries[0].statistics[0].name == "Queue pair create" - assert entries[0].statistics[0].count == 1 - assert entries[0].statistics[1].name == "Completion queue create" - assert entries[0].statistics[1].count == 2 - - # Check second entry - assert entries[1].nic_id == "42424650-4c32-3533-3731-304535000000" - assert entries[1].pcie_bdf == "0000:16:00.0" - assert len(entries[1].statistics) == 2 - assert entries[1].statistics[0].name == "Queue pair create" - assert entries[1].statistics[0].count == 1 - assert entries[1].statistics[1].name == "Completion queue create" - assert entries[1].statistics[1].count == 2 - - -def test_parse_nicctl_show_rdma_statistics_empty_output(collector): - """Test parsing empty RDMA statistics output.""" - entries = collector._parse_nicctl_rdma_statistics("") - assert len(entries) == 0 - - -# Mock output for 'nicctl show version host-software' -NICCTL_SHOW_VERSION_HOST_SOFTWARE_OUTPUT = """nicctl : 1.117.1-a-63 -IPC driver : 1.117.1.a.63 -ionic driver : 25.08.4.004 -""" - - -def test_parse_nicctl_show_version_host_software(collector): - """Test parsing host software version.""" - version = collector._parse_nicctl_version_host_software( - NICCTL_SHOW_VERSION_HOST_SOFTWARE_OUTPUT - ) - - assert version is not None - assert version.nicctl == "1.117.1-a-63" - assert version.ipc_driver == "1.117.1.a.63" - assert version.ionic_driver == "25.08.4.004" - - -def test_parse_nicctl_show_version_host_software_empty_output(collector): - """Test parsing empty host software version output.""" - version = collector._parse_nicctl_version_host_software("") - assert version is None - - -# Mock output for 'nicctl show version firmware' -NICCTL_SHOW_VERSION_FIRMWARE_OUTPUT = """NIC : 42424650-4c32-3533-3330-323934000000 (0000:06:00.0) - -CPLD : 3.16 (primary) -Boot0 : 21 -Uboot-A : 1.117.1-a-63 -Firmware-A : 1.117.1-a-63 -Device config-A : device_config_rdma_1x400G/1.0.0 
-------------------------------------------------------------------------------------- - -NIC : 42424650-4c32-3533-3731-304535000000 (0000:16:00.0) - -CPLD : 3.16 (primary) -Boot0 : 21 -Uboot-A : 1.117.1-a-63 -Firmware-A : 1.117.1-a-63 -Device config-A : device_config_rdma_1x400G/1.0.0 -------------------------------------------------------------------------------------- -""" - - -def test_parse_nicctl_show_version_firmware_multiple_entries(collector): - """Test parsing multiple NIC firmware version entries.""" - entries = collector._parse_nicctl_version_firmware(NICCTL_SHOW_VERSION_FIRMWARE_OUTPUT) - - assert len(entries) == 2 - - # Check first entry - assert entries[0].nic_id == "42424650-4c32-3533-3330-323934000000" - assert entries[0].pcie_bdf == "0000:06:00.0" - assert entries[0].cpld == "3.16 (primary)" - assert entries[0].boot0 == "21" - assert entries[0].uboot_a == "1.117.1-a-63" - assert entries[0].firmware_a == "1.117.1-a-63" - assert entries[0].device_config_a == "device_config_rdma_1x400G/1.0.0" - - # Check second entry - assert entries[1].nic_id == "42424650-4c32-3533-3731-304535000000" - assert entries[1].pcie_bdf == "0000:16:00.0" - assert entries[1].cpld == "3.16 (primary)" - assert entries[1].boot0 == "21" - assert entries[1].uboot_a == "1.117.1-a-63" - assert entries[1].firmware_a == "1.117.1-a-63" - assert entries[1].device_config_a == "device_config_rdma_1x400G/1.0.0" - - -def test_parse_nicctl_show_version_firmware_empty_output(collector): - """Test parsing empty firmware version output.""" - entries = collector._parse_nicctl_version_firmware("") - assert len(entries) == 0 - - -def test_network_data_model_with_pensando_nic_rdma_statistics(): - """Test NetworkDataModel with Pensando NIC RDMA statistics.""" - from nodescraper.plugins.inband.network.networkdata import ( - NetworkDataModel, - PensandoNicRdmaStatistic, - PensandoNicRdmaStatistics, - ) - - data = NetworkDataModel( - pensando_nic_rdma_statistics=[ - PensandoNicRdmaStatistics( - 
nic_id="42424650-4c32-3533-3330-323934000000", - pcie_bdf="0000:06:00.0", - statistics=[ - PensandoNicRdmaStatistic(name="Queue pair create", count=1), - PensandoNicRdmaStatistic(name="Completion queue create", count=2), - ], - ) - ] - ) - - assert len(data.pensando_nic_rdma_statistics) == 1 - assert data.pensando_nic_rdma_statistics[0].nic_id == "42424650-4c32-3533-3330-323934000000" - assert len(data.pensando_nic_rdma_statistics[0].statistics) == 2 - - -def test_network_data_model_with_pensando_nic_version_host_software(): - """Test NetworkDataModel with Pensando NIC host software version.""" - from nodescraper.plugins.inband.network.networkdata import ( - NetworkDataModel, - PensandoNicVersionHostSoftware, - ) - - data = NetworkDataModel( - pensando_nic_version_host_software=PensandoNicVersionHostSoftware( - nicctl="1.117.1-a-63", - ipc_driver="1.117.1.a.63", - ionic_driver="25.08.4.004", - ) - ) - - assert data.pensando_nic_version_host_software is not None - assert data.pensando_nic_version_host_software.nicctl == "1.117.1-a-63" - assert data.pensando_nic_version_host_software.ipc_driver == "1.117.1.a.63" - assert data.pensando_nic_version_host_software.ionic_driver == "25.08.4.004" - - -def test_network_data_model_with_pensando_nic_version_firmware(): - """Test NetworkDataModel with Pensando NIC firmware versions.""" - from nodescraper.plugins.inband.network.networkdata import ( - NetworkDataModel, - PensandoNicVersionFirmware, - ) - - data = NetworkDataModel( - pensando_nic_version_firmware=[ - PensandoNicVersionFirmware( - nic_id="42424650-4c32-3533-3330-323934000000", - pcie_bdf="0000:06:00.0", - cpld="3.16 (primary)", - boot0="21", - uboot_a="1.117.1-a-63", - firmware_a="1.117.1-a-63", - device_config_a="device_config_rdma_1x400G/1.0.0", - ) - ] - ) - - assert len(data.pensando_nic_version_firmware) == 1 - assert data.pensando_nic_version_firmware[0].nic_id == "42424650-4c32-3533-3330-323934000000" - assert data.pensando_nic_version_firmware[0].cpld == 
"3.16 (primary)" - - def test_network_accessibility_linux_success(collector, conn_mock): """Test network accessibility check on Linux with successful ping""" collector.system_info.os_family = OSFamily.LINUX diff --git a/test/unit/plugin/test_niccli_collector.py b/test/unit/plugin/test_niccli_collector.py new file mode 100644 index 00000000..c4e5adef --- /dev/null +++ b/test/unit/plugin/test_niccli_collector.py @@ -0,0 +1,270 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2025 Advanced Micro Devices, Inc. +# +############################################################################### +from unittest.mock import MagicMock + +import pytest + +from nodescraper.enums.executionstatus import ExecutionStatus +from nodescraper.enums.systeminteraction import SystemInteractionLevel +from nodescraper.models.systeminfo import OSFamily +from nodescraper.plugins.inband.nic.nic_collector import NicCollector +from nodescraper.plugins.inband.nic.nic_data import ( + NicCliDevice, + NicCliQos, + NicDataModel, + PensandoNicCard, +) + + +@pytest.fixture +def collector(system_info, conn_mock): + return NicCollector( + system_info=system_info, + system_interaction_level=SystemInteractionLevel.PASSIVE, + connection=conn_mock, + ) + + +NICCLI_LISTDEV_OUTPUT = """1) Broadcom BCM57608 1x400G QSFP-DD PCIe Ethernet NIC (Adp#1 Port#1) + Device Interface : abcd1p1 + MAC Address : 81:82:83:84:85:88 + PCI Address : 0000:22:00.0 +""" + +NICCLI_QOS_OUTPUT = """IEEE 8021QAZ ETS Configuration TLV: + PRIO_MAP: 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:2 + TC Bandwidth: 50% 50% 0% + TSA_MAP: 0:ets 1:ets 2:strict +IEEE 8021QAZ PFC TLV: + PFC enabled: 3 +IEEE 8021QAZ APP TLV: + APP#0: + Priority: 7 + Sel: 5 + DSCP: 48 + + APP#1: + Priority: 3 + Sel: 5 + DSCP: 26 + + APP#2: + Priority: 3 + Sel: 3 + UDP or DCCP: 4791 + +TC Rate Limit: 100% 100% 100% 0% 0% 0% 0% 0% +""" + + +def test_parse_niccli_listdev_device(collector): + """Test parsing 
Broadcom NIC device from niccli --list_devices output.""" + devices = collector._parse_niccli_listdev(NICCLI_LISTDEV_OUTPUT) + + assert len(devices) == 1 + device1 = devices[0] + assert device1.device_num == 1 + assert device1.model == "Broadcom BCM57608 1x400G QSFP-DD PCIe Ethernet NIC" + assert device1.adapter_port == "Adp#1 Port#1" + assert device1.interface_name == "abcd1p1" + assert device1.mac_address == "81:82:83:84:85:88" + assert device1.pci_address == "0000:22:00.0" + + +def test_parse_niccli_listdev_empty_output(collector): + """Test parsing empty niccli --list_devices output.""" + devices = collector._parse_niccli_listdev("") + assert len(devices) == 0 + + +def test_parse_niccli_listdev_malformed_output(collector): + """Test parsing malformed niccli --list_devices output gracefully.""" + malformed = """some random text +not a valid device line +123 invalid format +""" + devices = collector._parse_niccli_listdev(malformed) + assert isinstance(devices, list) + + +def test_parse_niccli_qos_complete(collector): + """Test parsing complete Broadcom NIC QoS output with all fields.""" + qos = collector._parse_niccli_qos(1, NICCLI_QOS_OUTPUT) + + assert qos.device_num == 1 + assert qos.raw_output == NICCLI_QOS_OUTPUT + assert len(qos.prio_map) == 8 + assert qos.prio_map[0] == 0 + assert qos.prio_map[3] == 1 + assert qos.prio_map[7] == 2 + assert len(qos.tc_bandwidth) == 3 + assert qos.tc_bandwidth[0] == 50 + assert qos.tc_bandwidth[1] == 50 + assert qos.tc_bandwidth[2] == 0 + assert len(qos.tsa_map) == 3 + assert qos.tsa_map[0] == "ets" + assert qos.tsa_map[2] == "strict" + assert qos.pfc_enabled == 3 + assert len(qos.app_entries) == 3 + assert qos.app_entries[0].priority == 7 + assert qos.app_entries[0].sel == 5 + assert qos.app_entries[0].dscp == 48 + assert qos.app_entries[2].protocol == "UDP or DCCP" + assert qos.app_entries[2].port == 4791 + assert len(qos.tc_rate_limit) == 8 + assert qos.tc_rate_limit[0] == 100 + + +def 
test_parse_niccli_qos_empty_output(collector): + """Test parsing empty QoS output.""" + qos = collector._parse_niccli_qos(1, "") + assert qos.device_num == 1 + assert qos.raw_output == "" + assert len(qos.prio_map) == 0 + assert len(qos.tc_bandwidth) == 0 + assert len(qos.tsa_map) == 0 + assert qos.pfc_enabled is None + assert len(qos.app_entries) == 0 + assert len(qos.tc_rate_limit) == 0 + + +def test_parse_niccli_qos_multiple_app_protocols(collector): + """Test parsing QoS with APP entries having different protocols.""" + qos_multi_protocol = """IEEE 8021QAZ ETS Configuration TLV: + PRIO_MAP: 0:0 + TC Bandwidth: 100% + TSA_MAP: 0:ets +IEEE 8021QAZ PFC TLV: + PFC enabled: 0 +IEEE 8021QAZ APP TLV: + APP#0: + Priority: 5 + Sel: 3 + TCP: 8080 + + APP#1: + Priority: 6 + Sel: 3 + UDP: 9000 + +TC Rate Limit: 100% +""" + qos = collector._parse_niccli_qos(3, qos_multi_protocol) + assert len(qos.app_entries) == 2 + assert qos.app_entries[0].priority == 5 + assert qos.app_entries[0].sel == 3 + assert qos.app_entries[0].protocol == "TCP" + assert qos.app_entries[0].port == 8080 + assert qos.app_entries[1].priority == 6 + assert qos.app_entries[1].protocol == "UDP" + assert qos.app_entries[1].port == 9000 + + +def test_parse_niccli_qos_malformed_values(collector): + """Test parsing QoS output with malformed values gracefully.""" + malformed = """IEEE 8021QAZ ETS Configuration TLV: + PRIO_MAP: 0:invalid 1:1 bad:data + TC Bandwidth: 50% invalid 50% + TSA_MAP: 0:ets bad:value 1:strict +IEEE 8021QAZ PFC TLV: + PFC enabled: not_a_number +TC Rate Limit: 100% bad% 100% +""" + qos = collector._parse_niccli_qos(1, malformed) + assert qos.device_num == 1 + assert 1 in qos.prio_map + assert qos.prio_map[1] == 1 + assert 50 in qos.tc_bandwidth + assert qos.tsa_map.get(0) == "ets" + assert qos.tsa_map.get(1) == "strict" + assert qos.pfc_enabled is None + + +def test_nic_data_model_with_broadcom_nic(collector): + """Test creating NicDataModel with Broadcom NIC data.""" + device = 
NicCliDevice( + device_num=1, + model="Broadcom BCM57608 1x400G QSFP-DD PCIe Ethernet NIC", + adapter_port="Adp#1 Port#1", + interface_name="benic1p1", + mac_address="8C:84:74:37:C3:70", + pci_address="0000:06:00.0", + ) + qos = NicCliQos( + device_num=1, + raw_output="test output", + prio_map={0: 0, 1: 1}, + tc_bandwidth=[50, 50], + tsa_map={0: "ets", 1: "strict"}, + pfc_enabled=3, + tc_rate_limit=[100, 100], + ) + data = NicDataModel( + broadcom_nic_devices=[device], + broadcom_nic_qos={1: qos}, + ) + assert len(data.broadcom_nic_devices) == 1 + assert len(data.broadcom_nic_qos) == 1 + assert data.broadcom_nic_devices[0].device_num == 1 + assert data.broadcom_nic_devices[0].interface_name == "benic1p1" + assert data.broadcom_nic_qos[1].device_num == 1 + assert data.broadcom_nic_qos[1].pfc_enabled == 3 + + +def test_nic_data_model_with_pensando_nic(collector): + """Test creating NicDataModel with Pensando NIC data.""" + card1 = PensandoNicCard( + id="42424650-4c32-3533-3330-323934000000", + pcie_bdf="0000:06:00.0", + asic="salina", + fw_partition="A", + serial_number="FPL25330294", + ) + card2 = PensandoNicCard( + id="42424650-4c32-3533-3731-304535000000", + pcie_bdf="0000:16:00.0", + asic="salina", + fw_partition="A", + serial_number="FPL253710E5", + ) + data = NicDataModel( + pensando_nic_cards=[card1, card2], + ) + assert len(data.pensando_nic_cards) == 2 + assert data.pensando_nic_cards[0].id == "42424650-4c32-3533-3330-323934000000" + assert data.pensando_nic_cards[0].pcie_bdf == "0000:06:00.0" + assert data.pensando_nic_cards[0].asic == "salina" + assert data.pensando_nic_cards[1].serial_number == "FPL253710E5" + + +def test_collect_data_success(collector, conn_mock): + """Test successful collection of niccli/nicctl data.""" + collector.system_info.os_family = OSFamily.LINUX + + def run_sut_cmd_side_effect(cmd, **kwargs): + if "niccli" in cmd and ("--list" in cmd or "--list_devices" in cmd): + return MagicMock(exit_code=0, stdout=NICCLI_LISTDEV_OUTPUT, 
stderr="", command=cmd) + if cmd.strip() == "nicctl show card": + return MagicMock( + exit_code=0, + stdout="1111111-4c32-3533-3330-12345000000 0000:06:00.0\n", + stderr="", + command=cmd, + ) + if "nicctl" in cmd or "niccli" in cmd: + return MagicMock(exit_code=0, stdout="", stderr="", command=cmd) + return MagicMock(exit_code=1, stdout="", stderr="", command=cmd) + + collector._run_sut_cmd = MagicMock(side_effect=run_sut_cmd_side_effect) + + result, data = collector.collect_data() + + assert result.status == ExecutionStatus.OK + assert data is not None + assert isinstance(data, NicDataModel) + assert len(data.results) >= 1 diff --git a/test/unit/plugin/test_rdma_analyzer.py b/test/unit/plugin/test_rdma_analyzer.py new file mode 100644 index 00000000..2f477b11 --- /dev/null +++ b/test/unit/plugin/test_rdma_analyzer.py @@ -0,0 +1,279 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +############################################################################### +import json +from pathlib import Path + +import pytest + +from nodescraper.enums import EventPriority, ExecutionStatus +from nodescraper.plugins.inband.rdma.rdma_analyzer import RdmaAnalyzer +from nodescraper.plugins.inband.rdma.rdmadata import ( + RdmaDataModel, + RdmaLink, + RdmaStatistics, +) + + +@pytest.fixture +def rdma_analyzer(system_info): + return RdmaAnalyzer(system_info) + + +@pytest.fixture +def plugin_fixtures_path(): + return Path(__file__).parent / "fixtures" + + +@pytest.fixture +def clean_rdma_model(plugin_fixtures_path): + """RDMA data with no errors (all counters zero).""" + path = plugin_fixtures_path / "rdma_statistic_example_data.json" + data = json.loads(path.read_text()) + stats = [RdmaStatistics(**s) for s in data] + return RdmaDataModel(statistic_list=stats) + + +@pytest.fixture +def clean_stats(plugin_fixtures_path): + """List of clean RdmaStatistics (no errors) for building models with links.""" + path = plugin_fixtures_path / "rdma_statistic_example_data.json" + data = json.loads(path.read_text()) + return [RdmaStatistics(**s) for s in data] + + +def test_no_errors_detected(rdma_analyzer, clean_rdma_model): + """Test with nominal data that has no errors.""" + result = rdma_analyzer.analyze_data(clean_rdma_model) + assert result.status == ExecutionStatus.OK + assert len(result.events) == 0 + + +def test_single_error_detected(rdma_analyzer, clean_rdma_model): + """Test with data containing a single error.""" + stats = list(clean_rdma_model.statistic_list) + stats[0].tx_roce_errors = 5 + model = RdmaDataModel(statistic_list=stats) + result = rdma_analyzer.analyze_data(model) + assert 
result.status == ExecutionStatus.ERROR + assert "RDMA errors detected in statistics" in result.message + assert len(result.events) == 1 + assert result.events[0].description == "RDMA error detected on bnxt_re0: [tx_roce_errors]" + assert result.events[0].priority == EventPriority.ERROR + assert result.events[0].data["errors"] == {"tx_roce_errors": 5} + assert result.events[0].data["interface"] == "bnxt_re0" + + +def test_multiple_errors_detected(rdma_analyzer, clean_rdma_model): + """Test with data containing multiple errors (grouped per interface).""" + stats = list(clean_rdma_model.statistic_list) + stats[0].tx_roce_errors = 10 + stats[0].rx_roce_errors = 3 + stats[1].packet_seq_err = 7 + model = RdmaDataModel(statistic_list=stats) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.ERROR + assert "RDMA errors detected in statistics" in result.message + assert len(result.events) == 2 # one per interface + for event in result.events: + assert event.priority == EventPriority.ERROR + # Total 3 errors across 2 interfaces + assert sum(len(e.data["errors"]) for e in result.events) == 3 + + +def test_critical_error_detected(rdma_analyzer, clean_rdma_model): + """Test with data containing a critical error (grouped per interface).""" + stats = list(clean_rdma_model.statistic_list) + stats[0].unrecoverable_err = 1 + stats[0].res_tx_pci_err = 2 + model = RdmaDataModel(statistic_list=stats) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.ERROR + assert "RDMA errors detected in statistics" in result.message + assert len(result.events) == 1 # one event per interface + assert result.events[0].priority == EventPriority.CRITICAL + assert "unrecoverable_err" in result.events[0].data["errors"] + assert "res_tx_pci_err" in result.events[0].data["errors"] + + +def test_empty_statistics(rdma_analyzer): + """Test with empty statistics list: WARNING and message logged.""" + model = RdmaDataModel(statistic_list=[], 
link_list=[]) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.WARNING + assert result.message == "No RDMA devices found" + + +def test_multiple_interfaces_with_errors(rdma_analyzer, clean_rdma_model): + """Test with errors across multiple interfaces.""" + stats = list(clean_rdma_model.statistic_list) + stats[0].max_retry_exceeded = 15 + stats[2].local_ack_timeout_err = 8 + stats[4].out_of_buffer = 100 + model = RdmaDataModel(statistic_list=stats) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.ERROR + assert len(result.events) == 3 + interfaces = {event.data["interface"] for event in result.events} + assert len(interfaces) == 3 + + +def test_all_error_types(rdma_analyzer): + """Test that all error fields are properly detected (grouped in one event).""" + stats = RdmaStatistics( + ifname="bnxt_re_test", + port=1, + recoverable_errors=1, + tx_roce_errors=1, + unrecoverable_err=1, + ) + model = RdmaDataModel(statistic_list=[stats]) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.ERROR + assert len(result.events) == 1 # one event per interface + assert "unrecoverable_err" in result.events[0].data["errors"] + assert result.events[0].priority == EventPriority.CRITICAL + assert set(result.events[0].data["errors"].keys()) == { + "recoverable_errors", + "tx_roce_errors", + "unrecoverable_err", + } + + +def test_zero_errors_are_ignored(rdma_analyzer): + """Test that zero-value errors are not reported.""" + stats = RdmaStatistics( + ifname="bnxt_re_test", + port=1, + tx_roce_errors=0, + rx_roce_errors=0, + unrecoverable_err=0, + ) + model = RdmaDataModel(statistic_list=[stats]) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.OK + assert len(result.events) == 0 + + +def test_rdma_link_all_active(rdma_analyzer, clean_stats): + """Test with RDMA links that are all active and up.""" + links = [ + RdmaLink( + ifindex=0, + 
ifname="ionic_0", + port=1, + state="ACTIVE", + physical_state="LINK_UP", + netdev="benic0p1", + netdev_index=3, + ), + RdmaLink( + ifindex=1, + ifname="ionic_1", + port=1, + state="ACTIVE", + physical_state="LINK_UP", + netdev="benic1p1", + netdev_index=4, + ), + ] + model = RdmaDataModel(statistic_list=clean_stats, link_list=links) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.OK + assert result.message == "No RDMA errors detected in statistics" + assert len(result.events) == 0 + + +def test_rdma_link_down_detected(rdma_analyzer, clean_stats): + """Test with RDMA links that are down""" + links = [ + RdmaLink( + ifindex=0, + ifname="ionic_0", + port=1, + state="ACTIVE", + physical_state="LINK_UP", + netdev="benic0p1", + netdev_index=3, + ), + RdmaLink( + ifindex=1, + ifname="ionic_1", + port=1, + state="DOWN", + physical_state="LINK_DOWN", + netdev="benic1p1", + netdev_index=4, + ), + ] + model = RdmaDataModel(statistic_list=clean_stats, link_list=links) + result = rdma_analyzer.analyze_data(model) + # Current implementation only checks statistics, not link state + assert result.status == ExecutionStatus.OK + + +def test_rdma_link_empty_list(rdma_analyzer, clean_stats): + """Test with empty RDMA link list.""" + model = RdmaDataModel(statistic_list=clean_stats, link_list=[]) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.OK + assert result.message == "No RDMA errors detected in statistics" + + +def test_rdma_link_multiple_interfaces(rdma_analyzer, clean_stats): + """Test with multiple RDMA interfaces with different link states.""" + links = [ + RdmaLink( + ifindex=0, + ifname="ionic_0", + port=1, + state="ACTIVE", + physical_state="LINK_UP", + netdev="benic0p1", + netdev_index=3, + ), + RdmaLink( + ifindex=1, + ifname="ionic_1", + port=1, + state="ACTIVE", + physical_state="LINK_UP", + netdev="benic1p1", + netdev_index=4, + ), + RdmaLink( + ifindex=2, + ifname="ionic_2", + port=1, + 
state="ACTIVE", + physical_state="LINK_UP", + netdev="benic2p1", + netdev_index=5, + ), + ] + model = RdmaDataModel(statistic_list=clean_stats, link_list=links) + result = rdma_analyzer.analyze_data(model) + assert result.status == ExecutionStatus.OK + assert len(result.events) == 0 diff --git a/test/unit/plugin/test_rdma_collector.py b/test/unit/plugin/test_rdma_collector.py new file mode 100644 index 00000000..595e7e33 --- /dev/null +++ b/test/unit/plugin/test_rdma_collector.py @@ -0,0 +1,156 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+# +############################################################################### +from pathlib import Path + +import pytest + +from nodescraper.connection.inband.inband import CommandArtifact +from nodescraper.enums import ExecutionStatus, OSFamily +from nodescraper.enums.systeminteraction import SystemInteractionLevel +from nodescraper.plugins.inband.rdma.rdma_collector import RdmaCollector +from nodescraper.plugins.inband.rdma.rdmadata import RdmaDataModel + + +@pytest.fixture +def collector(system_info, conn_mock): + return RdmaCollector( + system_info=system_info, + system_interaction_level=SystemInteractionLevel.PASSIVE, + connection=conn_mock, + ) + + +@pytest.fixture +def rdma_statistic_output(): + path = Path(__file__).parent / "fixtures" / "rdma_statistic_example_data.json" + return path.read_text() + + +@pytest.fixture +def rdma_link_output(): + path = Path(__file__).parent / "fixtures" / "rdma_link_example_data.json" + return path.read_text() + + +def test_collect_success(collector, conn_mock, rdma_link_output, rdma_statistic_output): + """Successful collection returns RdmaDataModel with statistics and links (full fixtures).""" + collector.system_info.os_family = OSFamily.LINUX + conn_mock.run_command.side_effect = [ + CommandArtifact(exit_code=0, stdout=rdma_link_output, stderr="", command="rdma link -j"), + CommandArtifact( + exit_code=0, stdout=rdma_statistic_output, stderr="", command="rdma statistic -j" + ), + CommandArtifact(exit_code=0, stdout="", stderr="", command="rdma dev"), + CommandArtifact(exit_code=0, stdout="", stderr="", command="rdma link"), + ] + res, data = collector.collect_data() + assert res.status == ExecutionStatus.OK + assert data is not None + assert isinstance(data, RdmaDataModel) + # Full statistic fixture has 8 devices (bnxt_re0..bnxt_re7) with full stats + assert len(data.statistic_list) == 8 + assert data.statistic_list[0].ifname == "bnxt_re0" + # Full link fixture has 4 ionic links + assert len(data.link_list) == 4 + 
assert data.link_list[0].ifname == "ionic_0" + + +def test_collect_both_commands_fail(collector, conn_mock): + """When all rdma commands fail, status is EXECUTION_FAILURE and data is None.""" + collector.system_info.os_family = OSFamily.LINUX + conn_mock.run_command.return_value = CommandArtifact( + exit_code=1, stdout="", stderr="rdma command failed", command="rdma link -j" + ) + res, data = collector.collect_data() + assert res.status == ExecutionStatus.EXECUTION_FAILURE + assert data is None + + +def test_collect_empty_output(collector, conn_mock): + """No RDMA devices: WARNING, message 'No RDMA devices found', no data so analyzer is skipped.""" + collector.system_info.os_family = OSFamily.LINUX + conn_mock.run_command.side_effect = [ + CommandArtifact(exit_code=0, stdout="[]", stderr="", command="rdma link -j"), + CommandArtifact(exit_code=0, stdout="[]", stderr="", command="rdma statistic -j"), + CommandArtifact(exit_code=0, stdout="", stderr="", command="rdma dev"), + CommandArtifact(exit_code=0, stdout="", stderr="", command="rdma link"), + ] + res, data = collector.collect_data() + assert res.status == ExecutionStatus.WARNING + assert res.message == "No RDMA devices found" + assert data is None + + +# Sample text output for rdma dev / rdma link (non-JSON) +RDMA_DEV_OUTPUT = """0: abcdef25s0: node_type ca fw 1.117.1-a-63 node_guid 1234:56ff:890f:1111 sys_image_guid 1234:56ff:890f:1111 +1: abcdef105s0: node_type ca fw 1.117.1-a-63 node_guid 2222:81ff:3333:b450 sys_image_guid 2222:81ff:3333:b450""" + +RDMA_LINK_OUTPUT = """link rocep9s0/1 state DOWN physical_state POLLING netdev benic8p1 +link abcdef25s0/1 state DOWN physical_state POLLING netdev mock7p1 +""" + + +def test_parse_rdma_dev_roce(collector): + """Test parsing rdma dev output with RoCE devices.""" + devices = collector._parse_rdma_dev(RDMA_DEV_OUTPUT) + assert len(devices) == 2 + device1 = devices[0] + assert device1.device == "abcdef25s0" + assert device1.node_type == "ca" + assert 
device1.attributes["fw_version"] == "1.117.1-a-63" + assert device1.node_guid == "1234:56ff:890f:1111" + assert device1.sys_image_guid == "1234:56ff:890f:1111" + device2 = devices[1] + assert device2.device == "abcdef105s0" + assert device2.node_type == "ca" + assert device2.node_guid == "2222:81ff:3333:b450" + + +def test_parse_rdma_dev_empty(collector): + """Test parsing empty rdma dev output.""" + devices = collector._parse_rdma_dev("") + assert len(devices) == 0 + + +def test_parse_rdma_link_text_roce(collector): + """Test parsing rdma link (text) output with RoCE devices.""" + links = collector._parse_rdma_link_text(RDMA_LINK_OUTPUT) + assert len(links) == 2 + link1 = next((link for link in links if link.device == "rocep9s0"), None) + assert link1 is not None + assert link1.port == 1 + assert link1.state == "DOWN" + assert link1.physical_state == "POLLING" + assert link1.netdev == "benic8p1" + link2 = next((link for link in links if link.device == "abcdef25s0"), None) + assert link2 is not None + assert link2.netdev == "mock7p1" + + +def test_parse_rdma_link_text_empty(collector): + """Test parsing empty rdma link (text) output.""" + links = collector._parse_rdma_link_text("") + assert len(links) == 0 diff --git a/test/unit/plugin/test_redfish_endpoint_analyzer.py b/test/unit/plugin/test_redfish_endpoint_analyzer.py new file mode 100644 index 00000000..0f014245 --- /dev/null +++ b/test/unit/plugin/test_redfish_endpoint_analyzer.py @@ -0,0 +1,209 @@ +############################################################################### +# +# MIT License +# +# Copyright (c) 2026 Advanced Micro Devices, Inc. 
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+import pytest
+
+from nodescraper.enums import ExecutionStatus
+from nodescraper.plugins.ooband.redfish_endpoint import (
+    RedfishEndpointAnalyzer,
+    RedfishEndpointAnalyzerArgs,
+    RedfishEndpointDataModel,
+)
+from nodescraper.plugins.ooband.redfish_endpoint.endpoint_analyzer import (
+    _check_constraint,
+    _get_by_path,
+)
+
+
+@pytest.fixture
+def redfish_endpoint_analyzer(system_info):
+    return RedfishEndpointAnalyzer(system_info=system_info)
+
+
+def test_get_by_path_empty_returns_obj():
+    obj = {"a": 1}
+    assert _get_by_path(obj, "") == obj
+    assert _get_by_path(obj, " ") == obj
+
+
+def test_get_by_path_single_key():
+    assert _get_by_path({"x": 42}, "x") == 42
+    assert _get_by_path({"Status": {"Health": "OK"}}, "Status") == {"Health": "OK"}
+
+
+def test_get_by_path_nested_slash():
+    obj = {"Status": {"Health": "OK", "State": "Enabled"}}
+    assert _get_by_path(obj, "Status/Health") == "OK"
+    assert _get_by_path(obj, "Status/State") == "Enabled"
+
+
+def test_get_by_path_list_index():
+    obj = {"PowerControl": [{"PowerConsumedWatts": 100}, {"PowerConsumedWatts": 200}]}
+    assert _get_by_path(obj, "PowerControl/0/PowerConsumedWatts") == 100
+    assert _get_by_path(obj, "PowerControl/1/PowerConsumedWatts") == 200
+
+
+def test_get_by_path_missing_returns_none():
+    assert _get_by_path({"a": 1}, "b") is None
+    assert _get_by_path({"a": {"b": 2}}, "a/c") is None
+    assert _get_by_path(None, "a") is None
+
+
+def test_get_by_path_invalid_list_index():
+    obj = {"list": [1, 2, 3]}
+    assert _get_by_path(obj, "list/10") is None
+    assert _get_by_path(obj, "list/xyz") is None
+
+
+def test_check_constraint_eq_pass():
+    ok, msg = _check_constraint("On", {"eq": "On"})
+    assert ok is True
+
+
+def test_check_constraint_eq_fail():
+    ok, msg = _check_constraint("Off", {"eq": "On"})
+    assert ok is False
+    assert "On" in msg and "Off" in msg
+
+
+def test_check_constraint_min_max_pass():
+    ok, _ = _check_constraint(50, {"min": 0, "max": 100})
+    assert ok is True
+    ok, _ = _check_constraint(0, {"min": 0})
+    assert ok is True
+    ok, _ = _check_constraint(100, {"max": 100})
+    assert ok is True
+
+
+def test_check_constraint_min_fail():
+    ok, msg = _check_constraint(10, {"min": 20})
+    assert ok is False
+    assert "below min" in msg or "20" in msg
+
+
+def test_check_constraint_max_fail():
+    ok, msg = _check_constraint(150, {"max": 100})
+    assert ok is False
+    assert "above max" in msg or "100" in msg
+
+
+def test_check_constraint_any_of_pass():
+    ok, _ = _check_constraint("OK", {"anyOf": ["OK", "Warning"]})
+    assert ok is True
+    ok, _ = _check_constraint("Warning", {"anyOf": ["OK", "Warning"]})
+    assert ok is True
+
+
+def test_check_constraint_any_of_fail():
+    ok, msg = _check_constraint("Critical", {"anyOf": ["OK", "Warning"]})
+    assert ok is False
+    assert "any of" in msg or "OK" in msg
+
+
+def test_check_constraint_literal_match():
+    ok, _ = _check_constraint("On", "On")
+    assert ok is True
+    ok, msg = _check_constraint("Off", "On")
+    assert ok is False
+
+
+def test_redfish_endpoint_analyzer_no_checks(redfish_endpoint_analyzer):
+    data = RedfishEndpointDataModel(responses={"/redfish/v1": {}})
+    result = redfish_endpoint_analyzer.analyze_data(data, args=None)
+    assert result.status == ExecutionStatus.OK
+    assert result.message == "No checks configured"
+
+
+def test_redfish_endpoint_analyzer_empty_checks(redfish_endpoint_analyzer):
+    data = RedfishEndpointDataModel(responses={"/redfish/v1": {"Status": {"Health": "OK"}}})
+    result = redfish_endpoint_analyzer.analyze_data(
+        data, args=RedfishEndpointAnalyzerArgs(checks={})
+    )
+    assert result.status == ExecutionStatus.OK
+    assert result.message == "No checks configured"
+
+
+def test_redfish_endpoint_analyzer_all_pass(redfish_endpoint_analyzer):
+    data = RedfishEndpointDataModel(
+        responses={
+            "/redfish/v1/Systems/1": {"Status": {"Health": "OK"}, "PowerState": "On"},
+        }
+    )
+    args = RedfishEndpointAnalyzerArgs(
+        checks={
+            "/redfish/v1/Systems/1": {
+                "Status/Health": {"anyOf": ["OK", "Warning"]},
+                "PowerState": "On",
+            },
+        }
+    )
+    result = redfish_endpoint_analyzer.analyze_data(data, args=args)
+    assert result.status == ExecutionStatus.OK
+    assert result.message == "All Redfish endpoint checks passed"
+
+
+def test_redfish_endpoint_analyzer_one_fail(redfish_endpoint_analyzer):
+    data = RedfishEndpointDataModel(
+        responses={
+            "/redfish/v1/Systems/1": {"Status": {"Health": "Critical"}},
+        }
+    )
+    args = RedfishEndpointAnalyzerArgs(
+        checks={
+            "/redfish/v1/Systems/1": {"Status/Health": {"anyOf": ["OK", "Warning"]}},
+        }
+    )
+    result = redfish_endpoint_analyzer.analyze_data(data, args=args)
+    assert result.status == ExecutionStatus.ERROR
+    assert "check(s) failed" in result.message
+
+
+def test_redfish_endpoint_analyzer_uri_not_in_responses(redfish_endpoint_analyzer):
+    data = RedfishEndpointDataModel(responses={"/redfish/v1": {}})
+    args = RedfishEndpointAnalyzerArgs(
+        checks={
+            "/redfish/v1/Systems/1": {"Status/Health": "OK"},
+        }
+    )
+    result = redfish_endpoint_analyzer.analyze_data(data, args=args)
+    assert result.status == ExecutionStatus.ERROR
+    assert "check(s) failed" in result.message or "failed" in result.message
+
+
+def test_redfish_endpoint_analyzer_wildcard_applies_to_all_bodies(redfish_endpoint_analyzer):
+    data = RedfishEndpointDataModel(
+        responses={
+            "/redfish/v1/Chassis/1": {"Status": {"Health": "OK"}},
+            "/redfish/v1/Chassis/2": {"Status": {"Health": "OK"}},
+        }
+    )
+    args = RedfishEndpointAnalyzerArgs(
+        checks={
+            "*": {"Status/Health": {"anyOf": ["OK", "Warning"]}},
+        }
+    )
+    result = redfish_endpoint_analyzer.analyze_data(data, args=args)
+    assert result.status == ExecutionStatus.OK
+    assert result.message == "All Redfish endpoint checks passed"
diff --git a/test/unit/plugin/test_redfish_endpoint_collector.py b/test/unit/plugin/test_redfish_endpoint_collector.py
new file mode 100644
index 00000000..c2893ee1
--- /dev/null
+++ b/test/unit/plugin/test_redfish_endpoint_collector.py
@@ -0,0 +1,145 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+import pytest
+
+from nodescraper.connection.redfish import RedfishGetResult
+from nodescraper.enums import EventCategory, ExecutionStatus
+from nodescraper.plugins.ooband.redfish_endpoint import (
+    RedfishEndpointCollector,
+    RedfishEndpointCollectorArgs,
+)
+
+
+@pytest.fixture
+def redfish_endpoint_collector(system_info, redfish_conn_mock):
+    return RedfishEndpointCollector(
+        system_info=system_info,
+        connection=redfish_conn_mock,
+    )
+
+
+def test_redfish_endpoint_collector_no_uris(redfish_endpoint_collector):
+    result, data = redfish_endpoint_collector.collect_data()
+    assert result.status == ExecutionStatus.NOT_RAN
+    assert result.message == "No Redfish URIs configured"
+    assert data is None
+
+
+def test_redfish_endpoint_collector_no_uris_with_args(redfish_endpoint_collector):
+    result, data = redfish_endpoint_collector.collect_data(
+        args=RedfishEndpointCollectorArgs(uris=[])
+    )
+    assert result.status == ExecutionStatus.NOT_RAN
+    assert data is None
+
+
+def test_redfish_endpoint_collector_one_uri_success(redfish_endpoint_collector, redfish_conn_mock):
+    redfish_conn_mock.run_get.return_value = RedfishGetResult(
+        path="/redfish/v1",
+        success=True,
+        data={"Name": "Root"},
+        status_code=200,
+    )
+    result, data = redfish_endpoint_collector.collect_data(
+        args=RedfishEndpointCollectorArgs(uris=["/redfish/v1"])
+    )
+    assert result.status == ExecutionStatus.OK
+    assert result.message == "Collected 1 Redfish endpoint(s)"
+    assert data is not None
+    assert data.responses["/redfish/v1"]["Name"] == "Root"
+    redfish_conn_mock.run_get.assert_called_once()
+    call_path = redfish_conn_mock.run_get.call_args[0][0]
+    assert call_path == "/redfish/v1" or call_path.strip("/") == "redfish/v1"
+
+
+def test_redfish_endpoint_collector_uri_normalized_with_leading_slash(
+    redfish_endpoint_collector, redfish_conn_mock
+):
+    redfish_conn_mock.run_get.return_value = RedfishGetResult(
+        path="/redfish/v1/Systems",
+        success=True,
+        data={"Members": []},
+        status_code=200,
+    )
+    result, data = redfish_endpoint_collector.collect_data(
+        args=RedfishEndpointCollectorArgs(uris=["redfish/v1/Systems"])
+    )
+    assert result.status == ExecutionStatus.OK
+    assert data is not None
+    assert "/redfish/v1/Systems" in data.responses or "redfish/v1/Systems" in data.responses
+
+
+def test_redfish_endpoint_collector_one_fail_no_success(
+    redfish_endpoint_collector, redfish_conn_mock
+):
+    redfish_conn_mock.run_get.return_value = RedfishGetResult(
+        path="/redfish/v1",
+        success=False,
+        error="Connection refused",
+        status_code=None,
+    )
+    result, data = redfish_endpoint_collector.collect_data(
+        args=RedfishEndpointCollectorArgs(uris=["/redfish/v1"])
+    )
+    assert result.status == ExecutionStatus.ERROR
+    assert result.message.startswith("No Redfish endpoints could be read")
+    assert data is None
+    assert len(result.events) >= 1
+    assert any(
+        e.category == EventCategory.RUNTIME.value or "Redfish GET failed" in (e.description or "")
+        for e in result.events
+    )
+
+
+def test_redfish_endpoint_collector_mixed_success_fail(
+    redfish_endpoint_collector, redfish_conn_mock
+):
+    def run_get_side_effect(path):
+        path_str = str(path)
+        if "Systems" in path_str:
+            return RedfishGetResult(
+                path=path_str if path_str.startswith("/") else "/" + path_str,
+                success=True,
+                data={"Id": "1"},
+                status_code=200,
+            )
+        return RedfishGetResult(
+            path=path_str if path_str.startswith("/") else "/" + path_str,
+            success=False,
+            error="Not Found",
+            status_code=404,
+        )
+
+    redfish_conn_mock.run_get.side_effect = run_get_side_effect
+    result, data = redfish_endpoint_collector.collect_data(
+        args=RedfishEndpointCollectorArgs(uris=["/redfish/v1/Systems", "/redfish/v1/Bad"])
+    )
+    assert result.status == ExecutionStatus.OK
+    assert data is not None
+    assert len(data.responses) == 1
+    keys = list(data.responses.keys())
+    assert any("Systems" in k for k in keys)
+    assert list(data.responses.values())[0].get("Id") == "1"
diff --git a/test/unit/plugin/test_redfish_endpoint_plugin.py b/test/unit/plugin/test_redfish_endpoint_plugin.py
new file mode 100644
index 00000000..6ca56f6b
--- /dev/null
+++ b/test/unit/plugin/test_redfish_endpoint_plugin.py
@@ -0,0 +1,66 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from nodescraper.plugins.ooband.redfish_endpoint import (
+    RedfishEndpointAnalyzer,
+    RedfishEndpointAnalyzerArgs,
+    RedfishEndpointCollector,
+    RedfishEndpointCollectorArgs,
+    RedfishEndpointDataModel,
+    RedfishEndpointPlugin,
+)
+
+
+def test_redfish_endpoint_collector_args_default():
+    args = RedfishEndpointCollectorArgs()
+    assert args.uris == []
+
+
+def test_redfish_endpoint_collector_args_uris_stripped():
+    args = RedfishEndpointCollectorArgs(uris=[" /redfish/v1 ", "/Systems/1 ", " Chassis "])
+    assert args.uris == ["/redfish/v1", "/Systems/1", "Chassis"]
+
+
+def test_redfish_endpoint_collector_args_uris_empty_list():
+    args = RedfishEndpointCollectorArgs(uris=[])
+    assert args.uris == []
+
+
+def test_redfish_endpoint_data_model_default():
+    model = RedfishEndpointDataModel()
+    assert model.responses == {}
+
+
+def test_redfish_endpoint_data_model_responses():
+    model = RedfishEndpointDataModel(responses={"/redfish/v1": {"Name": "Root"}})
+    assert model.responses["/redfish/v1"]["Name"] == "Root"
+
+
+def test_redfish_endpoint_plugin_class_attributes():
+    assert RedfishEndpointPlugin.DATA_MODEL is RedfishEndpointDataModel
+    assert RedfishEndpointPlugin.COLLECTOR is RedfishEndpointCollector
+    assert RedfishEndpointPlugin.ANALYZER is RedfishEndpointAnalyzer
+    assert RedfishEndpointPlugin.COLLECTOR_ARGS is RedfishEndpointCollectorArgs
+    assert RedfishEndpointPlugin.ANALYZER_ARGS is RedfishEndpointAnalyzerArgs
diff --git a/test/unit/plugin/test_redfish_oem_diag_analyzer.py b/test/unit/plugin/test_redfish_oem_diag_analyzer.py
new file mode 100644
index 00000000..a8d6f1f7
--- /dev/null
+++ b/test/unit/plugin/test_redfish_oem_diag_analyzer.py
@@ -0,0 +1,114 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+import pytest
+
+from nodescraper.enums import ExecutionStatus
+from nodescraper.plugins.ooband.redfish_oem_diag import (
+    RedfishOemDiagAnalyzer,
+    RedfishOemDiagAnalyzerArgs,
+    RedfishOemDiagDataModel,
+)
+from nodescraper.plugins.ooband.redfish_oem_diag.oem_diag_data import OemDiagTypeResult
+
+
+@pytest.fixture
+def redfish_oem_diag_analyzer(system_info):
+    return RedfishOemDiagAnalyzer(system_info=system_info)
+
+
+def test_redfish_oem_diag_analyzer_no_results(redfish_oem_diag_analyzer):
+    data = RedfishOemDiagDataModel(results={})
+    result = redfish_oem_diag_analyzer.analyze_data(data)
+    assert result.status == ExecutionStatus.NOT_RAN
+    assert result.message == "No OEM diagnostic results to analyze"
+
+
+def test_redfish_oem_diag_analyzer_all_success(redfish_oem_diag_analyzer):
+    data = RedfishOemDiagDataModel(
+        results={
+            "JournalControl": OemDiagTypeResult(success=True, error=None, metadata={}),
+            "AllLogs": OemDiagTypeResult(success=True, error=None, metadata={}),
+        }
+    )
+    result = redfish_oem_diag_analyzer.analyze_data(data)
+    assert result.status == ExecutionStatus.OK
+    assert "2/2 types collected" in result.message
+
+
+def test_redfish_oem_diag_analyzer_some_failed_without_require_all(redfish_oem_diag_analyzer):
+    data = RedfishOemDiagDataModel(
+        results={
+            "JournalControl": OemDiagTypeResult(success=True, error=None, metadata={}),
+            "AllLogs": OemDiagTypeResult(success=False, error="Timeout", metadata=None),
+        }
+    )
+    result = redfish_oem_diag_analyzer.analyze_data(data)
+    assert result.status == ExecutionStatus.OK
+    assert "1/2 types collected" in result.message
+
+
+def test_redfish_oem_diag_analyzer_some_failed_with_require_all_success(redfish_oem_diag_analyzer):
+    data = RedfishOemDiagDataModel(
+        results={
+            "JournalControl": OemDiagTypeResult(success=True, error=None, metadata={}),
+            "AllLogs": OemDiagTypeResult(success=False, error="Task timeout", metadata=None),
+        }
+    )
+    args = RedfishOemDiagAnalyzerArgs(require_all_success=True)
+    result = redfish_oem_diag_analyzer.analyze_data(data, args=args)
+    assert result.status == ExecutionStatus.ERROR
+    assert "1 type(s) failed" in result.message
+    assert "AllLogs" in result.message
+    assert len(result.events) >= 1
+    assert any(
+        "AllLogs" in (e.description or "") or "failed" in (e.description or "").lower()
+        for e in result.events
+    )
+
+
+def test_redfish_oem_diag_analyzer_all_failed_require_all_success(redfish_oem_diag_analyzer):
+    data = RedfishOemDiagDataModel(
+        results={
+            "JournalControl": OemDiagTypeResult(success=False, error="Err1", metadata=None),
+            "AllLogs": OemDiagTypeResult(success=False, error="Err2", metadata=None),
+        }
+    )
+    args = RedfishOemDiagAnalyzerArgs(require_all_success=True)
+    result = redfish_oem_diag_analyzer.analyze_data(data, args=args)
+    assert result.status == ExecutionStatus.ERROR
+    assert "2 type(s) failed" in result.message
+
+
+def test_redfish_oem_diag_analyzer_require_all_success_all_ok(redfish_oem_diag_analyzer):
+    data = RedfishOemDiagDataModel(
+        results={
+            "JournalControl": OemDiagTypeResult(success=True, error=None, metadata={}),
+        }
+    )
+    args = RedfishOemDiagAnalyzerArgs(require_all_success=True)
+    result = redfish_oem_diag_analyzer.analyze_data(data, args=args)
+    assert result.status == ExecutionStatus.OK
+    assert "1/1 types collected" in result.message
diff --git a/test/unit/plugin/test_redfish_oem_diag_collector.py b/test/unit/plugin/test_redfish_oem_diag_collector.py
new file mode 100644
index 00000000..219e4d00
--- /dev/null
+++ b/test/unit/plugin/test_redfish_oem_diag_collector.py
@@ -0,0 +1,126 @@
+###############################################################################
+#
+# MIT License
+#
+# Copyright (c) 2026 Advanced Micro Devices, Inc.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+###############################################################################
+from unittest.mock import patch
+
+import pytest
+
+from nodescraper.enums import ExecutionStatus
+from nodescraper.plugins.ooband.redfish_oem_diag import (
+    RedfishOemDiagCollector,
+    RedfishOemDiagCollectorArgs,
+    RedfishOemDiagDataModel,
+)
+
+
+@pytest.fixture
+def redfish_oem_diag_collector(system_info, redfish_conn_mock):
+    return RedfishOemDiagCollector(
+        system_info=system_info,
+        connection=redfish_conn_mock,
+    )
+
+
+def test_redfish_oem_diag_collector_no_types_configured(redfish_oem_diag_collector):
+    result, data = redfish_oem_diag_collector.collect_data()
+    assert result.status == ExecutionStatus.NOT_RAN
+    assert result.message == "No OEM diagnostic types configured"
+    assert data is None
+
+
+def test_redfish_oem_diag_collector_empty_types_with_args(redfish_oem_diag_collector):
+    result, data = redfish_oem_diag_collector.collect_data(
+        args=RedfishOemDiagCollectorArgs(oem_diagnostic_types=[])
+    )
+    assert result.status == ExecutionStatus.NOT_RAN
+    assert data is None
+
+
+@patch("nodescraper.plugins.ooband.redfish_oem_diag.oem_diag_collector.collect_oem_diagnostic_data")
+def test_redfish_oem_diag_collector_one_type_success(mock_collect, redfish_oem_diag_collector):
+    mock_collect.return_value = (b"log bytes", {"Size": 123}, None)
+    result, data = redfish_oem_diag_collector.collect_data(
+        args=RedfishOemDiagCollectorArgs(oem_diagnostic_types=["JournalControl"])
+    )
+    assert result.status == ExecutionStatus.OK
+    assert "1/1 types collected" in result.message
+    assert data is not None
+    assert isinstance(data, RedfishOemDiagDataModel)
+    assert "JournalControl" in data.results
+    assert data.results["JournalControl"].success is True
+    assert data.results["JournalControl"].error is None
+    assert data.results["JournalControl"].metadata == {"Size": 123}
+    mock_collect.assert_called_once()
+
+
+@patch("nodescraper.plugins.ooband.redfish_oem_diag.oem_diag_collector.collect_oem_diagnostic_data")
+def test_redfish_oem_diag_collector_one_type_failure(mock_collect, redfish_oem_diag_collector):
+    mock_collect.return_value = (None, None, "Task timeout")
+    result, data = redfish_oem_diag_collector.collect_data(
+        args=RedfishOemDiagCollectorArgs(oem_diagnostic_types=["AllLogs"])
+    )
+    assert result.status == ExecutionStatus.ERROR
+    assert "0/1 types collected" in result.message
+    assert data is not None
+    assert data.results["AllLogs"].success is False
+    assert data.results["AllLogs"].error == "Task timeout"
+
+
+@patch("nodescraper.plugins.ooband.redfish_oem_diag.oem_diag_collector.collect_oem_diagnostic_data")
+def test_redfish_oem_diag_collector_mixed_success_fail(mock_collect, redfish_oem_diag_collector):
+    def side_effect(conn, log_service_path, oem_diagnostic_type, **kwargs):
+        if oem_diagnostic_type == "JournalControl":
+            return (b"data", {}, None)
+        return (None, None, "Not supported")
+
+    mock_collect.side_effect = side_effect
+    result, data = redfish_oem_diag_collector.collect_data(
+        args=RedfishOemDiagCollectorArgs(oem_diagnostic_types=["JournalControl", "AllLogs"])
+    )
+    assert result.status == ExecutionStatus.OK
+    assert "1/2 types collected" in result.message
+    assert data.results["JournalControl"].success is True
+    assert data.results["AllLogs"].success is False
+    assert data.results["AllLogs"].error == "Not supported"
+    assert mock_collect.call_count == 2
+
+
+@patch("nodescraper.plugins.ooband.redfish_oem_diag.oem_diag_collector.collect_oem_diagnostic_data")
+def test_redfish_oem_diag_collector_passes_args_to_connection(
+    mock_collect, redfish_oem_diag_collector, redfish_conn_mock
+):
+    mock_collect.return_value = (b"", {}, None)
+    args = RedfishOemDiagCollectorArgs(
+        log_service_path="redfish/v1/Systems/1/LogServices/DiagLogs",
+        oem_diagnostic_types=["JournalControl"],
+        task_timeout_s=300,
+    )
+    redfish_oem_diag_collector.collect_data(args=args)
+    mock_collect.assert_called_once()
+    call_kw = mock_collect.call_args[1]
+    assert call_kw["log_service_path"] == "redfish/v1/Systems/1/LogServices/DiagLogs"
+    assert call_kw["oem_diagnostic_type"] == "JournalControl"
+    assert call_kw["task_timeout_s"] == 300
+    assert call_kw["output_dir"] is None
diff --git a/test/unit/plugin/test_rocm_collector.py b/test/unit/plugin/test_rocm_collector.py
index ea7e4cdf..d1d1c09b 100644
--- a/test/unit/plugin/test_rocm_collector.py
+++ b/test/unit/plugin/test_rocm_collector.py
@@ -299,7 +299,7 @@ def test_invalid_rocm_version_format(collector):
 
 
 def test_collect_rocm_sub_versions(collector):
-    """Test collection of ROCm version and multiple sub-versions (mirrors error-scraper test_run_new_version)."""
+    """Test collection of ROCm version and multiple sub-versions."""
     sub_versions_stdout = (
         "/opt/rocm/.info/version:6.4.0-47\n"
         "/opt/rocm/.info/version-hip-libraries:6.4.0-47\n"
diff --git a/test/unit/plugin/test_sys_settings_collector.py b/test/unit/plugin/test_sys_settings_collector.py
index 88097aa8..b2da8b2e 100644
--- a/test/unit/plugin/test_sys_settings_collector.py
+++ b/test/unit/plugin/test_sys_settings_collector.py
@@ -163,3 +163,53 @@ def run_cmd(cmd, **kwargs):
     assert data.readings.get("/sys/kernel/mm/transparent_hugepage/enabled") == "safe"
     assert data.readings.get("/sys/kernel/mm/transparent_hugepage/defrag") == "safe"
     assert "/etc" not in str(seen_commands)
+
+
+def test_collect_data_glob_path_uses_ls_long(linux_sys_settings_collector):
+    seen_commands = []
+
+    def run_cmd(cmd, **kwargs):
+        seen_commands.append(cmd)
+        if "ls -l" in cmd:
+            return make_artifact(
+                0, "lrwxrwxrwx 1 root root 0 Jan 1 00:00 device -> ../../pci0000:00/0000:00:01.0"
+            )
+        return make_artifact(0, "[always] madvise never")
+
+    linux_sys_settings_collector._run_sut_cmd = run_cmd
+    args = {"paths": ["class/net/*/device"]}
+    result, data = linux_sys_settings_collector.collect_data(args)
+
+    assert result.status == ExecutionStatus.OK
+    assert data is not None
+    assert len(seen_commands) == 1
+    assert "ls -l /sys/class/net/*/device" in seen_commands[0]
+    assert "bash -c" in seen_commands[0]
+    assert data.readings.get("/sys/class/net/*/device") == (
+        "lrwxrwxrwx 1 root root 0 Jan 1 00:00 device -> ../../pci0000:00/0000:00:01.0"
+    )
+
+
+def test_collect_data_mixed_paths_cat_and_glob(linux_sys_settings_collector):
+
+    def run_cmd(cmd, **kwargs):
+        if "ls -l" in cmd:
+            return make_artifact(0, "lrwx 1 root root 0 device -> ../../device")
+        if "enabled" in cmd:
+            return make_artifact(0, "[always] madvise never")
+        return make_artifact(0, "[madvise] always never defer")
+
+    linux_sys_settings_collector._run_sut_cmd = run_cmd
+    args = {
+        "paths": [PATH_ENABLED, "class/net/*/device", PATH_DEFRAG],
+    }
+    result, data = linux_sys_settings_collector.collect_data(args)
+
+    assert result.status == ExecutionStatus.OK
+    assert data is not None
+    assert data.readings.get(PATH_ENABLED) == "always"
+    assert data.readings.get(PATH_DEFRAG) == "madvise"
+    assert (
+        data.readings.get("/sys/class/net/*/device") == "lrwx 1 root root 0 device -> ../../device"
+    )
+    assert "Sysfs collected 3 path(s)" in result.message