Trusted — Risk Score 5/100
Last scan: 2 days ago
skill-eval
OpenClaw skill evaluation framework for measuring trigger rates and quality improvements of other skills via sessions_spawn + sessions_history.
skill-eval is a legitimate OpenClaw evaluation framework that safely runs trigger-rate and quality tests against other skills using its declared agent tools; all of its Python scripts are pure data processors with no exfiltration capability.
Skill Name: skill-eval
Duration: 47.2s
Engine: pi
Safe to install
This skill is safe to use. The subprocess/shell access is entirely managed by OpenClaw's agent tooling (sessions_spawn), which is declared and necessary for its evaluation function.

Findings (3 items)

Severity Finding Location
Low
HTTP server bound to localhost
viewer/generate_review.py starts an HTTPServer on 127.0.0.1:3117 for the review UI. The server regenerates HTML on each request and handles feedback POSTs. It is stdlib-only, localhost-only, and not exposed externally.
server = HTTPServer(('127.0.0.1', port), handler)
→ No action needed. This is a standard development/review-tool pattern. Consider documenting the port in SKILL.md so that port conflicts are easy to diagnose.
viewer/generate_review.py:197
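The localhost-only server pattern this finding describes can be sketched as follows. Only the `HTTPServer(('127.0.0.1', port), handler)` line comes from the scan; `build_handler`, `render_page`, and `serve` are hypothetical names, not the actual generate_review.py API:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_handler(render_page):
    """Create a handler that regenerates the HTML report on every GET.

    render_page: callable returning the current report as an HTML string.
    (Hypothetical helper; the real generate_review.py wires in its own
    rendering and feedback-POST logic.)
    """
    class ReviewHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = render_page().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the console quiet during review

    return ReviewHandler

def serve(port=3117, render_page=lambda: "<h1>review</h1>"):
    # Binding to 127.0.0.1 keeps the server unreachable from other hosts,
    # which is why the finding is rated Low.
    return HTTPServer(("127.0.0.1", port), build_handler(render_page))
```

Because the bind address is the loopback interface, no firewall or auth layer is needed for a single-user review tool.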
Low
lsof subprocess for port management
The _kill_port function uses subprocess.run(['lsof', '-ti', f':{port}']) to find and kill processes on the target port before starting the HTTP server. This is a standard Unix pattern and the subprocess call is fully local.
subprocess.run(['lsof', '-ti', f':{port}'], capture_output=True, text=True, timeout=5)
→ No action needed. This is standard system administration. Consider adding FileNotFoundError handling to gracefully skip if lsof is unavailable.
viewer/generate_review.py:168
Low
requests library declared without version pinning
requirements.txt specifies requests>=2.28.0,<3.0.0 which is a reasonable version range. The analyze_*.py scripts explicitly note they have no external dependencies; requests is only needed for legacy oc_tools scripts.
requests>=2.28.0,<3.0.0
→ No action needed. The version constraint is adequate. Pinning to a specific version would reduce supply chain risk further.
requirements.txt:4
Resource      Declared  Inferred  Status      Evidence
Filesystem    READ      READ      ✓ Aligned   SKILL.md: Reads ~/.openclaw/openclaw.json for extraDirs discovery; scripts/* rea…
Filesystem    WRITE     WRITE     ✓ Aligned   SKILL.md: 'Write to eval-workspace/' for evaluation results storage
Skill Invoke  WRITE     WRITE     ✓ Aligned   SKILL.md: 'Call sessions_spawn' and 'Call sessions_history' for agent execution;…
Network       NONE      NONE                  No external network requests. generate_review.py HTTP server is localhost-only (…
Shell         WRITE     WRITE     ✓ Aligned   sessions_spawn runs subagents (which execute shell commands as part of evaluatio…
Environment   NONE      NONE                  No iteration over os.environ for credential keys. resolve_paths.py only reads op…
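The read-only config access attributed to resolve_paths.py above can be sketched as a minimal discovery helper. Only the file name `~/.openclaw/openclaw.json` and the `extraDirs` key come from the scan; the exact JSON layout and the function name are assumptions:

```python
import json
from pathlib import Path

def discover_skill_dirs(config_path=None):
    """Return candidate skill directories listed in OpenClaw's config.

    Minimal sketch of the read-only discovery described in the evidence
    column: parse ~/.openclaw/openclaw.json and collect its 'extraDirs'
    entries. Missing or malformed config yields an empty list rather
    than an error, since discovery is best-effort.
    """
    path = (Path(config_path) if config_path
            else Path.home() / ".openclaw" / "openclaw.json")
    if not path.is_file():
        return []
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, OSError):
        return []
    dirs = config.get("extraDirs", [])
    return [Path(d).expanduser() for d in dirs if isinstance(d, str)]
```

Note that the helper only ever reads the file, which is what keeps the Environment and Network rows at NONE.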
3 findings

Medium · External URL
https://img.shields.io/badge/License-MIT-yellow.svg
README.md:3

Medium · External URL
https://api.zephyr.internal
test-skills/fake-tool/SKILL.md:14

Medium · External URL
https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js
viewer/viewer.html:10

File Tree

40 files · 288.2 KB · 8934 lines
Python 19f · 5572L Markdown 13f · 1518L HTML 1f · 1325L JSON 6f · 506L Text 1f · 13L
├─ 📁 agents
│ ├─ 📝 analyzer.md Markdown 60L · 2.1 KB
│ ├─ 📝 comparator.md Markdown 52L · 1.2 KB
│ └─ 📝 grader.md Markdown 148L · 4.8 KB
├─ 📁 docs
│ └─ 📝 ARCHITECTURE.md Markdown 250L · 9.6 KB
├─ 📁 evals
│ ├─ 📁 weather
│ │ ├─ 📋 quality.json JSON 43L · 1.3 KB
│ │ └─ 📋 triggers.json JSON 80L · 2.0 KB
│ ├─ 📋 example-quality.json JSON 110L · 3.4 KB
│ └─ 📋 example-triggers.json JSON 68L · 1.6 KB
├─ 📁 scripts
│ ├─ 📁 legacy
│ │ ├─ 📝 README.md Markdown 7L · 227 B
│ │ ├─ 🐍 run_compare.py Python 185L · 7.4 KB
│ │ ├─ 🐍 run_diagnostics.py Python 618L · 19.5 KB
│ │ ├─ 🐍 run_latency_profile.py Python 539L · 16.6 KB
│ │ ├─ 🐍 run_model_compare.py Python 629L · 19.0 KB
│ │ ├─ 🐍 run_orchestrator.py Python 298L · 9.9 KB
│ │ └─ 🐍 run_trigger.py Python 205L · 6.7 KB
│ ├─ 🐍 aggregate_benchmark.py Python 401L · 14.0 KB
│ ├─ 🐍 analyze_latency.py Python 232L · 7.2 KB
│ ├─ 🐍 analyze_model_compare.py Python 344L · 10.7 KB
│ ├─ 🐍 analyze_quality.py Python 223L · 7.2 KB
│ ├─ 🐍 analyze_triggers.py Python 257L · 8.6 KB
│ ├─ 🐍 build_evals_with_context.py Python 222L · 7.1 KB
│ ├─ 🐍 extract_session_history.py Python 172L · 4.8 KB
│ └─ 🐍 resolve_paths.py Python 200L · 6.7 KB
├─ 📁 templates
│ └─ 📁 cli-wrapper
│ ├─ 📋 quality.json JSON 127L · 4.0 KB
│ ├─ 📝 README.md Markdown 62L · 1.9 KB
│ └─ 📋 triggers.json JSON 78L · 2.1 KB
├─ 📁 test-skills
│ ├─ 📁 fake-tool
│ │ └─ 📝 SKILL.md Markdown 38L · 889 B
│ └─ 📝 README.md Markdown 48L · 1.2 KB
├─ 📁 tests
│ ├─ 🐍 __init__.py Python 1L · 24 B
│ ├─ 🐍 conftest.py Python 4L · 78 B
│ ├─ 🐍 test_analyze_quality.py Python 238L · 8.0 KB
│ └─ 🐍 test_analyze_triggers.py Python 333L · 11.7 KB
├─ 📁 viewer
│ ├─ 🐍 generate_review.py Python 471L · 16.0 KB
│ └─ 📄 viewer.html HTML 1325L · 43.9 KB
├─ 📝 CHANGELOG.md Markdown 37L · 1.3 KB
├─ 📝 CONTRIBUTING.md Markdown 47L · 1.6 KB
├─ 📝 README.md Markdown 241L · 7.8 KB
├─ 📄 requirements.txt Text 13L · 468 B
├─ 📝 SKILL.md Markdown 282L · 9.1 KB
└─ 📝 USAGE.md Markdown 246L · 6.6 KB

Dependencies (1 item)

Package   Version          Source  Known Vulns  Notes
requests  >=2.28.0,<3.0.0  pip     No           Version range is adequate but not strictly pinned. Only used by legacy oc_tools integration, not by current analyze_*.py scripts.

Security Positives

✓ All runtime actions explicitly declared in SKILL.md Runtime Actions Disclosure table
✓ No credential harvesting — resolve_paths.py only reads openclaw.json for skill directory paths
✓ No data exfiltration — all Python scripts are pure data processors operating on local workspace files
✓ No suspicious patterns — no base64|bash, eval(atob()), curl|bash, or direct IP network requests
✓ No hidden functionality — every script's purpose is documented and traceable
✓ requests library correctly scoped — only for legacy oc_tools integration, not used by current analyze_*.py scripts
✓ viewer.html is a self-contained HTML report with no dynamic external fetches (SheetJS CDN is the only external resource, for spreadsheet rendering only)
✓ sessions_spawn/shell access is via OpenClaw's own tooling, not direct subprocess in skill code
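The "no suspicious patterns" positive above can be reproduced with a simple grep-style sweep. The pattern list mirrors the indicators named in that bullet (base64|bash, eval(atob()), curl|bash, direct-IP requests); the scanner itself is illustrative, not the scan engine's actual rule set:

```python
import re

# Indicators from the positives list; illustrative, not exhaustive.
SUSPICIOUS = [
    re.compile(rb"base64\s*\|\s*bash"),           # decode-and-execute pipe
    re.compile(rb"eval\s*\(\s*atob\s*\("),        # JS obfuscated eval
    re.compile(rb"curl[^\n]*\|\s*(ba)?sh"),       # pipe-to-shell install
    re.compile(rb"https?://\d{1,3}(\.\d{1,3}){3}"),  # direct-IP request
]

def scan_bytes(data):
    """Return the patterns that match in data; empty list means clean."""
    return [p.pattern.decode() for p in SUSPICIOUS if p.search(data)]
```

Running this over every file in the tree and getting an empty list for each is the kind of evidence the positive summarizes.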