ML License Compliance Agent

How it works

Three scanners. One report.

The agent runs three scanners in sequence, applies your org's policy rules, and produces a CycloneDX ML Bill of Materials - natively parsed by GitLab's security dashboard.

01 - packages

Python dependency scan

Resolves the full transitive dependency tree. Catches GPL-licensed dependencies pulled in transitively by MIT-licensed packages.
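As a rough illustration of the idea, the sketch below walks a small dependency graph and reports every path that ends in a copyleft package. The `PACKAGES` table, the `my-app` root, and the `find_copyleft_paths` helper are hypothetical stand-ins, not the agent's actual code; a real scanner would read this metadata from installed distributions or a lockfile.

```python
from collections import deque

# Hypothetical metadata: package -> (license, direct dependencies).
PACKAGES = {
    "my-app": ("Proprietary", ["python-slugify"]),
    "python-slugify": ("MIT", ["Unidecode"]),
    "Unidecode": ("GPL-2.0-or-later", []),
}

# Simplified assumption: a license id starting with one of these is copyleft.
COPYLEFT_PREFIXES = ("GPL", "AGPL")


def find_copyleft_paths(root, packages):
    """Breadth-first walk from root; return one path per copyleft dependency reached."""
    paths = []
    seen = {root}
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        license_id, deps = packages[path[-1]]
        # Skip the root itself; only dependencies are reported.
        if len(path) > 1 and license_id.startswith(COPYLEFT_PREFIXES):
            paths.append(path)
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return paths


for path in find_copyleft_paths("my-app", PACKAGES):
    print(" -> ".join(path))
```

Keeping the whole path, rather than only the offending package, is what lets a report say which direct dependency to replace.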

02 - models

Hugging Face model scan

Queries the HF Hub API for every model referenced in your code. Flags Llama attribution, anti-distillation, and OpenRAIL restrictions.
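A minimal sketch of the flagging step, assuming the Hub metadata for a model includes a tag of the form `license:<id>` in its tag list. The `RESTRICTIONS` table and both helper functions are hypothetical, and the tag list is written by hand here rather than fetched from the Hub.

```python
# Hypothetical mapping from license ids to the restrictions named on this page.
RESTRICTIONS = {
    "llama3": ["attribution required", "anti-distillation clause"],
    "openrail": ["use-based restrictions"],
}


def license_from_tags(tags):
    """Return the license id from a tag list, or None if no license tag is present."""
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None


def flag_model(model_id, tags):
    """Attach known restrictions to a model based on its declared license."""
    license_id = license_from_tags(tags)
    if license_id is None:
        return {"model": model_id, "license": None, "flags": ["no license declared"]}
    return {
        "model": model_id,
        "license": license_id,
        "flags": RESTRICTIONS.get(license_id, []),
    }


# Hand-written example tags, for illustration only.
result = flag_model(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    ["transformers", "text-generation", "license:llama3"],
)
print(result)
```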

03 - datasets

Training dataset scan

Detects load_dataset() calls and checks each dataset's license. Flags CC-BY-NC datasets used in commercial training pipelines.
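One plausible way to detect these calls is static analysis with Python's `ast` module, sketched below. The `find_dataset_ids` helper and the sample source are illustrative assumptions; this version only resolves string-literal arguments and says nothing about how the agent itself does it.

```python
import ast

SOURCE = '''
from datasets import load_dataset

ds = load_dataset("lmsys/chatbot_arena_conversations")
other = load_dataset(dataset_name)  # dynamic argument, not resolvable statically
'''


def find_dataset_ids(source):
    """Return string-literal dataset ids passed as the first argument to load_dataset()."""
    ids = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both load_dataset(...) and datasets.load_dataset(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "load_dataset" or not node.args:
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            ids.append(first.value)
    return ids


print(find_dataset_ids(SOURCE))
```

Parsing the syntax tree instead of grepping avoids false hits in comments and strings, at the cost of missing dataset names that are built at runtime.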

Agent in action

Triggered via GitLab issues. Simple as that.

Mention @ai-ml-license-compliance-flow in a comment on any GitLab issue, name a target directory, and the agent scans it and reports back with findings and remediation steps.

AI Catalog Agent workflow in GitLab issue

🔍 Real scan of demo-repo/

5 findings: 1 CRITICAL, 3 HIGH, 1 MEDIUM

🚨 AGPL-3.0 SaaS Violation

ultralytics==8.0.200 deployed via FastAPI requires full source disclosure or enterprise license under AGPL Section 13.

⚠️ Llama 3 Attribution Missing

Meta Llama 3 license §1.b.i requires a prominent "Built with Meta Llama 3" notice on a related website, user interface, blog post, about page, or product documentation.

⚠️ Anti-Distillation Violation

distill.py uses Llama 3 outputs to train DistilGPT-2, violating Llama license §1.b.v.

⚠️ CC-BY-NC Commercial Use

lmsys/chatbot_arena_conversations dataset (CC-BY-NC-4.0) cannot be used for commercial training.

ℹ️ GPL-2.0+ Transitive Dependency

python-slugify → Unidecode (GPL-2.0+). Distributing this application requires GPL-2.0 source disclosure unless Unidecode is replaced with a permissive alternative (text-unidecode).
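The AGPL finding above depends on recognizing that the code is served over a network. A minimal heuristic, assuming the signals are the ones this page names (FastAPI, uvicorn, a Dockerfile), might look like the sketch below. The helper names and the sample inputs are hypothetical.

```python
import re

# Assumed signals, taken from the evidence this page lists for the demo repo.
SERVER_PACKAGES = {"fastapi", "uvicorn"}
DEPLOY_FILES = {"Dockerfile"}


def requirement_names(requirements_text):
    """Extract lowercase package names from requirements.txt-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = re.match(r"[A-Za-z0-9._-]+", line)
        if match:
            names.add(match.group(0).lower())
    return names


def saas_signals(requirements_text, filenames):
    """Return sorted evidence that the project is deployed as a network service."""
    found = requirement_names(requirements_text) & SERVER_PACKAGES
    found |= set(filenames) & DEPLOY_FILES
    return sorted(found)


# Illustrative inputs, not the actual contents of demo-repo/.
reqs = "fastapi\nuvicorn\nultralytics==8.0.200  # AGPL-3.0\n"
signals = saas_signals(reqs, ["Dockerfile", "inference.py"])
print(signals)
```

A heuristic like this can only suggest a deployment context; whether AGPL Section 13 applies to a given service remains a legal question.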

Comparison

What the other tools miss.

FOSSA, Snyk, and Black Duck cover code dependencies well. None of them evaluate custom AI model licenses or training dataset licenses.

Capability	FOSSA	Snyk	Black Duck	This agent
Python package licenses	✓	✓	✓	✓
Transitive dependency resolution	✓	✓	✓	✓
AGPL + SaaS context detection	-	-	-	✓
HF model license lookup (API)	-	-	file scan only	✓
Custom AI licenses (Llama, OpenRAIL)	-	-	-	✓
Llama attribution + anti-distillation	-	-	-	✓
Training dataset license scanning	-	-	-	✓
CC-BY-NC commercial use detection	-	-	-	✓
CycloneDX ML-BOM output	-	experimental	partial	✓
Plain-English summaries via Claude	-	-	-	✓

Interactive demo

See it catch real violations.

A real scan of demo-repo/ that finds all 5 compliance violations (1 critical, 3 high, 1 medium) across packages, models, and datasets.

Full compliance scan

Violations summary

# demo-repo/ - Complete real scan
 
# requirements.txt
python-slugify==8.0.1  # → Unidecode (GPL-2.0+)
ultralytics==8.0.200  # AGPL-3.0
 
# inference.py, distill.py
meta-llama/Meta-Llama-3-8B-Instruct  # Custom license
 
# train_data.py
load_dataset("lmsys/chatbot_arena_conversations")  # CC-BY-NC-4.0

CRITICAL 📦 ultralytics==8.0.200 (AGPL-3.0) - SaaS deployment detected

AGPL-3.0 §13 requires full source disclosure for network services. SaaS context confirmed: FastAPI + uvicorn + Dockerfile. Entire application must be open-sourced or licensed commercially.

Fix: Purchase Ultralytics Enterprise License or replace with RT-DETR (Apache-2.0)

HIGH 🤖 meta-llama/Meta-Llama-3-8B-Instruct - missing attribution

Llama 3 license §1.b.i requires a prominent "Built with Meta Llama 3" notice on a related website, user interface, blog post, about page, or product documentation. Not found in README or any public-facing files.

Fix: Add attribution statement to README.md

HIGH 🤖 meta-llama/Meta-Llama-3-8B-Instruct - anti-distillation violation

distill.py uses Llama 3 outputs to train DistilGPT-2. Llama license §1.b.v prohibits using model outputs to improve any other large language model.

Fix: Use Llama 3.1+ (relaxed restrictions) or Mistral-7B (Apache-2.0) as teacher model

HIGH 📊 lmsys/chatbot_arena_conversations - CC-BY-NC-4.0

This dataset prohibits commercial use. Cannot be used for training commercial models or in commercial SaaS products.

Fix: Replace with OpenAssistant/oasst2 (Apache-2.0) or Databricks Dolly (CC-BY-SA-3.0)

MEDIUM 📦 python-slugify==8.0.1 → Unidecode (GPL-2.0+)

Transitive GPL dependency. Distribution (including Docker images) requires GPL-2.0 source disclosure.

Fix: pip install text-unidecode==1.3 (Artistic License - permissive)

1 critical 3 high 1 medium

SCAN FAILED - 5 violations found
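The tally line and pass/fail verdict above could be derived from the findings roughly as follows. This is a sketch, not the agent's implementation: the `summarize` helper and its `fail_at` threshold are hypothetical, and the findings list is copied by hand from the scan.

```python
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM"]

# Findings copied from the demo scan on this page.
FINDINGS = [
    ("CRITICAL", "ultralytics==8.0.200 (AGPL-3.0) in a SaaS deployment"),
    ("HIGH", "Meta-Llama-3-8B-Instruct missing attribution"),
    ("HIGH", "Meta-Llama-3-8B-Instruct anti-distillation violation"),
    ("HIGH", "lmsys/chatbot_arena_conversations is CC-BY-NC-4.0"),
    ("MEDIUM", "python-slugify -> Unidecode (GPL-2.0+)"),
]


def summarize(findings, fail_at="MEDIUM"):
    """Count findings per severity and decide whether the scan fails.

    fail_at is an assumed policy threshold: any finding at that
    severity or above fails the scan.
    """
    counts = Counter(severity for severity, _ in findings)
    blocking = SEVERITY_ORDER[: SEVERITY_ORDER.index(fail_at) + 1]
    failed = any(counts[s] for s in blocking)
    return counts, failed


counts, failed = summarize(FINDINGS)
line = " ".join(f"{counts[s]} {s.lower()}" for s in SEVERITY_ORDER if counts[s])
print(line)
print("SCAN FAILED" if failed else "SCAN PASSED", f"- {len(FINDINGS)} violations found")
```

Making the threshold a parameter lets an organization decide, for example, that medium findings warn without failing the pipeline.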

# Agent reasoning & remediation priorities
 
Findings by severity:
  1 CRITICAL (legal exposure / license infringement risk)
  3 HIGH (clear license clause violations)
  1 MEDIUM (supply chain risk / GPL contamination)
 
Immediate actions required:
1. License or replace ultralytics (CRITICAL - licensing conflict)
2. Add Llama attribution (HIGH - 5-minute fix)
3. Remove distillation or swap teacher (HIGH - architecture change)
4. Replace dataset (HIGH - commercial use violation)
5. Fix GPL transitive dep (MEDIUM - distribution risk)
 
# CycloneDX ML-BOM generated + compliance report written
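For orientation, a CycloneDX BOM is a JSON document whose components can be typed as libraries, machine-learning models, or data. The sketch below builds a minimal one for three of the demo findings. It assumes CycloneDX 1.5 field names, has not been validated against the schema, and the `component` helper is hypothetical; the agent's real output likely carries more fields.

```python
import json


def component(kind, name, license_label, version=None):
    """Build one CycloneDX-style component entry with a named license."""
    entry = {
        "type": kind,
        "name": name,
        "licenses": [{"license": {"name": license_label}}],
    }
    if version is not None:
        entry["version"] = version
    return entry


bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "version": 1,
    "components": [
        component("library", "ultralytics", "AGPL-3.0", version="8.0.200"),
        component(
            "machine-learning-model",
            "meta-llama/Meta-Llama-3-8B-Instruct",
            "Llama 3 Community License",
        ),
        component("data", "lmsys/chatbot_arena_conversations", "CC-BY-NC-4.0"),
    ],
}

print(json.dumps(bom, indent=2))
```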

✓ Compliance Report Generated

All findings documented in compliance-report.md with actionable remediation steps, risk assessments, and recommended timelines.

✓ Report ready for review

Scan complete - 5 findings

The compliance gap no tool was filling.
