ML License Compliance Agent

The compliance gap no tool was filling.

The first CI/CD agent that audits Python packages, Hugging Face models, and training datasets together - in a single automated pass. Powered by Anthropic Claude via GitLab Duo.

65% of Hugging Face models have no license metadata
3-in-1: packages, models, and datasets in one CI/CD job
€15M maximum EU AI Act fine for non-compliant training data
How it works

Three scanners. One report.

The agent runs three scanners in sequence, applies your org's policy rules, and produces a CycloneDX ML Bill of Materials - natively parsed by GitLab's security dashboard.
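As a rough illustration of what such a report might contain, the sketch below builds a minimal CycloneDX-style document in Python. It assumes CycloneDX 1.5, which defines the "machine-learning-model" and "data" component types; the helper names (make_component, build_ml_bom) are hypothetical and are not the agent's actual code.

```python
import json


def make_component(kind, name, license_name, version=None):
    """Build one CycloneDX component entry (hypothetical helper).

    kind is a CycloneDX component type such as "library",
    "machine-learning-model", or "data".
    """
    component = {"type": kind, "name": name}
    if version:
        component["version"] = version
    # "name" is used instead of an SPDX "id" because custom model
    # licenses (for example Llama 3) have no SPDX identifier.
    component["licenses"] = [{"license": {"name": license_name}}]
    return component


def build_ml_bom(components):
    """Wrap components in a minimal CycloneDX 1.5 envelope."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": components,
    }


bom = build_ml_bom([
    make_component("library", "ultralytics", "AGPL-3.0", version="8.0.200"),
    make_component("machine-learning-model",
                   "meta-llama/Meta-Llama-3-8B-Instruct", "llama3"),
    make_component("data", "lmsys/chatbot_arena_conversations", "CC-BY-NC-4.0"),
])
print(json.dumps(bom, indent=2))
```

A real ML-BOM would carry more fields (package URLs, hashes, model cards); this only shows how packages, models, and datasets can sit side by side in one document.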

01 - packages
Python dependency scan
Resolves the full transitive dependency tree. Catches GPL contamination hiding inside MIT-licensed packages.
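The idea behind the transitive scan can be sketched as a breadth-first walk over a dependency graph. The graph and license table below are hand-written stand-ins (an assumption for illustration); a real scanner might build them from importlib.metadata or from the resolver's output. The function name find_copyleft_paths is hypothetical.

```python
from collections import deque

# License families treated as copyleft in this sketch (an assumption,
# not the agent's actual policy).
COPYLEFT_PREFIXES = ("GPL", "AGPL", "LGPL")


def find_copyleft_paths(root, requires, licenses):
    """Walk the dependency graph breadth-first from root and return
    every dependency path that ends at a copyleft-licensed package."""
    paths = []
    queue = deque([[root]])
    seen = {root}
    while queue:
        path = queue.popleft()
        package = path[-1]
        if licenses.get(package, "").upper().startswith(COPYLEFT_PREFIXES):
            paths.append(path)
        for dependency in requires.get(package, []):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(path + [dependency])
    return paths


# Hand-written example graph mirroring the demo finding.
requires = {
    "my-app": ["python-slugify"],
    "python-slugify": ["Unidecode"],
}
licenses = {
    "my-app": "Proprietary",
    "python-slugify": "MIT",
    "Unidecode": "GPL-2.0-or-later",
}

for path in find_copyleft_paths("my-app", requires, licenses):
    print(" -> ".join(path))
```

Returning the full path, rather than just the offending package, makes it clear which direct dependency pulls the copyleft code in.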
02 - models
Hugging Face model scan
Queries the HF Hub API for every model referenced in your code. Flags Llama attribution, anti-distillation, and OpenRAIL restrictions.
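A minimal sketch of such a lookup, using only the standard library. It assumes the public endpoint https://huggingface.co/api/models/<repo_id> returns JSON with a "tags" list containing entries like "license:apache-2.0"; the function names and the review rule are illustrative assumptions, not the agent's implementation.

```python
import json
import urllib.request


def license_from_tags(tags):
    """Return the license identifier from Hub tags such as
    'license:apache-2.0', or None when no license tag is present."""
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None


def classify(license_id):
    """Very rough triage (assumed rule): missing metadata, custom
    licenses that need a human or policy check, or everything else."""
    if license_id is None:
        return "missing"
    if license_id.startswith("llama") or "openrail" in license_id:
        return "review"
    return "ok"


def fetch_model_tags(repo_id):
    """Fetch a model's tags from the Hub. Makes a network call."""
    url = f"https://huggingface.co/api/models/{repo_id}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response).get("tags", [])


# Offline example with a hand-written tag list:
example_tags = ["transformers", "text-generation", "license:llama3"]
print(classify(license_from_tags(example_tags)))
```

With network access, classify(license_from_tags(fetch_model_tags(repo_id))) would run the same triage against live metadata.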
03 - datasets
Training dataset scan
Detects load_dataset() calls and checks licenses. Flags CC-BY-NC datasets used in commercial training pipelines.
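One way to detect these calls is to parse the source with Python's ast module instead of using regular expressions. The sketch below only handles dataset IDs passed as string literals, and its non-commercial check is a deliberately naive assumption; the function names are hypothetical.

```python
import ast


def find_dataset_ids(source):
    """Return dataset IDs passed as string literals to load_dataset(),
    whether called as load_dataset(...) or datasets.load_dataset(...)."""
    ids = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "load_dataset" and node.args:
            first = node.args[0]
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                ids.append(first.value)
    return ids


def is_noncommercial(license_id):
    """Naive check (assumption): Creative Commons NC variants carry '-nc'."""
    return "-nc" in license_id.lower()


sample = '''
from datasets import load_dataset
ds = load_dataset("lmsys/chatbot_arena_conversations")
'''

print(find_dataset_ids(sample))
print(is_noncommercial("CC-BY-NC-4.0"))
```

Dataset names built at runtime (variables, f-strings) are invisible to this approach and would need a different strategy.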
Agent in action

Triggered via GitLab issues. Simple as that.

Comment @ai-ml-license-compliance-flow on any GitLab issue, add a target directory, and the agent scans and reports back with findings and remediation steps.

AI Catalog Agent workflow in GitLab issue
🔍 Real scan of demo-repo/
5 findings: 1 CRITICAL, 3 HIGH, 1 MEDIUM
🚨 AGPL-3.0 SaaS Violation
ultralytics==8.0.200 deployed via FastAPI requires full source disclosure under AGPL-3.0 Section 13, or an enterprise license.
⚠️ Llama 3 Attribution Missing
Meta Llama 3 license §1.b.i requires "Built with Meta Llama 3" to be prominently displayed on a related website, user interface, blog post, about page, or product documentation.
⚠️ Anti-Distillation Violation
distill.py uses Llama 3 outputs to train DistilGPT-2, violating Llama license §1.b.v.
⚠️ CC-BY-NC Commercial Use
lmsys/chatbot_arena_conversations dataset (CC-BY-NC-4.0) cannot be used for commercial training.
ℹ️ GPL-2.0+ Transitive Dependency
python-slugify → Unidecode (GPL-2.0+). Distributing this application requires GPL-2.0 source disclosure unless Unidecode is replaced with a permissive alternative (text-unidecode).
Comparison

What the other tools miss.

FOSSA, Snyk, and Black Duck cover code dependencies well. None of them handle custom AI model licenses or training datasets.

Capability                              FOSSA  Snyk          Black Duck      This agent
Python package licenses                 ✓      ✓             ✓               ✓
Transitive dependency resolution        ✓      ✓             ✓               ✓
AGPL + SaaS context detection           -      -             -               ✓
HF model license lookup (API)           -      -             file scan only  ✓
Custom AI licenses (Llama, OpenRAIL)    -      -             -               ✓
Llama attribution + anti-distillation   -      -             -               ✓
Training dataset license scanning       -      -             -               ✓
CC-BY-NC commercial use detection       -      -             -               ✓
CycloneDX ML-BOM output                 -      experimental  partial         ✓
Plain-English summaries via Claude      -      -             -               ✓
Interactive demo

See it catch real violations.

A real scan of demo-repo/ finding all 5 compliance violations (1 critical, 3 high, 1 medium) across packages, models, and datasets.

ml-license-compliance-agent - demo-repo/
Full compliance scan
Violations summary
# demo-repo/ - Complete real scan
 
# requirements.txt
python-slugify==8.0.1  # → Unidecode (GPL-2.0+)
ultralytics==8.0.200  # AGPL-3.0
 
# inference.py, distill.py
meta-llama/Meta-Llama-3-8B-Instruct  # Custom license
 
# train_data.py
load_dataset("lmsys/chatbot_arena_conversations")  # CC-BY-NC-4.0
CRITICAL 📦 ultralytics==8.0.200 (AGPL-3.0) - SaaS deployment detected
AGPL-3.0 §13 requires full source disclosure for network services. SaaS context confirmed: FastAPI + uvicorn + Dockerfile. Entire application must be open-sourced or licensed commercially.
Fix: Purchase Ultralytics Enterprise License or replace with RT-DETR (Apache-2.0)
HIGH 🤖 meta-llama/Meta-Llama-3-8B-Instruct - missing attribution
Llama 3 license §1.b.i requires "Built with Meta Llama 3" to be prominently displayed on a related website, user interface, blog post, about page, or product documentation. Not found in README or any public-facing files.
Fix: Add attribution statement to README.md
HIGH 🤖 meta-llama/Meta-Llama-3-8B-Instruct - anti-distillation violation
distill.py uses Llama 3 outputs to train DistilGPT-2. Llama license §1.b.v prohibits using model outputs to improve any other large language model.
Fix: Use Llama 3.1+ (relaxed restrictions) or Mistral-7B (Apache-2.0) as teacher model
HIGH 📊 lmsys/chatbot_arena_conversations - CC-BY-NC-4.0
This dataset prohibits commercial use. Cannot be used for training commercial models or in commercial SaaS products.
Fix: Replace with OpenAssistant/oasst2 (Apache-2.0) or Databricks Dolly (CC-BY-SA-3.0)
MEDIUM 📦 python-slugify==8.0.1 → Unidecode (GPL-2.0+)
Transitive GPL dependency. Distribution (including Docker images) requires GPL-2.0 source disclosure.
Fix: pip install text-unidecode==1.3 (Artistic License - permissive)
1 critical   3 high   1 medium
SCAN FAILED - 5 violations found
# Agent reasoning & remediation priorities
 
Findings by severity:
1 CRITICAL (legal exposure / license infringement risk)
3 HIGH (clear license clause violations)
1 MEDIUM (supply chain risk / GPL contamination)
 
Immediate actions required:
1. Replace ultralytics (CRITICAL - licensing conflict)
2. Add Llama attribution (HIGH - 5-minute fix)
3. Remove distillation or swap teacher (HIGH - architecture change)
4. Replace dataset (HIGH - commercial use violation)
5. Fix GPL transitive dep (MEDIUM - distribution risk)
 
# CycloneDX ML-BOM generated + compliance report written
✓ Compliance Report Generated
All findings documented in compliance-report.md with actionable remediation steps, risk assessments, and recommended timelines.
✓ Report ready for review
Scan complete - 5 findings
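The prioritization and pass/fail decision shown in the demo output could be expressed roughly as follows. This is a sketch under assumptions: the Finding class, the severity ordering, and the fail_on threshold are hypothetical names chosen for illustration, with the five findings copied from the demo scan.

```python
from dataclasses import dataclass

# Lower number means more urgent (assumed ordering).
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass
class Finding:
    severity: str
    subject: str
    summary: str


def prioritize(findings):
    """Sort findings from most to least severe (stable within a level)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])


def exit_code(findings, fail_on="MEDIUM"):
    """Return 1 (fail the CI job) if any finding is at or above the
    fail_on severity, else 0."""
    threshold = SEVERITY_ORDER[fail_on]
    return 1 if any(SEVERITY_ORDER[f.severity] <= threshold for f in findings) else 0


findings = [
    Finding("MEDIUM", "python-slugify==8.0.1", "GPL-2.0+ transitive dependency"),
    Finding("HIGH", "meta-llama/Meta-Llama-3-8B-Instruct", "missing attribution"),
    Finding("CRITICAL", "ultralytics==8.0.200", "AGPL-3.0 in SaaS deployment"),
    Finding("HIGH", "meta-llama/Meta-Llama-3-8B-Instruct", "anti-distillation violation"),
    Finding("HIGH", "lmsys/chatbot_arena_conversations", "CC-BY-NC-4.0 commercial use"),
]

for finding in prioritize(findings):
    print(f"{finding.severity:8} {finding.subject} - {finding.summary}")
print("exit code:", exit_code(findings))
```

Making the threshold a parameter lets an organization decide, for instance, that MEDIUM findings should warn rather than fail the pipeline.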