The upstream ontologist
The Debian Janitor is an automated system that commits fixes for (minor) issues in Debian packages that can be fixed by software. It gradually started proposing merges in early December. The first set of changes sent out ran lintian-brush on sid packages maintained in Git. This post is part of a series about the progress of the Janitor.
The upstream ontologist is a project that extracts metadata about upstream projects in a consistent format. It does this by combining heuristics with ecosystem-specific metadata files, such as Python’s setup.py and Rust’s Cargo.toml, and by scanning files such as the README.
Supported Data Sources
It will extract information from a wide variety of sources, including:
- Python package metadata (PKG-INFO, setup.py, setup.cfg, pyproject.toml)
- package.json
- composer.json
- package.xml
- Perl package metadata (dist.ini, META.json, META.yml, Makefile.PL)
- Perl POD files
- GNU configure files
- R DESCRIPTION files
- Rust Cargo.toml
- Maven pom.xml
- metainfo.xml
- .git/config
- SECURITY.md
- DOAP
- Haskell cabal files
- Ruby gemspec files
- go.mod
- README{,.rst,.md} files
- Debian packaging metadata (debian/watch, debian/control, debian/rules, debian/get-orig-source.sh, debian/copyright, debian/patches)
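As an illustration of what reading one of these sources can involve, here is a minimal sketch for package.json. It is not the ontologist's own code: the function name guess_from_package_json and the choice of which key feeds which field (for example, "description" into Summary) are assumptions made for this example. The package.json keys themselves are standard npm ones.

```python
import json


def guess_from_package_json(text):
    """Map a few package.json keys onto upstream metadata field names.

    Hypothetical helper for illustration only; the key-to-field mapping
    is an assumption, not the ontologist's actual behaviour.
    """
    data = json.loads(text)
    fields = {}

    # Keys whose value is expected to be a plain string.
    simple_keys = [
        ("name", "Name"),
        ("homepage", "Homepage"),
        ("description", "Summary"),
        ("license", "License"),
        ("version", "Version"),
    ]
    for key, field in simple_keys:
        value = data.get(key)
        if isinstance(value, str):
            fields[field] = value

    # "repository" and "bugs" may be a plain string or an object with "url".
    for key, field in [("repository", "Repository"), ("bugs", "Bug-Database")]:
        value = data.get(key)
        if isinstance(value, dict):
            value = value.get("url")
        if isinstance(value, str):
            fields[field] = value

    return fields


if __name__ == "__main__":
    example = json.dumps({
        "name": "demo",
        "homepage": "https://example.com/demo",
        "repository": {"type": "git", "url": "https://example.com/demo.git"},
    })
    for field, value in guess_from_package_json(example).items():
        print(f"{field}: {value}")
```

Values that are missing or have an unexpected type are skipped rather than guessed at, so the sketch only reports what the file actually declares.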
Supported Fields
Fields that it currently provides include:
- Homepage: homepage URL
- Name: name of the upstream project
- Contact: contact address for the upstream project (e-mail address or mailing list URL)
- Repository: VCS URL
- Repository-Browse: Web URL for viewing the VCS
- Bug-Database: Bug database URL (for web viewing, generally)
- Bug-Submit: URL to use to submit new bugs (either on the web or an e-mail address)
- Screenshots: List of URLs with screenshots
- Archive: Archive used - e.g. SourceForge
- Security-Contact: e-mail or URL with instructions for reporting security issues
- Documentation: Link to documentation on the web
- Wiki: Wiki URL
- Summary: one-line description of the project
- Description: longer description of the project
- License: Single-line license description (e.g. “GPL 2.0”) as declared in the metadata[1]
- Copyright: List of copyright holders
- Version: Current upstream version
- Security-MD: URL to markdown file with security policy
All data fields have a “certainty” associated with them (“certain”, “confident”, “likely” or “possible”), which gets set depending on how the data was derived or where it was found. If multiple possible values were found for a specific field, then the value with the highest certainty is taken.
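The selection rule can be sketched in a few lines of Python. This is an illustration rather than the ontologist's implementation: the function name pick_best and the (field, value, certainty) tuple layout are made up for the example. It also assumes that the four levels rank from "possible" (lowest) to "certain" (highest), and that the first value seen wins a tie.

```python
# Assumed ranking, lowest to highest.
CERTAINTIES = ["possible", "likely", "confident", "certain"]


def pick_best(candidates):
    """Keep, per field, the candidate with the highest certainty.

    candidates is an iterable of (field, value, certainty) tuples.
    Returns a dict mapping field -> (value, certainty).
    """
    best = {}
    for field, value, certainty in candidates:
        rank = CERTAINTIES.index(certainty)
        current = best.get(field)
        # Replace only on a strictly higher rank, so ties keep the first value.
        if current is None or rank > CERTAINTIES.index(current[1]):
            best[field] = (value, certainty)
    return best


if __name__ == "__main__":
    found = [
        ("Homepage", "https://a.example", "possible"),
        ("Homepage", "https://b.example", "certain"),
        ("Name", "demo", "likely"),
    ]
    for field, (value, certainty) in pick_best(found).items():
        print(f"{field}: {value} ({certainty})")
```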
Interface
The ontologist provides a high-level Python API as well as two command-line tools, each of which writes a different output format:
- guess-upstream-metadata writes DEP-12-like YAML output
- autodoap writes DOAP files
For example, running guess-upstream-metadata on dulwich:
% guess-upstream-metadata
<string>:2: (INFO/1) Duplicate implicit target name: "contributing".
Name: dulwich
Repository: https://www.dulwich.io/code/
X-Security-MD: https://github.com/dulwich/dulwich/tree/HEAD/SECURITY.md
X-Version: 0.20.21
Bug-Database: https://github.com/dulwich/dulwich/issues
X-Summary: Python Git Library
X-Description: |
  This is the Dulwich project.
  It aims to provide an interface to git repos (both local and remote) that
  doesn't call out to git directly but instead uses pure Python.
X-License: Apache License, version 2 or GNU General Public License, version 2 or later.
Bug-Submit: https://github.com/dulwich/dulwich/issues/new
Lintian-Brush
lintian-brush can update DEP-12-style debian/upstream/metadata files, which hold information about the packaged upstream project, as well as the Homepage field in debian/control, based on information provided by the upstream ontologist. By default, it only imports data with the highest certainty; you can override this by specifying the --uncertain command-line flag.
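The effect of that default can be pictured as a simple filter. The sketch below is not lintian-brush code: select_for_import and its uncertain parameter are hypothetical names, and it assumes that "highest certainty" means the "certain" level.

```python
def select_for_import(metadata, uncertain=False):
    """Choose which fields would be written out.

    metadata maps field -> (value, certainty). With uncertain=False only
    fields marked "certain" are kept (an assumption about the default);
    with uncertain=True every field is kept.
    """
    if uncertain:
        return {field: value for field, (value, _) in metadata.items()}
    return {
        field: value
        for field, (value, certainty) in metadata.items()
        if certainty == "certain"
    }


if __name__ == "__main__":
    guessed = {
        "Repository": ("https://example.com/demo.git", "certain"),
        "Bug-Database": ("https://example.com/demo/issues", "likely"),
    }
    print(select_for_import(guessed))
    print(select_for_import(guessed, uncertain=True))
```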
[1] Obviously this won’t be able to describe the full licensing situation for many projects. Projects like scancode-toolkit are more appropriate for that.