This tool imports GitHub metadata from repositories into the Software Observatory database. It identifies the GitHub repositories listed in the database entries, retrieves metadata for each repository using the GitHub metadata API, and stores the retrieved metadata back in the database.
If you are looking for a tool to import metadata from a single GitHub repository directly, you can use the GitHub metadata importer itself; this importer relies on that endpoint internally.
The importer includes several safeguards to make long runs more robust:

- `--resume` support to continue interrupted runs
- local cache of successfully imported repositories
- local cache of failed repositories
- repository listing cache to avoid rebuilding the input list on every run
- retry support for previously failed repositories
- delays with jitter between requests
- exponential backoff for rate limiting and transient server errors
- JSONL run log for debugging and auditing
These mechanisms are especially useful to reduce the impact of 429 Too Many Requests errors and avoid repeating work after interruptions.
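The delay-and-backoff behaviour can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation; the function names and retry policy are assumptions:

```python
import random
import time

def backoff_delay(attempt: int, base_delay: float = 1.5, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: base * 2^attempt, capped, with full jitter."""
    ceiling = min(cap, base_delay * (2 ** attempt))
    # Full jitter: pick a random delay between 0 and the computed ceiling,
    # so many clients retrying at once do not synchronize their requests.
    return random.uniform(0, ceiling)

def fetch_with_retries(fetch, max_retries: int = 6, base_delay: float = 1.5):
    """Call `fetch()` until it succeeds or retries are exhausted.

    `fetch` is assumed to raise an exception on 429 or transient 5xx responses.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted, propagate the last error
            time.sleep(backoff_delay(attempt, base_delay))
```

The defaults mirror the importer's `--delay 1.5` and `--max-retries 6` settings; raising them (as in the "slower, safer" invocation below) trades speed for fewer 429 responses.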
The tool is written in Python 3.12 and requires the packages listed in requirements.txt.
Install dependencies with:

```
pip install -r requirements.txt
```

The tool requires the following environment variables to be set:
- `MONGO_HOST`: the hostname of the MongoDB server.
- `MONGO_PORT`: the port of the MongoDB server.
- `MONGO_USER`: the username for the MongoDB server.
- `MONGO_PWD`: the password for the MongoDB server.
- `MONGO_AUTH_SRC`: the authentication source for the MongoDB server.
- `MONGO_DB`: the name of the MongoDB database.
- `ALAMBIQUE`: the name of the database where the gathered metadata will be stored.
- `PRETOOLS`: the name of the Pretools database. The tool will read the list of repositories from this database.
- `GITHUB_TOKEN`: the user GitHub token to use for the GitHub metadata API. The token must have `read:packages` enabled.
Put these environment variables in a .env file in the root directory of the project.
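For example, a `.env` file might look like this (all values below are placeholders; replace them with your own):

```
# Placeholder values -- replace with your own deployment settings
MONGO_HOST=localhost
MONGO_PORT=27017
MONGO_USER=observatory_user
MONGO_PWD=change-me
MONGO_AUTH_SRC=admin
MONGO_DB=observatory
ALAMBIQUE=alambique
PRETOOLS=pretools
GITHUB_TOKEN=ghp_your_token_here
```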
To run the tool, execute the following command:

```
python3 main.py
```

To resume an interrupted run:
```
python3 main.py --resume
```

This skips repositories already marked as completed in the local import cache.
To resume and retry previous failures:

```
python3 main.py --resume --retry-failed
```

This retries repositories that failed in earlier runs while still skipping successful ones.
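The combination of `--resume` and `--retry-failed` can be thought of as a filter over the repository list. The sketch below assumes the import cache exposes sets of completed and failed repository URLs; the real cache file's layout may differ:

```python
def select_repos(repos, completed, failed, resume=False, retry_failed=False):
    """Filter the repository list according to the resume/retry flags.

    `completed` and `failed` are sets of repository URLs from the import
    cache (an assumed structure for illustration).
    """
    if not resume:
        return list(repos)      # no --resume: process everything
    selected = []
    for repo in repos:
        if repo in completed:
            continue            # --resume: skip already completed repos
        if repo in failed and not retry_failed:
            continue            # skip past failures unless --retry-failed
        selected.append(repo)
    return selected
```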
To refresh the cached repository listing:

```
python3 main.py --refresh-listing-cache
```

This rebuilds the list of repositories from PRETOOLS instead of using the local listing cache.
To limit the number of repositories processed:

```
python3 main.py --limit 20
```

For a slower, safer execution:

```
python3 main.py --resume --delay 3 --max-retries 8
```

The importer supports the following options:
- `--resume`: skip repositories already completed in the import cache.
- `--retry-failed`: when used with `--resume`, include repositories that failed in previous runs.
- `--cache-file`: path to the import cache file. Default: `github_import_cache.json`.
- `--listing-cache-file`: path to the repository listing cache file. Default: `repos_to_import.json`.
- `--refresh-listing-cache`: rebuild the repository list from PRETOOLS.
- `--run-log-file`: path to the JSONL run log file. Default: `import_run.jsonl`.
- `--delay`: base delay in seconds between requests. Default: 1.5.
- `--max-retries`: maximum number of attempts per repository request. Default: 6.
- `--limit`: maximum number of repositories to process in the current run.
During execution, the importer creates and updates a few local files:
* `repos_to_import.json`: cached list of repository URLs to process.
* `github_import_cache.json`: cache of completed and failed repositories.
* `import_run.jsonl`: append-only run log with one JSON record per processed repository.
These files allow the importer to resume work safely and avoid repeating already completed imports.
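Because the run log is append-only JSONL, it is easy to inspect after a run. The sketch below tallies records by a `status` field; that field name is an assumption, so adapt it to the actual record layout:

```python
import json
from collections import Counter
from pathlib import Path

def summarize_run_log(path):
    """Count run-log records by their (assumed) `status` field."""
    counts = Counter()
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue                # tolerate blank lines in the log
        record = json.loads(line)   # one JSON record per line (JSONL)
        counts[record.get("status", "unknown")] += 1
    return counts
```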