inab/github-importer

GitHub metadata importer

This tool imports GitHub metadata from repositories into the Software Observatory database. It identifies the GitHub repositories listed in the database entries, retrieves metadata for each repository using the GitHub metadata API, and stores the retrieved metadata back in the database.

If you are looking for a tool to import metadata from a single GitHub repository directly, you can use the GitHub metadata importer itself. In particular, this importer relies on this endpoint.

Features

The importer includes several safeguards to make long runs more robust:

  • --resume support to continue interrupted runs
  • local cache of successfully imported repositories
  • local cache of failed repositories
  • repository listing cache to avoid rebuilding the input list on every run
  • retry support for previously failed repositories
  • delays with jitter between requests
  • exponential backoff for rate limiting and transient server errors
  • JSONL run log for debugging and auditing

These mechanisms are especially useful for reducing the impact of 429 Too Many Requests responses and for avoiding repeated work after an interruption.
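
As an illustration of the delay, jitter, and exponential backoff ideas listed above, here is a minimal sketch. It is not the importer's actual code: the function names, the 25% jitter fraction, the 60-second cap, and the set of retryable status codes are assumptions made for the example. Only the default delay (1.5) and the default number of attempts (6) come from the options documented in this README.

```python
import random
import time

# Assumed set of statuses worth retrying: rate limiting plus transient server errors.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt, base_delay=1.5, cap=60.0, rng=random):
    """Return the wait in seconds before retrying after attempt `attempt` (0-based)."""
    # Exponential growth: base, 2*base, 4*base, ... limited by `cap`.
    exp = min(cap, base_delay * (2 ** attempt))
    # Add up to 25% random jitter so repeated runs do not retry in lockstep.
    return exp + rng.uniform(0, exp * 0.25)


def call_with_retries(request, max_retries=6, base_delay=1.5, sleep=time.sleep):
    """Call `request()` until it returns a non-retryable status or attempts run out.

    `request` is any zero-argument callable returning an HTTP status code.
    """
    status = None
    for attempt in range(max_retries):
        status = request()
        if status not in RETRYABLE_STATUSES:
            return status
        # Do not sleep after the final attempt.
        if attempt < max_retries - 1:
            sleep(backoff_delay(attempt, base_delay))
    return status
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting.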

Installation

The tool is written in Python 3.12 and requires the packages listed in requirements.txt.

Install dependencies with:

pip install -r requirements.txt

Configuration

The tool requires the following environment variables to be set:

  • MONGO_HOST: the hostname of the MongoDB server.
  • MONGO_PORT: the port of the MongoDB server.
  • MONGO_USER: the username for the MongoDB server.
  • MONGO_PWD: the password for the MongoDB server.
  • MONGO_AUTH_SRC: the authentication source for the MongoDB server.
  • MONGO_DB: the name of the MongoDB database.
  • ALAMBIQUE: the name of the database where the gathered metadata will be stored.
  • PRETOOLS: the name of the Pretools database. The tool will read the list of repositories from this database.
  • GITHUB_TOKEN: the GitHub user token used to authenticate against the GitHub metadata API. The token must have the read:packages scope enabled.

Put these environment variables in a .env file in the root directory of the project.
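
Before starting a long run, it can help to check that every variable listed above is set. The following is a minimal sketch of such a check, not part of the importer itself. The helper name `missing_settings` is hypothetical, and the sketch assumes the .env file has already been loaded into the process environment.

```python
import os

# Variable names taken from the configuration list in this README.
REQUIRED_VARS = [
    "MONGO_HOST",
    "MONGO_PORT",
    "MONGO_USER",
    "MONGO_PWD",
    "MONGO_AUTH_SRC",
    "MONGO_DB",
    "ALAMBIQUE",
    "PRETOOLS",
    "GITHUB_TOKEN",
]


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; any mapping can be passed instead.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list means all settings are present. Otherwise the returned names show which entries still need to be added to the .env file.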

Usage

To run the tool, execute the following command:

python3 main.py

To resume an interrupted run:

python3 main.py --resume

This skips repositories already marked as completed in the local import cache.

To resume and retry previous failures:

python3 main.py --resume --retry-failed

This retries repositories that failed in earlier runs while still skipping successful ones.
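
The interaction between --resume and --retry-failed can be sketched as a filter over the repository list. This is only one reading of the behaviour described above. The cache layout (a dictionary with "completed" and "failed" lists) and the function name are assumptions made for illustration, and the real cache file may be structured differently.

```python
def select_repositories(repos, cache, resume=False, retry_failed=False):
    """Return the repositories to process in this run, preserving input order.

    `cache` is assumed to look like {"completed": [...], "failed": [...]}.
    """
    if not resume:
        # Without --resume the cache is ignored and everything is processed.
        return list(repos)
    # With --resume, completed repositories are always skipped.
    skip = set(cache.get("completed", []))
    if not retry_failed:
        # Assumed: earlier failures are also skipped unless --retry-failed is given.
        skip |= set(cache.get("failed", []))
    return [repo for repo in repos if repo not in skip]
```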

To refresh the cached repository listing:

python3 main.py --refresh-listing-cache

This rebuilds the list of repositories from PRETOOLS instead of using the local listing cache.

To limit the number of repositories processed:

python3 main.py --limit 20

For a slower, safer execution:

python3 main.py --resume --delay 3 --max-retries 8

Command-line options

The importer supports the following options:

  • --resume: skip repositories already completed in the import cache.
  • --retry-failed: when used with --resume, include repositories that failed in previous runs.
  • --cache-file: path to the import cache file. Default: github_import_cache.json.
  • --listing-cache-file: path to the repository listing cache file. Default: repos_to_import.json.
  • --refresh-listing-cache: rebuild the repository list from PRETOOLS instead of using the local listing cache.
  • --run-log-file: path to the JSONL run log file. Default: import_run.jsonl.
  • --delay: base delay in seconds between requests. Default: 1.5.
  • --max-retries: maximum number of attempts per repository request. Default: 6.
  • --limit: maximum number of repositories to process in the current run.
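
For reference, the options above could be declared with argparse roughly as follows. This is a sketch rather than the importer's own parser. The flag names and defaults are the ones documented in this README, while the help strings and the `build_parser` name are illustrative.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(description="GitHub metadata importer")
    parser.add_argument("--resume", action="store_true",
                        help="skip repositories already completed in the import cache")
    parser.add_argument("--retry-failed", action="store_true",
                        help="with --resume, include repositories that failed earlier")
    parser.add_argument("--cache-file", default="github_import_cache.json",
                        help="path to the import cache file")
    parser.add_argument("--listing-cache-file", default="repos_to_import.json",
                        help="path to the repository listing cache file")
    parser.add_argument("--refresh-listing-cache", action="store_true",
                        help="rebuild the repository list from PRETOOLS")
    parser.add_argument("--run-log-file", default="import_run.jsonl",
                        help="path to the JSONL run log file")
    parser.add_argument("--delay", type=float, default=1.5,
                        help="base delay in seconds between requests")
    parser.add_argument("--max-retries", type=int, default=6,
                        help="maximum number of attempts per repository request")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of repositories to process")
    return parser
```

Parsing `["--resume", "--delay", "3", "--max-retries", "8"]` with this parser corresponds to the slower, safer invocation shown in the Usage section.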

Local cache files

During execution, the importer creates and updates a few local files:

  • repos_to_import.json: cached list of repository URLs to process.
  • github_import_cache.json: cache of completed and failed repositories.
  • import_run.jsonl: append-only run log with one JSON record per processed repository.

These files allow the importer to resume work safely and avoid repeating already completed imports.
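
To show what an append-only JSONL log with one record per repository looks like in practice, here is a small sketch. The record fields (timestamp, repo, status, error) and the function names are hypothetical, so the importer's actual records may contain different keys.

```python
import json
from datetime import datetime, timezone


def append_run_record(path, repo_url, status, error=None):
    """Append one JSON record for a processed repository as a single line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repo": repo_url,
        "status": status,
        "error": error,
    }
    # Mode "a" only ever adds to the end, so earlier records are never rewritten.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_run_log(path):
    """Read the log back as a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each line is a complete JSON document, a run interrupted partway through leaves earlier records readable, which is what makes this format convenient for debugging and auditing.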
