This tool imports GitHub metadata from repositories into the Software Observatory database. It identifies the GitHub repositories listed in the database entries, retrieves metadata for each repository using the GitHub metadata API, and stores the retrieved metadata back in the database.
If you are looking for a tool to import metadata from a single GitHub repository directly, you can use the GitHub metadata importer itself; this importer relies on that endpoint internally.
The importer includes several safeguards to make long runs more robust:

- `--resume` support to continue interrupted runs
- local cache of successfully imported repositories
- local cache of failed repositories
- repository listing cache to avoid rebuilding the input list on every run
- retry support for previously failed repositories
- delays with jitter between requests
- exponential backoff for rate limiting and transient server errors
- JSONL run log for debugging and auditing
These mechanisms are especially useful to reduce the impact of 429 Too Many Requests errors and avoid repeating work after interruptions.
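The delay-and-backoff behaviour can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation; the function names and retry policy are assumptions:

```python
import random
import time

def backoff_delay(attempt: int, base_delay: float = 1.5, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: base * 2^attempt, capped, with full jitter."""
    ceiling = min(cap, base_delay * (2 ** attempt))
    # Full jitter: pick a random delay between 0 and the computed ceiling,
    # so many clients retrying at once do not synchronize their requests.
    return random.uniform(0, ceiling)

def fetch_with_retries(fetch, max_retries: int = 6, base_delay: float = 1.5):
    """Call `fetch()` until it succeeds or retries are exhausted.

    `fetch` is assumed to raise an exception on 429 or transient 5xx responses.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted, propagate the last error
            time.sleep(backoff_delay(attempt, base_delay))
```

The defaults mirror the importer's `--delay 1.5` and `--max-retries 6` settings; raising them (as in the "slower, safer" invocation below) trades speed for fewer 429 responses.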
The tool is written in Python 3.12 and requires the packages listed in requirements.txt.
Install dependencies with:

```
pip install -r requirements.txt
```

The tool requires the following environment variables to be set:
- `MONGO_HOST`: the hostname of the MongoDB server.
- `MONGO_PORT`: the port of the MongoDB server.
- `MONGO_USER`: the username for the MongoDB server.
- `MONGO_PWD`: the password for the MongoDB server.
- `MONGO_AUTH_SRC`: the authentication source for the MongoDB server.
- `MONGO_DB`: the name of the MongoDB database.
- `ALAMBIQUE`: the name of the database where the gathered metadata will be stored.
- `PRETOOLS`: the name of the Pretools database. The tool will read the list of repositories from this database.
- `GITHUB_TOKEN`: the user GitHub token to use for the GitHub metadata API. The token must have `read:packages` enabled.
Put these environment variables in a .env file in the root directory of the project.
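For example, a `.env` file might look like this (all values below are placeholders; replace them with your own):

```
# Placeholder values -- replace with your own deployment settings
MONGO_HOST=localhost
MONGO_PORT=27017
MONGO_USER=observatory_user
MONGO_PWD=change-me
MONGO_AUTH_SRC=admin
MONGO_DB=observatory
ALAMBIQUE=alambique
PRETOOLS=pretools
GITHUB_TOKEN=ghp_your_token_here
```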
To run the tool, execute the following command:

```
python3 main.py
```

To resume an interrupted run:
```
python3 main.py --resume
```

This skips repositories already marked as completed in the local import cache.
To resume and retry previous failures:

```
python3 main.py --resume --retry-failed
```

This retries repositories that failed in earlier runs while still skipping successful ones.
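The combination of `--resume` and `--retry-failed` can be thought of as a filter over the repository list. The sketch below assumes the import cache exposes sets of completed and failed repository URLs; the real cache file's layout may differ:

```python
def select_repos(repos, completed, failed, resume=False, retry_failed=False):
    """Filter the repository list according to the resume/retry flags.

    `completed` and `failed` are sets of repository URLs from the import
    cache (an assumed structure for illustration).
    """
    if not resume:
        return list(repos)      # no --resume: process everything
    selected = []
    for repo in repos:
        if repo in completed:
            continue            # --resume: skip already completed repos
        if repo in failed and not retry_failed:
            continue            # skip past failures unless --retry-failed
        selected.append(repo)
    return selected
```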
To refresh the cached repository listing:

```
python3 main.py --refresh-listing-cache
```

This rebuilds the list of repositories from PRETOOLS instead of using the local listing cache.
To limit the number of repositories processed:

```
python3 main.py --limit 20
```

For a slower, safer execution:

```
python3 main.py --resume --delay 3 --max-retries 8
```

The importer supports the following options:
- `--resume`: skip repositories already completed in the import cache.
- `--retry-failed`: when used with `--resume`, include repositories that failed in previous runs.
- `--cache-file`: path to the import cache file. Default: `github_import_cache.json`.
- `--listing-cache-file`: path to the repository listing cache file. Default: `repos_to_import.json`.
- `--refresh-listing-cache`: rebuild the repository list from PRETOOLS.
- `--run-log-file`: path to the JSONL run log file. Default: `import_run.jsonl`.
- `--delay`: base delay in seconds between requests. Default: 1.5.
- `--max-retries`: maximum number of attempts per repository request. Default: 6.
- `--limit`: maximum number of repositories to process in the current run.
During execution, the importer creates and updates a few local files:
* `repos_to_import.json`: cached list of repository URLs to process.
* `github_import_cache.json`: cache of completed and failed repositories.
* `import_run.jsonl`: append-only run log with one JSON record per processed repository.
These files allow the importer to resume work safely and avoid repeating already completed imports.
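Because the run log is append-only JSONL, it is easy to inspect after a run. The sketch below tallies records by a `status` field; that field name is an assumption, so adapt it to the actual record layout:

```python
import json
from collections import Counter
from pathlib import Path

def summarize_run_log(path):
    """Count run-log records by their (assumed) `status` field."""
    counts = Counter()
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue                # tolerate blank lines in the log
        record = json.loads(line)   # one JSON record per line (JSONL)
        counts[record.get("status", "unknown")] += 1
    return counts
```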