harmstack

The root command. Without flags it prints help. Pass --haystack to run the Haystack benchmarking module directly without invoking a subcommand.
harmstack [flags]
Non-interactive example (Haystack module):
export HARMSTACK_API_KEY='<your_key>'
export TARGET_MODEL_API_KEY='<model_key>'

harmstack --haystack --consentandskip \
  --target-model-endpoint https://api.openai.com/v1/responses \
  --provider openai_responses \
  --benchmark-id 2 --benchmark-id 3 \
  --unit-count 1 --unit-count 1
Key flags: --haystack, --target-model-endpoint, --provider, --benchmark-id, --unit-count, --consentandskip
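When several benchmarks are run at once, the repeated --benchmark-id and --unit-count flags can be generated rather than typed by hand. The helper below is a minimal sketch in POSIX sh; benchmark_flags is a hypothetical name, not part of the harmstack CLI. It assumes that the Nth --unit-count applies to the Nth --benchmark-id, which the example above suggests but does not state.

```shell
# Print repeated --benchmark-id / --unit-count flags for "id:count" pairs.
# Assumption (not confirmed by the docs): the Nth --unit-count applies to
# the Nth --benchmark-id.
benchmark_flags() {
  ids=""
  counts=""
  for pair in "$@"; do
    ids="$ids --benchmark-id ${pair%%:*}"      # text before the colon
    counts="$counts --unit-count ${pair##*:}"  # text after the colon
  done
  flags="$ids$counts"
  # Strip the leading space left by the first append.
  printf '%s\n' "${flags# }"
}

# Dry run: print the flags for benchmark 2 (1 unit) and benchmark 3 (1 unit).
benchmark_flags 2:1 3:1
```

The output can be spliced into the command with an unquoted substitution, for example `harmstack --haystack --consentandskip $(benchmark_flags 2:1 3:1)` followed by the endpoint and provider flags. Word splitting is intentional here and safe only because the IDs and counts are plain integers.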

harmstack init

Launch the interactive wizard to create a new benchmarking job. The wizard prompts for your Harmstack API key (or reads HARMSTACK_API_KEY), verifies your account, and guides you through selecting a benchmark and configuring your model endpoint. Pass --consentandskip with the required flags to skip all prompts and run non-interactively.
harmstack init [flags]
Interactive:
export HARMSTACK_API_KEY='<your_key>'
harmstack init
Non-interactive:
export HARMSTACK_API_KEY='<your_key>'
export TARGET_MODEL_API_KEY='<model_key>'

harmstack init --consentandskip \
  --target-model-endpoint https://api.openai.com/v1/chat/completions \
  --provider openai \
  --model gpt-4o-mini \
  --benchmark-id 2 \
  --unit-count 5
Key flags: --target-model-endpoint, --target-model-api-key, --provider, --model, --benchmark-id, --unit-count, --consentandskip, --header

harmstack credits

Show the available credit balance on your account.
harmstack credits
Requires HARMSTACK_API_KEY to be set (or pass --harmstack-api-key).
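In scripts it can help to check for the key before calling the CLI, so that a missing variable gives a clear message. The following is a small sketch in POSIX sh; require_env is a hypothetical helper, not something harmstack provides, and the error text is illustrative.

```shell
# Return 0 if the named variable is set and non-empty, 1 otherwise.
# Hypothetical helper; pass only trusted variable names, since it uses eval.
require_env() {
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}

# Usage (commented out so the sketch runs without the CLI installed):
# require_env HARMSTACK_API_KEY && harmstack credits
```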

harmstack list-jobs

List your most recent benchmarking jobs with scores. Output includes job ID, passed/failed counts, score percentage, and benchmark count.
harmstack list-jobs [flags]
Examples:
# List the 10 most recent completed jobs (default)
harmstack list-jobs

# List up to 25 jobs including failed ones, as CSV
harmstack list-jobs --limit 25 --status all --format csv
Key flags:
| Flag | Default | Description |
| --- | --- | --- |
| --limit | 10 | Maximum number of jobs to return |
| --status | completed | Filter by status: completed, failed, or all |
| --format | table | Output format: table or csv |

harmstack show-job

Show metadata and scoring stats for a single job by its UUID.
harmstack show-job [job-id] [flags]
Examples:
# Pass the UUID as a positional argument
harmstack show-job 550e8400-e29b-41d4-a716-446655440000

# Pass it as a flag
harmstack show-job --job-id=550e8400-e29b-41d4-a716-446655440000
Output includes job ID, status, endpoint URL, benchmark counts (needle-annotated, non-annotated, total), and score.
Key flags: --job-id
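Because the job ID is a UUID, a script can reject malformed input before it calls the CLI. The sketch below is POSIX sh; is_uuid is a hypothetical helper, and it checks only the 8-4-4-4-12 hexadecimal shape seen in the examples above. How harmstack itself validates IDs is not documented here.

```shell
# Return 0 if the argument has the 8-4-4-4-12 hexadecimal UUID shape.
# Hypothetical helper; it checks format only, not whether the job exists.
is_uuid() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}

# Usage (commented out so the sketch runs without the CLI installed):
# is_uuid "$job_id" && harmstack show-job "$job_id"
```

The same check could be applied to both arguments of harmstack compare-jobs.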

harmstack compare-jobs

Compare two jobs side by side. Displays passed, failed, score, and total benchmark counts for each job in a single table.
harmstack compare-jobs [job-a] [job-b] [flags]
Examples:
# Positional arguments
harmstack compare-jobs 550e8400-e29b-41d4-a716-446655440000 6ba7b810-9dad-11d1-80b4-00c04fd430c8

# Named flags
harmstack compare-jobs \
  --job-a=550e8400-e29b-41d4-a716-446655440000 \
  --job-b=6ba7b810-9dad-11d1-80b4-00c04fd430c8
Key flags: --job-a, --job-b

harmstack stats

Show aggregate statistics across your most recent completed jobs: total job count, average score, and the best and worst performing jobs.
harmstack stats [flags]
Examples:
# Stats over the last 30 jobs (default)
harmstack stats

# Stats over up to 90 recent completed jobs dated on or after 2025-01-01
harmstack stats --limit 90 --since 2025-01-01
Key flags:
| Flag | Default | Description |
| --- | --- | --- |
| --limit | 30 | Number of recent completed jobs to include |
| --since | | Restrict to jobs on or after this date (YYYY-MM-DD) |