Commands - Harmstack

harmstack

The root command. With no flags, it prints help. Pass --haystack to run the Haystack benchmarking module directly, without a subcommand.

harmstack [flags]

Non-interactive example (Haystack module):

export HARMSTACK_API_KEY='<your_key>'
export TARGET_MODEL_API_KEY='<model_key>'

harmstack --haystack --consentandskip \
  --target-model-endpoint https://api.openai.com/v1/responses \
  --provider openai_responses \
  --benchmark-id 2 --benchmark-id 3 \
  --unit-count 1 --unit-count 1

Key flags: --haystack, --target-model-endpoint, --provider, --benchmark-id, --unit-count, --consentandskip
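The example above repeats --benchmark-id and --unit-count once per benchmark. When a script drives several benchmarks, a small helper can generate those repeated flags from id:count pairs. This is only a sketch: build_benchmark_flags is a hypothetical name, and it assumes the n-th --unit-count applies to the n-th --benchmark-id, which the example suggests but this page does not state.

```shell
# Hypothetical helper, not part of harmstack.
# Turns "id:count" pairs into repeated --benchmark-id / --unit-count flags,
# one token per line, in the same order as the example above.
# Assumption: harmstack matches the two flag lists by position.
build_benchmark_flags() {
  for pair in "$@"; do
    printf '%s\n' --benchmark-id "${pair%%:*}"   # text before the colon
  done
  for pair in "$@"; do
    printf '%s\n' --unit-count "${pair##*:}"     # text after the colon
  done
}

# Possible usage (values contain no spaces, so word splitting is safe):
#   harmstack --haystack --consentandskip \
#     --target-model-endpoint https://api.openai.com/v1/responses \
#     --provider openai_responses \
#     $(build_benchmark_flags 2:1 3:1)
```

Keeping each benchmark and its unit count in one pair may make it harder to pass mismatched numbers of the two flags.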

harmstack init

Launch the interactive wizard to create a new benchmarking job. The wizard prompts for your Harmstack API key (or reads HARMSTACK_API_KEY), verifies your account, and guides you through selecting a benchmark and configuring your model endpoint. Pass --consentandskip with the required flags to skip all prompts and run non-interactively.

harmstack init [flags]

Interactive:

export HARMSTACK_API_KEY='<your_key>'
harmstack init

Non-interactive:

export HARMSTACK_API_KEY='<your_key>'
export TARGET_MODEL_API_KEY='<model_key>'

harmstack init --consentandskip \
  --target-model-endpoint https://api.openai.com/v1/chat/completions \
  --provider openai \
  --model gpt-4o-mini \
  --benchmark-id 2 \
  --unit-count 5

Key flags: --target-model-endpoint, --target-model-api-key, --provider, --model, --benchmark-id, --unit-count, --consentandskip, --header
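Because --consentandskip suppresses the prompts, a missing key has no wizard to fall back on. A script may want to check the environment before calling harmstack init. The sketch below assumes both HARMSTACK_API_KEY and TARGET_MODEL_API_KEY are needed for your run, as in the non-interactive example; require_env is a hypothetical helper name, not a harmstack feature.

```shell
# Hypothetical preflight helper, not part of harmstack.
# Returns non-zero if any named variable is unset, empty, or not exported.
require_env() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what a child
    # process such as harmstack would inherit.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible usage:
#   require_env HARMSTACK_API_KEY TARGET_MODEL_API_KEY &&
#     harmstack init --consentandskip \
#       --target-model-endpoint https://api.openai.com/v1/chat/completions \
#       --provider openai --model gpt-4o-mini \
#       --benchmark-id 2 --unit-count 5
```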

harmstack credits

Show the available credit balance on your account.

harmstack credits

Requires HARMSTACK_API_KEY to be set (or pass --harmstack-api-key).

harmstack list-jobs

List your most recent benchmarking jobs with scores. Output includes job ID, passed/failed counts, score percentage, and benchmark count.

harmstack list-jobs [flags]

Examples:

# List the 10 most recent completed jobs (default)
harmstack list-jobs

# List up to 25 jobs including failed ones, as CSV
harmstack list-jobs --limit 25 --status all --format csv

Key flags:

Flag	Default	Description
`--limit`	`10`	Maximum number of jobs to return
`--status`	`completed`	Filter by status: `completed`, `failed`, or `all`
`--format`	`table`	Output format: `table` or `csv`
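The csv format lends itself to piping into other tools. The sketch below keeps only rows whose value in a named column falls below a threshold. It is not based on a documented schema: the column name is looked up in the header row, the name score used in the usage line is an assumption, and the parsing assumes fields contain no quoted commas. Check the header of your own output before relying on it.

```shell
# Hypothetical filter, not part of harmstack.
# Reads CSV on stdin; prints the header plus every row whose numeric
# value in column $1 is below threshold $2.
rows_below() {
  awk -F',' -v col="$1" -v max="$2" '
    NR == 1 {
      # Locate the requested column by its header name.
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      print
      next
    }
    # "+ 0" forces a numeric comparison and ignores a trailing "%".
    c && ($c + 0) < max
  '
}

# Possible usage (the column name "score" is an assumption):
#   harmstack list-jobs --limit 25 --status all --format csv | rows_below score 50
```

If the named column is not found, only the header line is printed, which makes a wrong column name easy to spot.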

harmstack show-job

Show metadata and scoring stats for a single job by its UUID.

harmstack show-job [job-id] [flags]

Examples:

# Pass the UUID as a positional argument
harmstack show-job 550e8400-e29b-41d4-a716-446655440000

# Pass it as a flag
harmstack show-job --job-id=550e8400-e29b-41d4-a716-446655440000

Output includes job ID, status, endpoint URL, benchmark counts (needle-annotated, non-annotated, total), and score.

Key flags: --job-id
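Since show-job identifies a job by its UUID, a script that receives job IDs from elsewhere may want to check their shape before making the call. The sketch below tests for the canonical 8-4-4-4-12 hexadecimal layout used in the examples above; is_uuid is a hypothetical helper, and harmstack's own validation rules may differ.

```shell
# Hypothetical helper, not part of harmstack.
# Succeeds when $1 looks like a canonical UUID (8-4-4-4-12 hex digits).
is_uuid() {
  printf '%s\n' "$1" | grep -Eqx \
    '[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}'
}

# Possible usage:
#   is_uuid "$job_id" && harmstack show-job "$job_id"
```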

harmstack compare-jobs

Compare two jobs side by side. Displays passed, failed, score, and total benchmark counts for each job in a single table.

harmstack compare-jobs [job-a] [job-b] [flags]

Examples:

# Positional arguments
harmstack compare-jobs 550e8400-e29b-41d4-a716-446655440000 6ba7b810-9dad-11d1-80b4-00c04fd430c8

# Named flags
harmstack compare-jobs \
  --job-a=550e8400-e29b-41d4-a716-446655440000 \
  --job-b=6ba7b810-9dad-11d1-80b4-00c04fd430c8

Key flags: --job-a, --job-b

harmstack stats

Show aggregate statistics across your most recent completed jobs: total job count, average score, and the best and worst performing jobs.

harmstack stats [flags]

Examples:

# Stats over the last 30 jobs (default)
harmstack stats

# Stats over up to 90 recent completed jobs dated on or after 2025-01-01
harmstack stats --limit 90 --since 2025-01-01

Key flags:

Flag	Default	Description
`--limit`	`30`	Number of recent completed jobs to include
`--since`	—	Restrict to jobs on or after this date (`YYYY-MM-DD`)
