# Simulacra Headless API
> Operating guide for the Simulacra Headless API.
## Canonical URLs
- [Interactive API docs](https://api.simulacra-data.com/__docs__/)
- [OpenAPI JSON, primary contract](https://api.simulacra-data.com/openapi.json)
- [Error catalog](https://api.simulacra-data.com/errors)
- [Full agent operating manual](https://api.simulacra-data.com/llms-full.txt)
- [Health check](https://api.simulacra-data.com/healthz)
Call the API over HTTPS and treat `/openapi.json` as the
machine-readable contract for this API.
## Core Workflow
1. If credentials do not exist: `POST /v1/signup`, poll
`/v1/signup/{request_id}`, then claim the one-time secret at
`POST /v1/credential-claims`.
2. Mint an Auth0 machine-to-machine bearer token.
3. `POST /v1/datasets` with a seed file and an `Idempotency-Key`.
4. Poll `/v1/jobs/{job_id}` when the response is async.
5. `GET /v1/datasets/{dataset_id}/schema` before conditioning.
6. `POST /v1/datasets/{dataset_id}/generations` with an
`Idempotency-Key`.
7. Poll `/v1/jobs/{job_id}` or `/v1/generations/{generation_id}`.
8. Download `artifact_url` after status is `ready` or `partial`.
## Retry And Polling
- Treat `202 Accepted` as progress, not failure.
- Poll every 2 seconds in examples; production agents should use
jittered exponential backoff with a deadline.
- Retry transient `429`, `500`, `502`, `503`, and `504` responses; for
POST requests, retry only with the same `Idempotency-Key`.
- Do not retry `400`, `401`, `402`, `403`, or `404` without changing
the request.
- When an error body includes `code`, fetch `/errors/{code}` before
deciding whether the request is retry-safe.
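The retry and polling rules above could be captured in a small helper. This is a minimal Python sketch, not part of the API: `should_retry` and `backoff_delays` are hypothetical names, and the base delay, cap, and deadline values are assumptions to tune for your workload.

```python
import random

# Transient statuses from the list above; everything else needs a changed request.
RETRYABLE = {429, 500, 502, 503, 504}


def should_retry(status):
    """Return True only for the transient statuses named in this guide."""
    return status in RETRYABLE


def backoff_delays(base=2.0, cap=60.0, deadline=600.0, rng=random.random):
    """Yield jittered, exponentially growing sleep times until the deadline budget is spent."""
    elapsed = 0.0
    attempt = 0
    while elapsed < deadline:
        # Exponential ceiling, capped so one wait never exceeds `cap` seconds.
        ceiling = min(cap, base * 2 ** min(attempt, 16))
        # Full jitter with a small floor, trimmed to the remaining budget.
        delay = min(max(0.1, rng() * ceiling), deadline - elapsed)
        yield delay
        elapsed += delay
        attempt += 1
```

Sleep for each yielded delay between polls or retries. When the generator is exhausted the deadline has passed, so the caller should stop and report the preserved `job_id` rather than loop forever.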
## Coding Agent Rules
- Treat `/openapi.json` as canonical.
- Prefer stable `operationId` values over path-derived names.
- Follow OpenAPI Links when choosing the next operation.
- Generate clients from `/openapi.json` when useful.
- Reuse helper patterns for token minting, idempotent POST retries,
polling, schema-first conditions, downloads, and problem-code handling.
- Treat operations with `x-openai-isConsequential: true` as
customer-affecting actions that require explicit user intent.
- Use idempotency keys on POST retries.
- Do not guess categorical levels or numeric ranges; fetch schema.
- Categorical condition values are desired outcome shares for each
level (the Quickstart passes fractions such as `0.55`), not internal
model parameters.
- Numeric conditions require both `min` and `max`.
- Tight scenarios can produce fewer rows than requested even when
quota remains; usage-billing is counted on generated rows rather
than requested rows.
- Preserve `X-Request-Id`, body `request_id`, `job_id`, `dataset_id`,
and `generation_id` for support.
- Preserve problem `code` and `type`; they are stable diagnostics.
- Never log bearer tokens, client secrets, customer data, claim tokens,
or artifact URLs.
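The schema-first rules above lend themselves to a local check before a generation request is sent. The sketch below is an illustration under assumptions: `validate_conditions` is a hypothetical helper, the schema column shape (`name`, `levels`) and the condition shape (`categorical`, `numeric`) are taken from the Quickstart examples, and the API itself remains the authority on what is valid.

```python
def validate_conditions(conditions, schema_columns):
    """Return a list of problems; an empty list means these local checks passed."""
    by_name = {col["name"]: col for col in schema_columns}
    problems = []

    # Numeric conditions: column must exist and carry both bounds.
    for column, bounds in conditions.get("numeric", {}).items():
        if column not in by_name:
            problems.append(f"numeric column {column!r} is not in the schema")
        if "min" not in bounds or "max" not in bounds:
            problems.append(f"numeric column {column!r} needs both min and max")
        elif bounds["min"] > bounds["max"]:
            problems.append(f"numeric column {column!r} has min greater than max")

    # Categorical conditions: column and every level must come from the schema.
    for column, shares in conditions.get("categorical", {}).items():
        if column not in by_name:
            problems.append(f"categorical column {column!r} is not in the schema")
            continue
        levels = set(by_name[column].get("levels") or [])
        for level in shares:
            if level not in levels:
                problems.append(f"level {level!r} is not in schema column {column!r}")
    return problems
```

Running this against the response of `GET /v1/datasets/{dataset_id}/schema` catches guessed names and missing bounds before they turn into a `400`.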
## Security And Retention Defaults
- Access is company-approved and authenticated with Auth0 M2M tokens.
- SLA, privacy, data-processing, and security-documentation terms are
governed by each customer's enterprise agreement.
- Dataset and artifact retention are explicit API lifecycle concepts;
do not assume indefinite storage.
## Overview And Examples
## What Is Simulacra?
Simulacra is a research simulation and what-if scenario modeling
platform for consumer and market research teams. It augments existing
studies with high-fidelity synthetic data so teams can expand sample
sizes, rebalance cohorts, explore low-incidence audiences, and build
scenario models from the data they already trust.
The differentiator is conditioning: instead of only asking for more
rows, you can ask what the full dataset should look like under a
specific desired outcome mix, such as a premium-heavy segment, a
younger target audience, or a high-intent buying scenario. Simulacra
then generates a coherent synthetic dataset around that scenario,
subject to feasibility under the trained model.
The Headless API exposes that workflow programmatically: upload a seed
dataset, wait for training, generate scenario-conditioned synthetic
rows, and download the result as Parquet, CSV, or Arrow.
## Security, Compliance, And Retention
This API serves approved company tenants, not anonymous public
traffic. Resource routes require Auth0 machine-to-machine bearer
tokens, operator actions are audited, and production infrastructure
is monitored with security alerts and health probes.


Simulacra maintains SOC 2 audited and ISO/IEC 27001 certified
controls for this API and the broader platform.
- **Authentication:** OAuth2 client-credentials through Auth0. Treat
`client_secret` as a production secret and store it in your secret
manager.
- **Authorization:** credentials are tenant-scoped and tied to a
Simulacra-approved company. Sales and support users do not handle
customer `client_secret` values.
- **Transport and storage:** API traffic uses TLS. Retained customer
artifacts are encrypted at rest; standard managed mode uses
Simulacra-managed controls, while enterprise storage mode can
deliver artifacts through customer-controlled storage/key paths.
- **Default retention:** trained dataset artifacts and generated
outputs default to 24-hour retention. Explicit extension is bounded
by a 7-day maximum continuous dataset retention window. Managed
artifact download URLs are short-lived, with a 15-minute default.
- **Delete semantics:** `DELETE /v1/datasets/{dataset_id}` removes
active dataset access and associated retrievable dataset artifacts
from the API surface; generation artifacts expire on their own
retention windows.
- **Secrets and claims:** approved signup and rotation secrets are
delivered through encrypted one-time credential claims. Claim
tokens expire and cannot be reused after the secret is claimed.
- **Reliability:** long-running work is represented as async jobs.
Preserve `X-Request-Id`, body `request_id`, `job_id`, `dataset_id`,
and `generation_id` in your own support logs.
- **SLA and procurement:** these defaults are platform controls.
Uptime commitments, support response targets, data-processing
terms, and audit report access are governed by your enterprise
agreement with Simulacra.
## End-To-End Setup Flow
Use this as the happy path for a first integration. The endpoint
pages below are detailed references; this flow shows how their
values connect.
1. Request access with `POST /v1/signup`; save the returned
`request_id`. Re-submitting the same contact email returns
the existing pending or approved request instead of creating
a second queue entry.
2. Poll `GET /v1/signup/{request_id}` until the request is
approved.
3. Exchange the approved `credential_claim_token` at
`POST /v1/credential-claims`; store the returned
`client_secret` immediately.
4. Mint an Auth0 bearer token with the client credentials.
5. Upload a seed dataset with `POST /v1/datasets`; save
`dataset_id` and poll `job_id` when the response is async.
6. Inspect `GET /v1/datasets/{dataset_id}/schema`; use this
cleaned schema, not your original headers, to build
conditions.
7. Generate synthetic rows with
`POST /v1/datasets/{dataset_id}/generations`; save
`generation_id` and poll `job_id` when needed.
8. Fetch `GET /v1/generations/{generation_id}` until status is
`ready` or `partial` and `artifact_url` is present. `partial`
means the scenario was valid but fewer rows were feasible than
requested; usage-billing is counted on generated rows rather
than requested rows.
9. Download the artifact. Managed-mode URLs point back to this
API and still require the bearer token; enterprise URLs may be
absolute customer-storage URLs.
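Step 9 distinguishes relative managed-mode URLs from absolute enterprise URLs. One cautious way to handle both is sketched below. `artifact_request` is a hypothetical helper, and the choice to withhold the bearer token from hosts other than the API is an assumption made here for safety, not documented API behavior; confirm it against your enterprise storage setup.

```python
from urllib.parse import urljoin, urlsplit


def artifact_request(api_base, artifact_url, token):
    """Return (absolute_url, headers) for downloading an artifact."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the API base.
    absolute = urljoin(api_base.rstrip("/") + "/", artifact_url)
    # Assumption: only URLs served by the API host should receive the token.
    same_host = urlsplit(absolute).netloc == urlsplit(api_base).netloc
    headers = {"Authorization": f"Bearer {token}"} if same_host else {}
    return absolute, headers
```

The returned pair can be passed straight to an HTTP client. Keep both the URL and the headers out of logs, since artifact URLs and bearer tokens are treated as sensitive in this guide.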
## What's New
- 2026-05-05: response identifier fields such as `dataset_id`,
`generation_id`, `job_id`, and `artifact_url` are JSON scalars
as documented. If you tested against an earlier preview and
added a workaround such as `response['dataset_id'][0]`, remove
it before continuing.
## Versioning And Changelog Policy
The `/v1/*` contract is stable for production integrations.
Simulacra may add optional fields, new enum values, new endpoints,
or richer examples without a version bump. Planned breaking changes
get at least 30 days' notice or ship on a future `/v2` surface.
Security, legal, or emergency reliability fixes may move faster,
but should include direct customer communication and a clear
rollback or migration path when possible.
Customer-visible contract changes are listed in **What's New** above.
If your tooling consumes `/openapi.json`, diff the spec before
deploy and treat unknown new fields as forward-compatible.
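Diffing the spec before deploy can be as simple as comparing operations between two saved copies of `/openapi.json`. The following Python sketch assumes both documents are already parsed into dictionaries; `operations` and `diff_operations` are hypothetical helper names, and the comparison covers only paths, methods, and `operationId` values rather than full schemas.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations(spec):
    """Map 'METHOD /path' to operationId for every operation in an OpenAPI document."""
    ops = {}
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            # Path items can also hold non-operation keys such as `parameters`.
            if method.lower() in HTTP_METHODS and isinstance(op, dict):
                ops[f"{method.upper()} {path}"] = op.get("operationId")
    return ops


def diff_operations(old_spec, new_spec):
    """Report operations that were removed, added, or given a new operationId."""
    old, new = operations(old_spec), operations(new_spec)
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "renamed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }
```

An empty `removed` and `renamed` list suggests the change is additive at the operation level; anything else is worth checking against **What's New** before deploying.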
## Client Integration
`/openapi.json` is the canonical machine-readable contract for
generated clients and API tooling. Direct HTTPS clients should
follow the examples below and the documented helper patterns for
Auth0 token minting, signup polling, one-time credential claims,
idempotent retries, async job polling, schema-first conditions,
artifact download, and problem-code classification.
Keep generated clients thin: preserve raw response fields, pass
through `X-Request-Id` and problem `code` values, and put polling,
retry, schema-resolution, and download behavior in a small helper
layer owned by your application.
## Common Mistakes To Avoid
- Do not build conditions from original column names. Training can
rename columns into identifier-safe form, for example
`purchase_intent` may become `purchase.intent` depending on the
cleaning path. Always GET the schema first.
- Cleaning can drop low-signal columns, near-zero-variance columns,
and rare categorical levels that are too sparse to model reliably. If a
column or level is not in the schema response, do not condition
on it.
- `credential_claim_token` is one-time-use. Do not close the
response before storing the returned `client_secret` in your
secret manager.
- Use an `Idempotency-Key` on every POST retry. Retrying without
one can create duplicate work and duplicate usage charges.
- Tight scenarios can produce fewer rows than requested even when
quota remains; Simulacra will never return more rows than
`row_count`. Usage-billing is counted on generated rows rather
than requested rows.
- Do not put bearer tokens, client secrets, artifact URLs, or
customer data in chat, browser consoles, notebooks shared with
third parties, or application logs.
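The `Idempotency-Key` rule above only helps if the same key is reused across every retry of one logical request. A minimal Python sketch of that pattern follows. `post_with_retries` and the `send` callable are hypothetical; `send` stands in for whatever HTTP call your client makes and is assumed to return a `(status, body)` pair.

```python
import time
import uuid

TRANSIENT = {429, 500, 502, 503, 504}


def post_with_retries(send, max_attempts=4, delay=2.0, sleep=time.sleep):
    """Call send(headers) until it stops failing transiently, reusing one key."""
    # Mint the key once per logical operation, never once per attempt.
    key = str(uuid.uuid4())
    headers = {"Idempotency-Key": key}
    for attempt in range(1, max_attempts + 1):
        status, body = send(headers)
        if status not in TRANSIENT or attempt == max_attempts:
            return status, body, key
        # Fixed delay keeps the sketch short; production code should back off with jitter.
        sleep(delay)
```

Returning the key alongside the response makes it easy to store with `job_id` and `request_id`, so a later manual retry can reuse it instead of creating new work.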
## Retries, Quotas, And Billing
- `202 Accepted` is normal for dataset training and generation.
Poll `/v1/jobs/{job_id}` every 2 seconds for quickstarts; production
clients should use jittered exponential backoff with a deadline.
- Retry `POST /v1/datasets` and
`POST /v1/datasets/{dataset_id}/generations` only with an
`Idempotency-Key`. Reusing the same key makes the retry safe;
changing the key creates new work.
- Retry transient `429`, `500`, `502`, `503`, and `504` responses
with backoff. Do not retry `400`, `401`, `402`, `403`, or `404`
without changing the request.
- `402` means the request exceeds your company's active row
subscription or request-cap.
- Per-request row caps are safety limits. Tight scenarios can still
produce fewer rows than requested even when quota remains;
usage-billing is counted on generated rows rather than requested
rows.
## Error Catalog
Problem responses include `type`, `title`, `status`, and `detail`.
When the API can classify the failure, the body also includes a stable
`code` such as `simio_unknown_condition_column`, and `type` points to
`https://api.simulacra-data.com/errors/{code}`. Open that URL for
the cause, fix, retryability, and support guidance. Preserve
`X-Request-Id` and response-body `request_id` values when contacting
support.
Error codes are part of the v1 contract. New codes may be added;
renames or removals require a migration window.
- Catalog index: `/errors`
- Example: `/errors/simio_unknown_condition_column`
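A client can turn a problem response into a compact support record using the fields described above. This Python sketch is illustrative: `summarize_problem` is a hypothetical helper, and it assumes the response body has already been parsed as JSON and the headers are available as a mapping.

```python
ERRORS_BASE = "https://api.simulacra-data.com/errors"


def summarize_problem(body, headers):
    """Collect the stable diagnostics from a problem response for logs and support."""
    code = body.get("code")
    return {
        "status": body.get("status"),
        "title": body.get("title"),
        "detail": body.get("detail"),
        "type": body.get("type"),
        "code": code,
        # The catalog page exists only when the API could classify the failure.
        "catalog_url": f"{ERRORS_BASE}/{code}" if code else None,
        "request_id": body.get("request_id"),
        # Plain dicts are case-sensitive; HTTP client header objects often are not.
        "x_request_id": headers.get("X-Request-Id"),
    }
```

The resulting dictionary contains no tokens or customer data, so it is a reasonable thing to log and to attach to a support request.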
## Troubleshooting
- **401 Unauthorized:** mint a fresh Auth0 token and verify the
audience is `https://api.simulacra-data.com`. The Authorize
panel and the `Authorization: Bearer …` header expect the JWT
returned by Auth0 (begins with `eyJ…`), NOT your `client_secret`.
See *Authorize The Interactive Panel* below for the exchange.
If this happens during upload, reselect the seed file before
retrying.
- **400 request body is empty:** set `Content-Type: application/json`
for JSON endpoints and use multipart form-data only for dataset
uploads.
- **400 unknown condition column or level:** call
`/v1/datasets/{dataset_id}/schema` and rebuild conditions from the
cleaned schema. Original seed names may have been normalized.
- **202 keeps polling:** keep polling until `completed`, `failed`,
`expired`, or `cancelled`; use a deadline and preserve `job_id`.
- **partial generation:** the scenario was feasible only for a subset
of the requested rows. Inspect `rows_generated` before using the
artifact downstream.
- **404 on copied IDs:** identifier fields are JSON strings. If your
client still indexes `[0]`, it may be sending a one-character ID.
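For the partial-generation case above, a small check before downstream use can make the shortfall explicit. The sketch below is hypothetical: `generation_fill` is not an API function, and it assumes `rows_generated` appears in the generation metadata alongside `status`.

```python
def generation_fill(meta, requested_rows):
    """Return (rows_generated, fraction_of_request) for a finished generation."""
    status = meta.get("status")
    if status not in ("ready", "partial"):
        raise ValueError(f"generation is not finished: status={status!r}")
    # Assumption: the metadata object carries rows_generated as a number.
    generated = int(meta["rows_generated"])
    return generated, generated / requested_rows
```

A caller might require a minimum fraction before accepting a `partial` artifact, or loosen the scenario and regenerate when the fraction is too low.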
## Request Access
API access is request-and-approve. The credentials used in the
Quickstart work only after your tenant has been approved.
1. `POST /v1/signup` with your `company_name` and `contact_email`.
No login is required for this access request; leave the
Authorization field blank. The response includes a `request_id`.
2. `GET /v1/signup/{request_id}` returns `pending`, `approved`, or
`declined`. This check is also open because credentials do not
exist until the request is approved.
3. Once Simulacra approves, the status response includes a
`client_id` and a one-time `credential_claim_token`.
4. `POST /v1/credential-claims` with that token returns the
`client_secret` exactly once, plus the Auth0 token URL and
audience. Store it in your secret manager immediately.
```sh
RESP=$(curl -sS -X POST "https://api.simulacra-data.com/v1/signup" \
-H "content-type: application/json" \
-d '{
"company_name": "Acme Research",
"contact_email": "data-science@acme.example"
}')
# 202 Accepted for a new request, or 200 OK if this email already
# has a pending or approved request. Both shapes include request_id.
# Save request_id; the polling URL needs it verbatim. Valid format
# is ^req_[A-Za-z0-9]{1,64}$ — no dots, dashes, or whitespace.
REQUEST_ID=$(echo "${RESP}" | jq -r .request_id)
curl -sS "https://api.simulacra-data.com/v1/signup/${REQUEST_ID}"
# -> pending until an operator approves; then status flips to
# "approved" and includes `client_id` plus a one-time
# `credential_claim_token`.
APPROVAL=$(curl -sS "https://api.simulacra-data.com/v1/signup/${REQUEST_ID}")
CLAIM_TOKEN=$(echo "${APPROVAL}" | jq -r .credential_claim_token)
curl -sS -X POST "https://api.simulacra-data.com/v1/credential-claims" \
-H "content-type: application/json" \
-d "$(jq -nc --arg token "${CLAIM_TOKEN}" '{claim_token: $token}')"
# -> returns client_id, client_secret, token_url, audience, grant_type.
# The claim token is one-time-use; put client_secret in your
# secret manager, not in source code or chat.
```
If you are testing this from the interactive docs panel below, click
**TRY** on `POST /v1/signup`, fill the `company_name` and
`contact_email` fields, and **leave the Authorization field blank** —
this is the access-request step before credentials exist.
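The `request_id` format noted in the shell example can be checked client-side before building the polling URL. This is a small Python sketch; `signup_status_url` is a hypothetical helper, and the pattern is copied from the comment in the example above.

```python
import re

# Same pattern as the shell comment; fullmatch() anchors both ends.
REQUEST_ID_RE = re.compile(r"req_[A-Za-z0-9]{1,64}")


def signup_status_url(api_base, request_id):
    """Build the signup polling URL, rejecting malformed request IDs early."""
    if not REQUEST_ID_RE.fullmatch(request_id):
        raise ValueError(f"unexpected request_id format: {request_id!r}")
    return f"{api_base.rstrip('/')}/v1/signup/{request_id}"
```

Using `fullmatch` instead of `match` with `$` avoids accepting an ID that ends in a stray newline, which is a common artifact of shell pipelines.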
## Authorize The Interactive Panel
The interactive docs authenticate with the **HTTP Bearer** field
inside the **AUTHENTICATION** panel. That field expects an Auth0 JWT
access token, not your `client_secret`. JWTs always start with `eyJ`
and contain two dots; your `client_secret` has no fixed prefix and
is a single high-entropy ~64-character string with no dots. Paste
the wrong one and every protected call returns 401.
Click **AUTHENTICATION** in the left navigation. If you already
have a JWT, paste it into the HTTP Bearer field. If you only have
your `client_id` and `client_secret`, use the *Exchange credentials
and fill HTTP Bearer* form in that same panel. It exchanges the
credentials server-side and loads the JWT into HTTP Bearer for you.
Try-It on protected endpoints then succeeds. The token is valid for
~24 hours; rerun the form when it expires.
If you prefer the command line:
```sh
ACCESS_TOKEN=$(curl -sS -X POST "https://simulacra-data.us.auth0.com/oauth/token" \
-H "content-type: application/json" \
-d '{
"client_id": "YOUR_CLIENT_ID",
"client_secret": "YOUR_CLIENT_SECRET",
"audience": "https://api.simulacra-data.com",
"grant_type": "client_credentials"
}' | jq -r .access_token)
echo "$ACCESS_TOKEN"
```
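Before pasting or sending the value, a quick shape check can catch the JWT-versus-`client_secret` mix-up described above. The Python sketch below only tests the shape stated in this section (an `eyJ` prefix and two dots); `looks_like_jwt` and `bearer_header` are hypothetical names, and the check does not validate the token's signature or expiry.

```python
def looks_like_jwt(value):
    """Return True when the value has the JWT shape described in this section."""
    return value.startswith("eyJ") and value.count(".") == 2


def bearer_header(value):
    """Build the Authorization header, refusing values that are not JWT-shaped."""
    if not looks_like_jwt(value):
        # Never echo the value itself: it may be a client_secret.
        raise ValueError("value does not look like a JWT; did you pass client_secret?")
    return {"Authorization": f"Bearer {value}"}
```

Failing fast here turns a confusing run of `401` responses into one clear local error, without writing the secret to any log.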
### Rotating or replacing your client_secret
If you still have a working `client_secret` (or a valid bearer
token minted from it), use `POST /v1/credential-rotations` — it
rotates the secret at Auth0 and returns a one-time
`credential_claim_token` you redeem at `POST /v1/credential-claims`
for the new `client_secret`. The endpoint is authenticated; the
rate limit is three rotations per 24 hours per client.
In the interactive docs, `POST /v1/credential-rotations` gets a
per-tab `Idempotency-Key` automatically. If the browser reloads
or the response disappears before you redeem the claim token, retry
the same operation in that tab; the API returns the same
`credential_claim_token` without rotating Auth0 again.
If you have lost the `client_secret` entirely and cannot
authenticate, email
[support@simulacra-data.com](mailto:support@simulacra-data.com)
and reference your `client_id`. Simulacra will rotate operator-side
and deliver the new `credential_claim_token` over a secure channel.
Resubmitting the signup form does NOT re-issue a
`credential_claim_token` once your initial credential has been
claimed.
## Quickstart
Pick the language tab that matches your stack. Each script below is a
complete, runnable end-to-end flow: it mints a bearer token, uploads a
toy seed dataset, polls until training finishes, fetches the trained
schema, generates scenario-conditioned synthetic rows, downloads the
artifact, and reads it back. The flow is identical across languages —
only the syntax changes — and the conditioning request body is the
same JSON structure everywhere.
The scenario examples are intentionally built from normal client-side
objects — pandas data frames, R data frames, Julia DataFrames, or a
shell variable for SPSS automation. Your HTTP client serializes those
objects to JSON; you should not be hand-maintaining JSON files in a
production integration.
Always fetch `/v1/datasets/{dataset_id}/schema` before building
conditions. The trained schema is the customer-facing contract
after cleaning: names may be normalized, columns may be dropped,
and rare categorical levels may be removed.
The examples below resolve condition columns from the returned
schema before submitting the generation request.
All four scripts read these environment variables:
```sh
export SIMIO_CLIENT_ID="..."
export SIMIO_CLIENT_SECRET="..."
export SIMIO_AUTH0_DOMAIN="simulacra-data.us.auth0.com"
export SIMIO_AUTH0_AUDIENCE="https://api.simulacra-data.com"
export SIMIO_API_BASE="https://api.simulacra-data.com"
```
Jump to: [Python](#quickstart-python) · [R](#quickstart-r) ·
[Julia](#quickstart-julia) · [SPSS](#quickstart-spss)
### Python
Uses `requests` for HTTP, `pandas` + `numpy` for seed/scenario
construction, and `pyarrow` (a `pandas` extra) for Parquet I/O.
```python
import os, time, requests
import numpy as np
import pandas as pd

AUTH0_DOMAIN = os.environ["SIMIO_AUTH0_DOMAIN"]
AUTH0_AUDIENCE = os.environ["SIMIO_AUTH0_AUDIENCE"]
CLIENT_ID = os.environ["SIMIO_CLIENT_ID"]
CLIENT_SECRET = os.environ["SIMIO_CLIENT_SECRET"]
API_BASE = os.environ["SIMIO_API_BASE"]

# 1. Mint a bearer token.
token = requests.post(
    f"https://{AUTH0_DOMAIN}/oauth/token",
    json={
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "audience": AUTH0_AUDIENCE,
        "grant_type": "client_credentials",
    },
    timeout=30,
).json()["access_token"]
auth = {"Authorization": f"Bearer {token}"}

# 2. Build a toy seed dataset and write it to seed.csv.
rng = np.random.default_rng(7)
n = 800
df = pd.DataFrame({
    "age": rng.integers(18, 66, n),
    "segment": rng.choice(["Value", "Mainstream", "Premium"],
                          n, p=[0.35, 0.45, 0.20]),
    "channel": rng.choice(["Retail", "Online", "Club"],
                          n, p=[0.50, 0.35, 0.15]),
    "region": rng.choice(["Northeast", "South", "Midwest", "West"], n),
})
intent = 42.0
intent += (df["segment"] == "Premium") * 18
intent += (df["channel"] == "Online") * 8
intent += ((df["age"] - 35) / 3).clip(-8, 8)
intent += rng.normal(0, 10, n)
df["purchase_intent"] = intent.round().clip(0, 100).astype(int)
df.to_csv("seed.csv", index=False)

# Helpers: 202 Accepted is normal for training and generation.
# This quickstart uses a simple 2s poll loop; production clients
# should add jittered exponential backoff and a deadline appropriate
# for their workload.
def poll_job(job_id, field, deadline_seconds=600):
    deadline = time.time() + deadline_seconds
    while time.time() < deadline:
        body = requests.get(f"{API_BASE}/v1/jobs/{job_id}",
                            headers=auth, timeout=30).json()
        if body["status"] in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"job {job_id} {body['status']}: {body}")
        if body["status"] == "completed":
            value = body.get(field)
            if value:
                return value
            raise RuntimeError(f"job {job_id} completed without {field}: {body}")
        time.sleep(2)
    raise TimeoutError(f"job {job_id} did not complete before deadline")

def wait_for_dataset(upload):
    if upload.get("status") == "ready" and upload.get("dataset_id"):
        return upload["dataset_id"]
    return poll_job(upload["job_id"], "dataset_id")

def wait_for_generation(initial):
    generation_id = initial.get("generation_id")
    if initial.get("status") == "processing" and initial.get("job_id"):
        generation_id = poll_job(initial["job_id"], "generation_id")
    if not generation_id:
        raise RuntimeError(f"generation response lacked generation_id: {initial}")
    deadline = time.time() + 600
    while time.time() < deadline:
        meta = requests.get(
            f"{API_BASE}/v1/generations/{generation_id}",
            headers=auth, timeout=30,
        ).json()
        if meta.get("status") in ("ready", "partial") and meta.get("artifact_url"):
            return meta
        if meta.get("status") == "failed":
            raise RuntimeError(f"generation failed: {meta}")
        time.sleep(2)
    raise TimeoutError(f"generation {generation_id} did not produce an artifact")

# 3. Upload seed + train. Returns either the dataset_id directly
# (synchronous) or a 202 with job_id (asynchronous).
with open("seed.csv", "rb") as fh:
    upload = requests.post(
        f"{API_BASE}/v1/datasets",
        headers={**auth,
                 "Idempotency-Key": f"dataset-{int(time.time())}"},
        files={"seed_file": ("seed.csv", fh, "text/csv")},
        data={"display_name": "Quickstart seed", "wait_seconds": "10"},
        timeout=60,
    ).json()
dataset_id = wait_for_dataset(upload)

# 4. Inspect the trained schema. Use the returned column names,
# levels, and ranges to build conditions; do not guess from the
# original seed headers.
schema = requests.get(
    f"{API_BASE}/v1/datasets/{dataset_id}/schema",
    headers=auth, timeout=30,
).json()
schema_columns = schema.get("schema", [])
schema_by_name = {col["name"]: col for col in schema_columns}

def resolve_column(*candidates):
    for name in candidates:
        if name in schema_by_name:
            return name
    normalized = {name.replace("_", ".").lower(): name for name in schema_by_name}
    for name in candidates:
        hit = normalized.get(name.replace("_", ".").lower())
        if hit:
            return hit
    raise KeyError(
        f"None of {candidates} is in the cleaned schema. Available: {list(schema_by_name)}"
    )

def require_levels(column, levels):
    available = set(schema_by_name[column].get("levels") or [])
    missing = set(levels) - available
    if missing:
        raise KeyError(
            f"Schema column {column!r} is missing levels {sorted(missing)}. "
            f"Available levels: {sorted(available)}"
        )

segment_col = resolve_column("segment")
channel_col = resolve_column("channel")
age_col = resolve_column("age")
intent_col = resolve_column("purchase_intent", "purchase.intent")
require_levels(segment_col, ["Premium", "Mainstream", "Value"])
require_levels(channel_col, ["Online", "Retail", "Club"])

# 5. Define a scenario as ordinary data frames, then convert to the
# API condition object. Categorical target_share values are desired
# outcome percentages, subject to feasibility jitter.
categorical_targets = pd.DataFrame([
    {"column": segment_col, "level": "Premium", "target_share": 0.55},
    {"column": segment_col, "level": "Mainstream", "target_share": 0.35},
    {"column": segment_col, "level": "Value", "target_share": 0.10},
    {"column": channel_col, "level": "Online", "target_share": 0.70},
    {"column": channel_col, "level": "Retail", "target_share": 0.20},
    {"column": channel_col, "level": "Club", "target_share": 0.10},
])
numeric_ranges = pd.DataFrame([
    {"column": age_col, "min": 25, "max": 44},
    {"column": intent_col, "min": 70, "max": 100},
])

def categorical_conditions(targets):
    return {
        column: dict(zip(group["level"], group["target_share"]))
        for column, group in targets.groupby("column", sort=False)
    }

def numeric_conditions(ranges):
    return {
        row.column: {"min": row.min, "max": row.max}
        for row in ranges.itertuples(index=False)
    }

scenario = {
    "row_count": 5000,
    "output_format": "parquet",
    "seed": 20260430,
    "wait_seconds": 20,
    "conditions": {
        "categorical": categorical_conditions(categorical_targets),
        "numeric": numeric_conditions(numeric_ranges),
    },
}

# 6. Generate the scenario-conditioned synthetic artifact.
gen = requests.post(
    f"{API_BASE}/v1/datasets/{dataset_id}/generations",
    headers={**auth,
             "content-type": "application/json",
             "Idempotency-Key": f"generation-{int(time.time())}"},
    json=scenario, timeout=60,
).json()
meta = wait_for_generation(gen)

# 7. Download the artifact. artifact_url is absolute for enterprise
# tenants and relative (managed-mode relay) otherwise.
artifact_url = meta["artifact_url"]
if not artifact_url.startswith("http"):
    artifact_url = f"{API_BASE}{artifact_url}"
art = requests.get(artifact_url, headers=auth, timeout=120)
art.raise_for_status()
with open("synthetic.parquet", "wb") as fh:
    fh.write(art.content)

# 8. Read the result.
synth = pd.read_parquet("synthetic.parquet")
print(len(synth), "rows")
print(synth.head())
```
### R
Uses `httr2` for HTTP and `arrow` for Parquet I/O.
```r
library(httr2)
library(arrow)

AUTH0_DOMAIN <- Sys.getenv("SIMIO_AUTH0_DOMAIN")
AUTH0_AUDIENCE <- Sys.getenv("SIMIO_AUTH0_AUDIENCE")
CLIENT_ID <- Sys.getenv("SIMIO_CLIENT_ID")
CLIENT_SECRET <- Sys.getenv("SIMIO_CLIENT_SECRET")
API_BASE <- Sys.getenv("SIMIO_API_BASE")

# 1. Mint a bearer token.
token <- request(paste0("https://", AUTH0_DOMAIN, "/oauth/token")) |>
  req_method("POST") |>
  req_body_json(list(
    client_id = CLIENT_ID,
    client_secret = CLIENT_SECRET,
    audience = AUTH0_AUDIENCE,
    grant_type = "client_credentials"
  )) |>
  req_perform() |> resp_body_json()
TOKEN <- token$access_token
auth <- function(req) req_auth_bearer_token(req, TOKEN)

# 2. Build a toy seed dataset and write it to seed.csv.
set.seed(7); n <- 800
seed <- data.frame(
  age = sample(18:65, n, replace = TRUE),
  segment = sample(c("Value", "Mainstream", "Premium"), n,
                   replace = TRUE, prob = c(0.35, 0.45, 0.20)),
  channel = sample(c("Retail", "Online", "Club"), n,
                   replace = TRUE, prob = c(0.50, 0.35, 0.15)),
  region = sample(c("Northeast", "South", "Midwest", "West"), n,
                  replace = TRUE)
)
intent <- 42 +
  (seed$segment == "Premium") * 18 +
  (seed$channel == "Online") * 8 +
  pmin(pmax((seed$age - 35) / 3, -8), 8) +
  rnorm(n, 0, 10)
seed$purchase_intent <- pmax(pmin(round(intent), 100), 0)
write.csv(seed, "seed.csv", row.names = FALSE)

# Helper: a 202 response is normal for async work. Poll the job
# until it is completed. Production clients should add jittered
# exponential backoff and a deadline appropriate for their workload.
poll_job <- function(job_id, field) {
  deadline <- Sys.time() + 600
  repeat {
    if (Sys.time() > deadline) stop("job ", job_id, " timed out")
    body <- request(paste0(API_BASE, "/v1/jobs/", job_id)) |>
      auth() |> req_perform() |> resp_body_json()
    if (body$status %in% c("failed", "expired", "cancelled"))
      stop("job ", job_id, " ", body$status)
    if (identical(body$status, "completed")) {
      value <- body[[field]]
      if (!is.null(value) && nzchar(value)) return(value)
      stop("job ", job_id, " completed without ", field)
    }
    Sys.sleep(2)
  }
}

wait_for_generation <- function(initial) {
  generation_id <- initial$generation_id
  if (identical(initial$status, "processing") && !is.null(initial$job_id))
    generation_id <- poll_job(initial$job_id, "generation_id")
  if (is.null(generation_id) || !nzchar(generation_id))
    stop("generation response lacked generation_id")
  deadline <- Sys.time() + 600
  repeat {
    if (Sys.time() > deadline) stop("generation ", generation_id, " timed out")
    meta <- request(paste0(API_BASE, "/v1/generations/", generation_id)) |>
      auth() |> req_perform() |> resp_body_json()
    if (meta$status %in% c("ready", "partial") &&
        !is.null(meta$artifact_url) && nzchar(meta$artifact_url))
      return(meta)
    if (identical(meta$status, "failed")) stop("generation failed")
    Sys.sleep(2)
  }
}

# 3. Upload seed + train.
upload <- request(paste0(API_BASE, "/v1/datasets")) |>
  req_method("POST") |> auth() |>
  req_headers(`Idempotency-Key` =
                paste0("dataset-", as.integer(Sys.time()))) |>
  req_body_multipart(
    seed_file = curl::form_file("seed.csv", type = "text/csv"),
    display_name = "Quickstart seed",
    wait_seconds = "10"
  ) |>
  req_perform() |> resp_body_json()
dataset_id <- if (identical(upload$status, "ready") &&
                  !is.null(upload$dataset_id) && nzchar(upload$dataset_id))
  upload$dataset_id else poll_job(upload$job_id, "dataset_id")

# 4. Inspect the trained schema.
schema <- request(paste0(API_BASE, "/v1/datasets/", dataset_id,
                         "/schema")) |>
  auth() |> req_perform() |> resp_body_json()
schema_cols <- schema$schema
schema_names <- vapply(schema_cols, `[[`, character(1), "name")

resolve_column <- function(...) {
  candidates <- c(...)
  hit <- candidates[candidates %in% schema_names]
  if (length(hit)) return(hit[[1]])
  normalized <- setNames(schema_names, tolower(gsub("_", ".", schema_names)))
  for (candidate in candidates) {
    key <- tolower(gsub("_", ".", candidate))
    # `[` yields NA for a missing name; `[[` would stop with
    # "subscript out of bounds" before the next candidate is tried.
    hit <- unname(normalized[key])
    if (!is.na(hit)) return(hit)
  }
  stop("None of ", paste(candidates, collapse = ", "),
       " is in the cleaned schema. Available: ",
       paste(schema_names, collapse = ", "))
}

require_levels <- function(column, levels) {
  col <- schema_cols[[match(column, schema_names)]]
  available <- if (is.null(col$levels)) character() else unlist(col$levels)
  missing <- setdiff(levels, available)
  if (length(missing)) stop("Schema column ", column,
                            " is missing levels: ",
                            paste(missing, collapse = ", "))
}

segment_col <- resolve_column("segment")
channel_col <- resolve_column("channel")
age_col <- resolve_column("age")
intent_col <- resolve_column("purchase_intent", "purchase.intent")
require_levels(segment_col, c("Premium", "Mainstream", "Value"))
require_levels(channel_col, c("Online", "Retail", "Club"))

# 5. Define a scenario as ordinary data frames, then convert to the
# API condition object. target_share values are desired outcome
# percentages, subject to feasibility jitter.
categorical_targets <- data.frame(
  column = c(segment_col, segment_col, segment_col,
             channel_col, channel_col, channel_col),
  level = c("Premium", "Mainstream", "Value",
            "Online", "Retail", "Club"),
  target_share = c(0.55, 0.35, 0.10, 0.70, 0.20, 0.10)
)
numeric_ranges <- data.frame(
  column = c(age_col, intent_col),
  min = c(25, 70),
  max = c(44, 100)
)
categorical_conditions <- lapply(
  split(categorical_targets, categorical_targets$column),
  function(x) as.list(setNames(x$target_share, x$level))
)
numeric_conditions <- setNames(
  lapply(seq_len(nrow(numeric_ranges)), function(i) {
    list(min = numeric_ranges$min[[i]], max = numeric_ranges$max[[i]])
  }),
  numeric_ranges$column
)
scenario <- list(
  row_count = 5000L,
  output_format = "parquet",
  seed = 20260430L,
  wait_seconds = 20L,
  conditions = list(
    categorical = categorical_conditions,
    numeric = numeric_conditions
  )
)

# 6. Generate the scenario-conditioned synthetic artifact.
gen <- request(paste0(API_BASE, "/v1/datasets/", dataset_id,
                      "/generations")) |>
  req_method("POST") |> auth() |>
  req_headers(`Idempotency-Key` =
                paste0("generation-", as.integer(Sys.time()))) |>
  req_body_json(scenario) |>
  req_perform() |> resp_body_json()
meta <- wait_for_generation(gen)

# 7. Download the artifact.
artifact_url <- if (startsWith(meta$artifact_url, "http"))
  meta$artifact_url else paste0(API_BASE, meta$artifact_url)
request(artifact_url) |> auth() |>
  req_perform(path = "synthetic.parquet")

# 8. Read the result.
synth <- read_parquet("synthetic.parquet")
cat(nrow(synth), "rows\n")
head(synth)
```
### Julia
Uses `HTTP.jl` for requests and `JSON3` for JSON, `DataFrames` + `CSV` +
`StatsBase` for the seed, and `Parquet2` for the result.
```julia
using HTTP, JSON3, DataFrames, CSV, Random, StatsBase, Parquet2
AUTH0_DOMAIN = ENV["SIMIO_AUTH0_DOMAIN"]
AUTH0_AUDIENCE = ENV["SIMIO_AUTH0_AUDIENCE"]
CLIENT_ID = ENV["SIMIO_CLIENT_ID"]
CLIENT_SECRET = ENV["SIMIO_CLIENT_SECRET"]
API_BASE = ENV["SIMIO_API_BASE"]
# 1. Mint a bearer token.
token_resp = HTTP.post(
"https://$AUTH0_DOMAIN/oauth/token",
["content-type" => "application/json"],
JSON3.write(Dict(
"client_id" => CLIENT_ID,
"client_secret" => CLIENT_SECRET,
"audience" => AUTH0_AUDIENCE,
"grant_type" => "client_credentials",
)),
)
TOKEN = JSON3.read(token_resp.body).access_token
auth() = ["Authorization" => "Bearer $TOKEN"]
# 2. Build a toy seed dataset and write it to seed.csv.
Random.seed!(7)
n = 800
df = DataFrame(
age = rand(18:65, n),
segment = sample(["Value", "Mainstream", "Premium"],
StatsBase.Weights([0.35, 0.45, 0.20]), n),
channel = sample(["Retail", "Online", "Club"],
StatsBase.Weights([0.50, 0.35, 0.15]), n),
region = rand(["Northeast", "South", "Midwest", "West"], n),
)
intent = 42.0 .+
(df.segment .== "Premium") .* 18 .+
(df.channel .== "Online") .* 8 .+
clamp.((df.age .- 35) ./ 3, -8, 8) .+
randn(n) .* 10
df.purchase_intent = clamp.(round.(Int, intent), 0, 100)
CSV.write("seed.csv", df)
# Helper: a 202 response is normal for async work. Poll the job
# until it is completed. Production clients should add jittered
# exponential backoff and a deadline appropriate for their workload.
function poll_job(job_id, field)
deadline = time() + 600
while true
time() > deadline && error("job $job_id timed out")
r = HTTP.get("$API_BASE/v1/jobs/$job_id", auth())
body = JSON3.read(r.body)
body.status in ("failed", "expired", "cancelled") &&
error("job $job_id $(body.status)")
if body.status == "completed"
v = get(body, Symbol(field), nothing)
v === nothing && error("job $job_id completed without $field")
return string(v)
end
sleep(2)
end
end
function wait_for_generation(initial)
generation_id = get(initial, :generation_id, nothing)
if get(initial, :status, "") == "processing" &&
get(initial, :job_id, nothing) !== nothing
generation_id = poll_job(initial.job_id, "generation_id")
end
generation_id === nothing && error("generation response lacked generation_id")
deadline = time() + 600
while true
time() > deadline && error("generation $generation_id timed out")
meta = JSON3.read(HTTP.get(
"$API_BASE/v1/generations/$generation_id", auth()).body)
if meta.status in ("ready", "partial") &&
get(meta, :artifact_url, nothing) !== nothing
return meta
end
meta.status == "failed" && error("generation failed")
sleep(2)
end
end
# 3. Upload seed + train.
upload = HTTP.post(
"$API_BASE/v1/datasets",
vcat(auth(), ["Idempotency-Key" => "dataset-$(round(Int, time()))"]),
HTTP.Form(Dict(
"seed_file" => HTTP.Multipart("seed.csv", open("seed.csv"),
"text/csv"),
"display_name" => "Quickstart seed",
"wait_seconds" => "10",
)),
)
upload_body = JSON3.read(upload.body)
dataset_id = get(upload_body, :dataset_id, nothing)
dataset_id = get(upload_body, :status, "") == "ready" &&
!isnothing(dataset_id) ? string(dataset_id) :
poll_job(upload_body.job_id, "dataset_id")
# 4. Inspect the trained schema.
schema = JSON3.read(HTTP.get(
"$API_BASE/v1/datasets/$dataset_id/schema", auth()).body)
schema_cols = collect(schema.schema)
schema_names = [String(col.name) for col in schema_cols]
function resolve_column(candidates...)
for candidate in candidates
string(candidate) in schema_names && return string(candidate)
end
normalized = Dict(lowercase(replace(name, "_" => ".")) => name
for name in schema_names)
for candidate in candidates
key = lowercase(replace(string(candidate), "_" => "."))
haskey(normalized, key) && return normalized[key]
end
error("None of $(candidates) is in the cleaned schema. Available: $schema_names")
end
function require_levels(column, levels)
idx = findfirst(==(column), schema_names)
# `levels` may be absent or null for non-categorical columns.
raw_levels = get(schema_cols[idx], :levels, nothing)
available = raw_levels === nothing ? Set{String}() : Set(string.(raw_levels))
# Named `absent` so the local does not shadow Julia's `missing` value.
absent = setdiff(Set(levels), available)
isempty(absent) || error("Schema column $column is missing levels: " *
join(sort!(collect(absent)), ", "))
end
segment_col = resolve_column("segment")
channel_col = resolve_column("channel")
age_col = resolve_column("age")
intent_col = resolve_column("purchase_intent", "purchase.intent")
require_levels(segment_col, ["Premium", "Mainstream", "Value"])
require_levels(channel_col, ["Online", "Retail", "Club"])
# 5. Define a scenario as ordinary DataFrames, then convert to the
# API condition object. target_share values are target shares given
# as fractions (0.55 = 55%), subject to feasibility jitter.
categorical_targets = DataFrame(
column = [segment_col, segment_col, segment_col,
channel_col, channel_col, channel_col],
level = ["Premium", "Mainstream", "Value",
"Online", "Retail", "Club"],
target_share = [0.55, 0.35, 0.10, 0.70, 0.20, 0.10],
)
numeric_ranges = DataFrame(
column = [age_col, intent_col],
min = [25, 70],
max = [44, 100],
)
function categorical_conditions(targets)
Dict(
col => Dict(row.level => row.target_share
for row in eachrow(targets[targets.column .== col, :]))
for col in unique(targets.column)
)
end
numeric_conditions = Dict(
row.column => Dict("min" => row.min, "max" => row.max)
for row in eachrow(numeric_ranges)
)
scenario = Dict(
"row_count" => 5000,
"output_format" => "parquet",
"seed" => 20260430,
"wait_seconds" => 20,
"conditions" => Dict(
"categorical" => categorical_conditions(categorical_targets),
"numeric" => numeric_conditions,
),
)
# 6. Generate the scenario-conditioned synthetic artifact.
gen = HTTP.post(
"$API_BASE/v1/datasets/$dataset_id/generations",
vcat(auth(), ["content-type" => "application/json",
"Idempotency-Key" => "generation-$(round(Int, time()))"]),
JSON3.write(scenario),
)
gen_body = JSON3.read(gen.body)
meta = wait_for_generation(gen_body)
# 7. Download the artifact.
artifact_url = startswith(meta.artifact_url, "http") ?
String(meta.artifact_url) : "$API_BASE$(meta.artifact_url)"
open("synthetic.parquet", "w") do io
write(io, HTTP.get(artifact_url, auth()).body)
end
# 8. Read the result.
synth = DataFrame(Parquet2.Dataset("synthetic.parquet"))
println(nrow(synth), " rows")
first(synth, 5)
```
### SPSS
SPSS does not have a native HTTP client, so the realistic flow is to
drive the API from a small shell wrapper and then read the result with
SPSS syntax. SPSS users typically already have survey data to use as a
seed rather than synthesizing one; `seed_file` accepts CSV, Parquet,
Excel, and SAV directly.
Requires `curl` and `jq` (see [jq install docs](https://jqlang.org/download/))
on the machine that runs the shell script.
The shell wrapper resolves cleaned column names client-side; the API
also validates submitted columns and levels against the cleaned schema
and returns a 400 with a schema hint if they do not match.
```sh
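# Run with bash: the polling helpers rely on bash's SECONDS counter.
# Stop on the first failed command, so an `exit 1` raised inside a
# $(...) helper call aborts the script instead of continuing with an
# empty value.
set -eu
# 1. Mint a bearer token.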
TOKEN=$(curl -sS -X POST "https://${SIMIO_AUTH0_DOMAIN}/oauth/token" \
-H "content-type: application/json" \
-d "{
\"client_id\":\"${SIMIO_CLIENT_ID}\",
\"client_secret\":\"${SIMIO_CLIENT_SECRET}\",
\"audience\":\"${SIMIO_AUTH0_AUDIENCE}\",
\"grant_type\":\"client_credentials\"
}" | jq -r ".access_token")
AUTH="Authorization: Bearer ${TOKEN}"
# Helper: poll a 202 job until it completes. Production clients
# should add jittered exponential backoff and a deadline matched
# to their workload.
poll_job_field() {
job_id="$1"; field="$2"; deadline=$((SECONDS + 600))
while [ "${SECONDS}" -lt "${deadline}" ]; do
sleep 2
job=$(curl -sS -H "${AUTH}" "${SIMIO_API_BASE}/v1/jobs/${job_id}")
status=$(echo "${job}" | jq -r ".status // empty")
if [ "${status}" = "failed" ] || [ "${status}" = "expired" ] || [ "${status}" = "cancelled" ]; then
echo "${job}" >&2; exit 1
fi
if [ "${status}" = "completed" ]; then
value=$(echo "${job}" | jq -r ".${field} // empty")
if [ -n "${value}" ]; then printf "%s" "${value}"; return 0; fi
echo "${job}" >&2; echo "job completed without ${field}" >&2; exit 1
fi
done
echo "job ${job_id} timed out" >&2; exit 1
}
wait_generation_artifact() {
generation_id="$1"; deadline=$((SECONDS + 600))
while [ "${SECONDS}" -lt "${deadline}" ]; do
meta=$(curl -sS -H "${AUTH}" "${SIMIO_API_BASE}/v1/generations/${generation_id}")
status=$(echo "${meta}" | jq -r ".status // empty")
artifact=$(echo "${meta}" | jq -r ".artifact_url // empty")
if { [ "${status}" = "ready" ] || [ "${status}" = "partial" ]; } && [ -n "${artifact}" ]; then
printf "%s" "${meta}"; return 0
fi
if [ "${status}" = "failed" ]; then echo "${meta}" >&2; exit 1; fi
sleep 2
done
echo "generation ${generation_id} timed out" >&2; exit 1
}
# 3. Upload seed + train. Step 2 of the other quickstarts (building a
# toy seed) is skipped here: use your existing SAV/CSV file. survey.sav
# is an example name.
UPLOAD=$(curl -sS -X POST "${SIMIO_API_BASE}/v1/datasets" \
-H "${AUTH}" -H "Idempotency-Key: dataset-$(date +%s)" \
-F "seed_file=@./survey.sav;type=application/x-spss-sav" \
-F "display_name=SPSS quickstart seed" \
-F "wait_seconds=10")
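DATASET_ID=$(echo "${UPLOAD}" | jq -r ".dataset_id // empty")
JOB_ID=$(echo "${UPLOAD}" | jq -r ".job_id // empty")
UPLOAD_STATUS=$(echo "${UPLOAD}" | jq -r ".status // empty")
# Poll unless the dataset is already trained, as the R and Julia
# quickstarts do.
if [ "${UPLOAD_STATUS}" != "ready" ] || [ -z "${DATASET_ID}" ]; then
DATASET_ID=$(poll_job_field "${JOB_ID}" dataset_id)
fi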
# 4. Inspect the trained schema. Use these cleaned names in
# conditions; do not assume original seed headers survived.
SCHEMA=$(curl -sS -H "${AUTH}" \
"${SIMIO_API_BASE}/v1/datasets/${DATASET_ID}/schema")
echo "${SCHEMA}" | jq .
resolve_col() {
a="$1"; b="${2:-$1}"
echo "${SCHEMA}" | jq -r --arg a "${a}" --arg b "${b}" \
'.schema[] | select(.name == $a or .name == $b) | .name' | head -n 1
}
SEGMENT_COL=$(resolve_col segment)
CHANNEL_COL=$(resolve_col channel)
AGE_COL=$(resolve_col age)
INTENT_COL=$(resolve_col purchase_intent purchase.intent)
if [ -z "${SEGMENT_COL}" ] || [ -z "${CHANNEL_COL}" ] || [ -z "${AGE_COL}" ] || [ -z "${INTENT_COL}" ]; then
echo "Expected quickstart columns were not all retained after cleaning" >&2; exit 1
fi
# 5. Build the scenario request in a shell variable. Categorical
# values are target shares given as fractions (0.55 = 55%). Use
# output_format=csv so the artifact lands in a format SPSS reads directly.
SCENARIO=$(jq -cn --arg segment "${SEGMENT_COL}" --arg channel "${CHANNEL_COL}" \
--arg age "${AGE_COL}" --arg intent "${INTENT_COL}" '{
row_count: 5000,
output_format: "csv",
seed: 20260430,
wait_seconds: 20,
conditions: {
categorical: {
($segment): {Premium: 0.55, Mainstream: 0.35, Value: 0.10},
($channel): {Online: 0.70, Retail: 0.20, Club: 0.10}
},
numeric: {
($age): {min: 25, max: 44},
($intent): {min: 70, max: 100}
}
}
}')
# 6. Generate the scenario-conditioned synthetic artifact.
GEN=$(curl -sS -X POST "${SIMIO_API_BASE}/v1/datasets/${DATASET_ID}/generations" \
-H "${AUTH}" -H "content-type: application/json" \
-H "Idempotency-Key: generation-$(date +%s)" \
--data "${SCENARIO}")
GENERATION_ID=$(echo "${GEN}" | jq -r ".generation_id // empty")
GEN_JOB_ID=$(echo "${GEN}" | jq -r ".job_id // empty")
GEN_STATUS=$(echo "${GEN}" | jq -r ".status // empty")
if [ "${GEN_STATUS}" = "processing" ] && [ -n "${GEN_JOB_ID}" ]; then
GENERATION_ID=$(poll_job_field "${GEN_JOB_ID}" generation_id)
fi
if [ -z "${GENERATION_ID}" ]; then
echo "generation response lacked generation_id" >&2; exit 1
fi
# 7. Wait for the artifact, then download it.
GEN_META=$(wait_generation_artifact "${GENERATION_ID}")
ARTIFACT_URL=$(echo "${GEN_META}" | jq -r ".artifact_url")
case "${ARTIFACT_URL}" in
http*) curl -sS -L "${ARTIFACT_URL}" -o synthetic.csv ;;
*) curl -sS -L -H "${AUTH}" \
"${SIMIO_API_BASE}${ARTIFACT_URL}" -o synthetic.csv ;;
esac
```
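Then read the result in SPSS. Adjust the variable list to match the
column names, order, and types in the downloaded CSV; these are the
cleaned schema names, which may differ from your seed's original headers: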
```spss
GET DATA
/TYPE=TXT
/FILE='synthetic.csv'
/ENCODING='UTF8'
/DELCASE=LINE
/DELIMITERS=","
/QUALIFIER='"'
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
age F8.0
segment A20
channel A20
region A20
purchase_intent F8.0.
CACHE.
EXECUTE.
```
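Because the `/VARIABLES` list has to match the downloaded CSV column for column, it can help to draft it from the file itself. The following is a minimal Python sketch, not part of the API: `spss_variables` and `guess_format` are hypothetical helper names, and the format guesses (`F8.0` for whole numbers, `F8.2` for other numbers, `A<width>` for strings) are assumptions to review before pasting into SPSS syntax.

```python
import csv
import io
from itertools import islice


def guess_format(values: list[str]) -> str:
    """Guess an SPSS format from sample values (heuristic, review by hand)."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # String column: size by UTF-8 byte length, at least 1.
        width = max((len(v.encode("utf-8")) for v in values), default=1)
        return f"A{max(width, 1)}"
    return "F8.0" if all(n.is_integer() for n in numbers) else "F8.2"


def spss_variables(csv_text: str, sample_rows: int = 200) -> str:
    """Draft a /VARIABLES subcommand from a CSV header and sample rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(islice(reader, sample_rows))
    lines = []
    for i, name in enumerate(header):
        # Ignore empty cells so blanks do not turn numbers into strings.
        values = [row[i] for row in rows if i < len(row) and row[i] != ""]
        lines.append(f"    {name} {guess_format(values)}")
    return "  /VARIABLES=\n" + "\n".join(lines) + "."


if __name__ == "__main__":
    with open("synthetic.csv", newline="", encoding="utf-8") as handle:
        print(spss_variables(handle.read()))
```

String widths are taken from the sampled rows only, so widen them if later rows may hold longer values.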
## Conditioning Reference
- Categorical values under `conditions.categorical` are target shares
expressed as fractions: `0.70` means a 70% target share for that
level, subject to feasibility jitter.
- Numeric ranges under `conditions.numeric` are two-sided: include both
`min` and `max`.
- Always derive column names and levels from
`/v1/datasets/{dataset_id}/schema`. Do not guess from memory or
prompt context.
- Tight scenarios can produce fewer rows than requested even when
quota remains; usage billing counts generated rows, not requested
rows.
- Send an `Idempotency-Key` header on every `POST /v1/datasets` and
`POST /v1/datasets/{dataset_id}/generations` so retries are safe.
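The rules above can be sketched as a small validation step that runs before a generation request is sent. This is an illustrative Python sketch, not the official client: `build_conditions` is a hypothetical helper, the schema shape (`schema` list with `name` and optional `levels`) is inferred from the quickstart code, and the check that shares fall in `[0, 1]` is an assumption based on the fraction convention.

```python
from typing import Any


def build_conditions(
    schema: dict[str, Any],
    categorical: dict[str, dict[str, float]],
    numeric: dict[str, tuple[float, float]],
) -> dict[str, Any]:
    """Validate targets against a schema response and build `conditions`."""
    # Assumed shape: {"schema": [{"name": ..., "levels": [...]}, ...]}
    columns = {col["name"]: col for col in schema["schema"]}

    for column, shares in categorical.items():
        if column not in columns:
            raise ValueError(f"unknown column: {column}")
        levels = set(columns[column].get("levels") or [])
        missing = sorted(set(shares) - levels)
        if missing:
            raise ValueError(f"{column} is missing levels: {missing}")
        for level, share in shares.items():
            # Assumption: shares are fractions, so 0.70 means 70%.
            if not 0.0 <= share <= 1.0:
                raise ValueError(f"{column}.{level}: share must be in [0, 1]")

    numeric_out: dict[str, dict[str, float]] = {}
    for column, (low, high) in numeric.items():
        if column not in columns:
            raise ValueError(f"unknown column: {column}")
        if low > high:
            raise ValueError(f"{column}: min {low} exceeds max {high}")
        # Ranges are two-sided: always emit both bounds.
        numeric_out[column] = {"min": low, "max": high}

    return {
        "categorical": {c: dict(s) for c, s in categorical.items()},
        "numeric": numeric_out,
    }
```

Failing locally on an unknown column or level avoids spending a request on a scenario the API would reject with a `400`.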
## Notes For Coding Agents
Agents should start at `/llms.txt`, use `/openapi.json` as the
canonical contract, and fetch `/v1/datasets/{dataset_id}/schema` before
constructing conditions. Do not infer private columns, dataset IDs,
or credential values from examples. Preserve `X-Request-Id` and any
response-body `request_id` in logs for support, but never log bearer
tokens or client secrets.
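One way to follow the logging guidance is to pull out the support identifiers and redact secrets before anything reaches a log line. The Python sketch below is a minimal illustration: `redact`, `support_ids`, and the `SECRET_KEYS` set are hypothetical names, and the key list is an assumption drawn from the fields the quickstarts handle (`Authorization`, `client_secret`, `access_token`).

```python
from typing import Any, Mapping

# Keys treated as secrets in this sketch, compared case-insensitively.
SECRET_KEYS = {"authorization", "client_secret", "access_token"}


def redact(fields: Mapping[str, Any]) -> dict[str, Any]:
    """Copy headers or a JSON body, masking values of secret keys."""
    return {
        key: "[REDACTED]" if key.lower() in SECRET_KEYS else value
        for key, value in fields.items()
    }


def support_ids(headers: Mapping[str, str], body: Mapping[str, Any]) -> dict[str, Any]:
    """Collect the identifiers worth keeping for support requests."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {
        "x_request_id": lowered.get("x-request-id"),
        "request_id": body.get("request_id"),
    }


if __name__ == "__main__":
    request_headers = {"Authorization": "Bearer example", "Accept": "application/json"}
    print(redact(request_headers))
    print(support_ids({"X-Request-Id": "req-123"}, {"request_id": "req-123"}))
```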
## Client Operating Rules
- Treat `202 Accepted` as normal; poll `/v1/jobs/{job_id}`.
- Store `X-Request-Id` and any response-body `request_id` for support.
- Use `/v1/datasets/{dataset_id}/schema` before conditional generation.
- Tight scenarios can produce fewer rows than requested even when
quota remains; usage billing counts generated rows, not requested
rows.
- Do not put client secrets in browsers, shared notebooks, or logs.
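The polling rule above can be sketched with jittered exponential backoff and a deadline, which the quickstarts recommend for production clients in place of a fixed 2-second sleep. This Python sketch is an assumption-laden illustration: `poll_job` and its parameters are hypothetical, `fetch_job` stands in for whatever function performs `GET /v1/jobs/{job_id}` and returns the parsed body, and the delay values are placeholders. The status names (`completed`, `failed`, `expired`, `cancelled`) are the ones the quickstart helpers check.

```python
import random
import time
from typing import Any, Callable

TERMINAL_FAILURES = {"failed", "expired", "cancelled"}


def poll_job(
    fetch_job: Callable[[], dict[str, Any]],
    *,
    deadline_seconds: float = 600.0,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Poll until the job completes, fails, or the deadline passes."""
    deadline = clock() + deadline_seconds
    attempt = 0
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "completed":
            return job
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"job ended with status {status}")
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("job did not complete before the deadline")
        # Full jitter: sleep a random time up to the capped exponential delay.
        cap = min(max_delay, base_delay * 2 ** min(attempt, 16))
        sleep(min(remaining, random.uniform(0, cap)))
        attempt += 1
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting; a caller would typically supply only `fetch_job`.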