# Simulacra Headless API

> Operating guide for the Simulacra Headless API.

## Canonical URLs

- [Interactive API docs](https://api.simulacra-data.com/__docs__/)
- [OpenAPI JSON, primary contract](https://api.simulacra-data.com/openapi.json)
- [Error catalog](https://api.simulacra-data.com/errors)
- [Full agent operating manual](https://api.simulacra-data.com/llms-full.txt)
- [Health check](https://api.simulacra-data.com/healthz)

Use HTTPS endpoints and `/openapi.json` as the machine-readable contract for this API.

## Core Workflow

1. If credentials do not exist: `POST /v1/signup`, poll `/v1/signup/{request_id}`, then claim the one-time secret at `POST /v1/credential-claims`.
2. Mint an Auth0 machine-to-machine bearer token.
3. `POST /v1/datasets` with a seed file and an `Idempotency-Key`.
4. Poll `/v1/jobs/{job_id}` when the response is async.
5. `GET /v1/datasets/{dataset_id}/schema` before conditioning.
6. `POST /v1/datasets/{dataset_id}/generations` with an `Idempotency-Key`.
7. Poll `/v1/jobs/{job_id}` or `/v1/generations/{generation_id}`.
8. Download `artifact_url` after status is `ready` or `partial`.

## Retry And Polling

- Treat `202 Accepted` as progress, not failure.
- Poll every 2 seconds in examples; production agents should use jittered exponential backoff with a deadline.
- Retry transient `429`, `500`, `502`, `503`, and `504` only with the same `Idempotency-Key` for POST requests.
- Do not retry `400`, `401`, `402`, `403`, or `404` without changing the request.
- When an error body includes `code`, fetch `/errors/{code}` before deciding whether the request is retry-safe.

## Coding Agent Rules

- Treat `/openapi.json` as canonical.
- Prefer stable `operationId` values over path-derived names.
- Follow OpenAPI Links when choosing the next operation.
- Generate clients from `/openapi.json` when useful.
- Reuse helper patterns for token minting, idempotent POST retries, polling, schema-first conditions, downloads, and problem-code handling.
- Treat operations with `x-openai-isConsequential: true` as customer-affecting actions that require explicit user intent.
- Use idempotency keys on POST retries.
- Do not guess categorical levels or numeric ranges; fetch schema.
- Categorical condition values are desired outcome percentages, not internal model parameters.
- Numeric conditions require both `min` and `max`.
- Tight scenarios can produce fewer rows than requested even when quota remains; usage-billing is counted on generated rows rather than requested rows.
- Preserve `X-Request-Id`, body `request_id`, `job_id`, `dataset_id`, and `generation_id` for support.
- Preserve problem `code` and `type`; they are stable diagnostics.
- Never log bearer tokens, client secrets, customer data, claim tokens, or artifact URLs.

## Security And Retention Defaults

- Access is company-approved and authenticated with Auth0 M2M tokens.
- SLA, privacy, data-processing, and security-documentation terms are governed by each customer's enterprise agreement.
- Dataset and artifact retention are explicit API lifecycle concepts; do not assume indefinite storage.

## Overview And Examples

## What Is Simulacra?

Simulacra is a research simulation and what-if scenario modeling platform for consumer and market research teams. It augments existing studies with high-fidelity synthetic data so teams can expand sample sizes, rebalance cohorts, explore low-incidence audiences, and build scenario models from the data they already trust.

The differentiator is conditioning: instead of only asking for more rows, you can ask what the full dataset should look like under a specific desired outcome mix, such as a premium-heavy segment, a younger target audience, or a high-intent buying scenario. Simulacra then generates a coherent synthetic dataset around that scenario, subject to feasibility under the trained model.
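
To make the conditioning idea concrete, the sketch below assembles a `conditions` object and checks it against a cleaned schema before any request is sent. The request shape (a `categorical` map of column to level shares, a `numeric` map of column to `min`/`max`) follows the quickstart scripts in this guide. The helper name `build_conditions`, the toy schema, and the share checks (each share between 0 and 1, totals not above 1) are illustrative assumptions, not documented API rules.

```python
from typing import Dict, List, Mapping


def build_conditions(
    schema_columns: List[dict],
    categorical: Mapping[str, Mapping[str, float]],
    numeric: Mapping[str, Mapping[str, float]],
) -> Dict[str, dict]:
    """Check a scenario against the cleaned schema and return a conditions object.

    Hypothetical client-side helper; the checks are assumptions, not API rules.
    """
    by_name = {col["name"]: col for col in schema_columns}

    for column, shares in categorical.items():
        if column not in by_name:
            raise KeyError(f"{column!r} is not in the cleaned schema")
        # Only condition on levels that survived cleaning.
        unknown = set(shares) - set(by_name[column].get("levels") or [])
        if unknown:
            raise KeyError(f"{column!r} has no levels {sorted(unknown)}")
        # Assumed sanity checks: shares are fractions and do not exceed 100%.
        if any(not 0.0 <= share <= 1.0 for share in shares.values()):
            raise ValueError(f"{column!r} shares must be between 0 and 1")
        if sum(shares.values()) > 1.0 + 1e-9:
            raise ValueError(f"{column!r} shares add up to more than 1")

    for column, bounds in numeric.items():
        if column not in by_name:
            raise KeyError(f"{column!r} is not in the cleaned schema")
        # Numeric conditions require both min and max.
        if "min" not in bounds or "max" not in bounds:
            raise ValueError(f"{column!r} needs both min and max")
        if bounds["min"] > bounds["max"]:
            raise ValueError(f"{column!r} has min greater than max")

    return {
        "categorical": {col: dict(shares) for col, shares in categorical.items()},
        "numeric": {
            col: {"min": b["min"], "max": b["max"]} for col, b in numeric.items()
        },
    }


# Toy schema in the shape the quickstart reads: a list of column entries.
schema_columns = [
    {"name": "segment", "levels": ["Value", "Mainstream", "Premium"]},
    {"name": "age"},
]

conditions = build_conditions(
    schema_columns,
    categorical={"segment": {"Premium": 0.55, "Mainstream": 0.35, "Value": 0.10}},
    numeric={"age": {"min": 25, "max": 44}},
)
scenario = {"row_count": 5000, "output_format": "parquet", "conditions": conditions}
```

Failing early on unknown columns, unknown levels, or a missing bound keeps an avoidable `400` out of the request path; the API remains the authority on whether a scenario is feasible.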
The Headless API exposes that workflow programmatically: upload a seed dataset, wait for training, generate scenario-conditioned synthetic rows, and download the result as Parquet, CSV, or Arrow.

## Security, Compliance, And Retention

This API serves approved company tenants, not anonymous public traffic. Resource routes require Auth0 machine-to-machine bearer tokens, operator actions are audited, and production infrastructure is monitored with security alerts and health probes.

![SOC 2 audited controls](/assets/compliance/soc2.png)
![ISO/IEC 27001 certified controls](/assets/compliance/iso27001.png)

Simulacra maintains SOC 2 audited and ISO/IEC 27001 certified controls for this API and the broader platform.

- **Authentication:** OAuth2 client-credentials through Auth0. Treat `client_secret` as a production secret and store it in your secret manager.
- **Authorization:** credentials are tenant-scoped and tied to a Simulacra-approved company. Sales and support users do not handle customer `client_secret` values.
- **Transport and storage:** API traffic uses TLS. Retained customer artifacts are encrypted at rest; standard managed mode uses Simulacra-managed controls, while enterprise storage mode can deliver artifacts through customer-controlled storage/key paths.
- **Default retention:** trained dataset artifacts and generated outputs default to 24-hour retention. Explicit extension is bounded by a 7-day maximum continuous dataset retention window. Managed artifact download URLs are short-lived, with a 15-minute default.
- **Delete semantics:** `DELETE /v1/datasets/{dataset_id}` removes active dataset access and associated retrievable dataset artifacts from the API surface; generation artifacts expire on their own retention windows.
- **Secrets and claims:** approved signup and rotation secrets are delivered through encrypted one-time credential claims. Claim tokens expire and cannot be reused after the secret is claimed.
- **Reliability:** long-running work is represented as async jobs. Preserve `X-Request-Id`, body `request_id`, `job_id`, `dataset_id`, and `generation_id` in your own support logs.
- **SLA and procurement:** these defaults are platform controls. Uptime commitments, support response targets, data-processing terms, and audit report access are governed by your enterprise agreement with Simulacra.

## End-To-End Setup Flow

Use this as the happy path for a first integration. The endpoint pages below are detailed references; this flow shows how their values connect.

1. Request access with `POST /v1/signup`; save the returned `request_id`. Re-submitting the same contact email returns the existing pending or approved request instead of creating a second queue entry.
2. Poll `GET /v1/signup/{request_id}` until the request is approved.
3. Exchange the approved `credential_claim_token` at `POST /v1/credential-claims`; store the returned `client_secret` immediately.
4. Mint an Auth0 bearer token with the client credentials.
5. Upload a seed dataset with `POST /v1/datasets`; save `dataset_id` and poll `job_id` when the response is async.
6. Inspect `GET /v1/datasets/{dataset_id}/schema`; use this cleaned schema, not your original headers, to build conditions.
7. Generate synthetic rows with `POST /v1/datasets/{dataset_id}/generations`; save `generation_id` and poll `job_id` when needed.
8. Fetch `GET /v1/generations/{generation_id}` until status is `ready` or `partial` and `artifact_url` is present. `partial` means the scenario was valid but fewer rows were feasible than requested; usage-billing is counted on generated rows rather than requested rows.
9. Download the artifact. Managed-mode URLs point back to this API and still require the bearer token; enterprise URLs may be absolute customer-storage URLs.

## What's New

- 2026-05-05: response identifier fields such as `dataset_id`, `generation_id`, `job_id`, and `artifact_url` are JSON scalars as documented.
  If you tested against an earlier preview and added client code like `response['dataset_id'][0]`, remove that workaround before continuing.

## Versioning And Changelog Policy

The `/v1/*` contract is stable for production integrations. Simulacra may add optional fields, new enum values, new endpoints, or richer examples without a version bump. Planned breaking changes get at least 30 days' notice or ship on a future `/v2` surface. Security, legal, or emergency reliability fixes may move faster, but should include direct customer communication and a clear rollback or migration path when possible.

Customer-visible contract changes are listed in **What's New** above. If your tooling consumes `/openapi.json`, diff the spec before deploy and treat unknown new fields as forward-compatible.

## Client Integration

`/openapi.json` is the canonical machine-readable contract for generated clients and API tooling. Direct HTTPS clients should follow the examples below and the documented helper patterns for Auth0 token minting, signup polling, one-time credential claims, idempotent retries, async job polling, schema-first conditions, artifact download, and problem-code classification.

Keep generated clients thin: preserve raw response fields, pass through `X-Request-Id` and problem `code` values, and put polling, retry, schema-resolution, and download behavior in a small helper layer owned by your application.

## Common Mistakes To Avoid

- Do not build conditions from original column names. Training can rename columns into identifier-safe form; for example, `purchase_intent` may become `purchase.intent` depending on the cleaning path. Always GET the schema first.
- Cleaning can drop low-signal columns, near-zero-variance columns, and rare categorical levels that are too sparse to model reliably. If a column or level is not in the schema response, do not condition on it.
- `credential_claim_token` is one-time-use. Do not close the response before storing the returned `client_secret` in your secret manager.
- Use an `Idempotency-Key` on every POST retry. Retrying without one can create duplicate work and duplicate usage charges.
- Tight scenarios can produce fewer rows than requested even when quota remains; Simulacra will never return more rows than `row_count`. Usage-billing is counted on generated rows rather than requested rows.
- Do not put bearer tokens, client secrets, artifact URLs, or customer data in chat, browser consoles, notebooks shared with third parties, or application logs.

## Retries, Quotas, And Billing

- `202 Accepted` is normal for dataset training and generation. Poll `/v1/jobs/{job_id}` every 2 seconds for quickstarts; production clients should use jittered exponential backoff with a deadline.
- Retry `POST /v1/datasets` and `POST /v1/datasets/{dataset_id}/generations` only with an `Idempotency-Key`. Reusing the same key makes the retry safe; changing the key creates new work.
- Retry transient `429`, `500`, `502`, `503`, and `504` responses with backoff. Do not retry `400`, `401`, `402`, `403`, or `404` without changing the request.
- `402` means the request exceeds your company's active row subscription or request-cap.
- Per-request row caps are safety limits. Tight scenarios can still produce fewer rows than requested even when quota remains; usage-billing is counted on generated rows rather than requested rows.

## Error Catalog

Problem responses include `type`, `title`, `status`, and `detail`. When the API can classify the failure, the body also includes a stable `code` such as `simio_unknown_condition_column`, and `type` points to `https://api.simulacra-data.com/errors/{code}`. Open that URL for the cause, fix, retryability, and support guidance. Preserve `X-Request-Id` and response-body `request_id` values when contacting support.

Error codes are part of the v1 contract.
New codes may be added; renames or removals require a migration window.

- Catalog index: `/errors`
- Example: `/errors/simio_unknown_condition_column`

## Troubleshooting

- **401 Unauthorized:** mint a fresh Auth0 token and verify the audience is `https://api.simulacra-data.com`. The Authorize panel and the `Authorization: Bearer …` header expect the JWT returned by Auth0 (begins with `eyJ…`), NOT your `client_secret`. See *Authorize The Interactive Panel* below for the exchange. If this happens during upload, reselect the seed file before retrying.
- **400 request body is empty:** set `Content-Type: application/json` for JSON endpoints and use multipart form-data only for dataset uploads.
- **400 unknown condition column or level:** call `/v1/datasets/{dataset_id}/schema` and rebuild conditions from the cleaned schema. Original seed names may have been normalized.
- **202 keeps polling:** keep polling until `completed`, `failed`, `expired`, or `cancelled`; use a deadline and preserve `job_id`.
- **partial generation:** the scenario was feasible only for a subset of the requested rows. Inspect `rows_generated` before using the artifact downstream.
- **404 on copied IDs:** identifier fields are JSON strings. If your client still indexes `[0]`, it may be sending a one-character ID.

## Request Access

API access is request-and-approve. Before any of the credentials in the Quickstart will work, you need an approved tenant.

1. `POST /v1/signup` with your `company_name` and `contact_email`. No login is required for this access request; leave the Authorization field blank. The response includes a `request_id`.
2. `GET /v1/signup/{request_id}` returns `pending`, `approved`, or `declined`. This check is also open because credentials do not exist until the request is approved.
3. Once Simulacra approves, the status response includes a `client_id` and a one-time `credential_claim_token`.
4. `POST /v1/credential-claims` with that token returns the `client_secret` exactly once, plus the Auth0 token URL and audience. Store it in your secret manager immediately.

```sh
RESP=$(curl -sS -X POST "https://api.simulacra-data.com/v1/signup" \
  -H "content-type: application/json" \
  -d '{
    "company_name": "Acme Research",
    "contact_email": "data-science@acme.example"
  }')
# 202 Accepted for a new request, or 200 OK if this email already
# has a pending or approved request. Both shapes include request_id.

# Save request_id; the polling URL needs it verbatim. Valid format
# is ^req_[A-Za-z0-9]{1,64}$ (no dots, dashes, or whitespace).
REQUEST_ID=$(echo "${RESP}" | jq -r .request_id)

curl -sS "https://api.simulacra-data.com/v1/signup/${REQUEST_ID}"
# -> pending until an operator approves; then status flips to
#    "approved" and includes `client_id` plus a one-time
#    `credential_claim_token`.

APPROVAL=$(curl -sS "https://api.simulacra-data.com/v1/signup/${REQUEST_ID}")
CLAIM_TOKEN=$(echo "${APPROVAL}" | jq -r .credential_claim_token)
curl -sS -X POST "https://api.simulacra-data.com/v1/credential-claims" \
  -H "content-type: application/json" \
  -d "$(jq -nc --arg token "${CLAIM_TOKEN}" '{claim_token: $token}')"
# -> returns client_id, client_secret, token_url, audience, grant_type.
#    The claim token is one-time-use; put client_secret in your
#    secret manager, not in source code or chat.
```

If you are testing this from the interactive docs panel below, click **TRY** on `POST /v1/signup`, fill the `company_name` and `contact_email` fields, and **leave the Authorization field blank**; this is the access-request step before credentials exist.

## Authorize The Interactive Panel

The interactive docs authenticate with the **HTTP Bearer** field inside the **AUTHENTICATION** panel. That field expects an Auth0 JWT access token, not your `client_secret`.
JWTs always start with `eyJ` and contain two dots; your `client_secret` has no fixed prefix and is a single high-entropy ~64-character string with no dots. Paste the wrong one and every protected call returns 401.

Click **AUTHENTICATION** in the left navigation. If you already have a JWT, paste it into the HTTP Bearer field. If you only have your `client_id` and `client_secret`, use the *Exchange credentials and fill HTTP Bearer* form in that same panel. It exchanges the credentials server-side and loads the JWT into HTTP Bearer for you. Try-It on protected endpoints then succeeds. The token is valid for ~24 hours; rerun the form when it expires.

If you prefer the command line:

```sh
ACCESS_TOKEN=$(curl -sS -X POST "https://simulacra-data.us.auth0.com/oauth/token" \
  -H "content-type: application/json" \
  -d '{
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "audience": "https://api.simulacra-data.com",
    "grant_type": "client_credentials"
  }' | jq -r .access_token)
echo "$ACCESS_TOKEN"
```

### Rotating or replacing your client_secret

If you still have a working `client_secret` (or a valid bearer token minted from it), use `POST /v1/credential-rotations`. It rotates the secret at Auth0 and returns a one-time `credential_claim_token` you redeem at `POST /v1/credential-claims` for the new `client_secret`. The endpoint is authenticated; the rate limit is three rotations per 24 hours per client.

In the interactive docs, `POST /v1/credential-rotations` gets a per-tab `Idempotency-Key` automatically. If the browser reloads or the response disappears before you redeem the claim token, retry the same operation in that tab; the API returns the same `credential_claim_token` without rotating Auth0 again.

If you have lost the `client_secret` entirely and cannot authenticate, email [support@simulacra-data.com](mailto:support@simulacra-data.com) and reference your `client_id`.
Simulacra will rotate operator-side and deliver the new `credential_claim_token` over a secure channel. Resubmitting the signup form does NOT re-issue a `credential_claim_token` once your initial credential has been claimed.

## Quickstart

Pick the language tab that matches your stack. Each script below is a complete, runnable end-to-end flow: it mints a bearer token, uploads a toy seed dataset, polls until training finishes, fetches the trained schema, generates scenario-conditioned synthetic rows, downloads the artifact, and reads it back. The flow is identical across languages (only the syntax changes), and the conditioning request body is the same JSON structure everywhere.

The scenario examples are intentionally built from normal client-side objects: pandas data frames, R data frames, Julia DataFrames, or a shell variable for SPSS automation. Your HTTP client serializes those objects to JSON; you should not be hand-maintaining JSON files in a production integration.

Always fetch `/v1/datasets/{dataset_id}/schema` before building conditions. The trained schema is the customer-facing contract after cleaning: names may be normalized, columns may be dropped, and rare categorical levels may be removed. The examples below resolve condition columns from the returned schema before submitting the generation request.

All four scripts read these environment variables:

```sh
export SIMIO_CLIENT_ID="..."
export SIMIO_CLIENT_SECRET="..."
export SIMIO_AUTH0_DOMAIN="simulacra-data.us.auth0.com"
export SIMIO_AUTH0_AUDIENCE="https://api.simulacra-data.com"
export SIMIO_API_BASE="https://api.simulacra-data.com"
```

Jump to: [Python](#quickstart-python) · [R](#quickstart-r) · [Julia](#quickstart-julia) · [SPSS](#quickstart-spss)

### Python

Uses `requests` for HTTP, `pandas` + `numpy` for seed/scenario construction, and `pyarrow` (a `pandas` extra) for Parquet I/O.
```python
import os, time, requests
import numpy as np
import pandas as pd

AUTH0_DOMAIN = os.environ["SIMIO_AUTH0_DOMAIN"]
AUTH0_AUDIENCE = os.environ["SIMIO_AUTH0_AUDIENCE"]
CLIENT_ID = os.environ["SIMIO_CLIENT_ID"]
CLIENT_SECRET = os.environ["SIMIO_CLIENT_SECRET"]
API_BASE = os.environ["SIMIO_API_BASE"]

# 1. Mint a bearer token.
token = requests.post(
    f"https://{AUTH0_DOMAIN}/oauth/token",
    json={
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "audience": AUTH0_AUDIENCE,
        "grant_type": "client_credentials",
    },
    timeout=30,
).json()["access_token"]
auth = {"Authorization": f"Bearer {token}"}

# 2. Build a toy seed dataset and write it to seed.csv.
rng = np.random.default_rng(7)
n = 800
df = pd.DataFrame({
    "age": rng.integers(18, 66, n),
    "segment": rng.choice(["Value", "Mainstream", "Premium"], n, p=[0.35, 0.45, 0.20]),
    "channel": rng.choice(["Retail", "Online", "Club"], n, p=[0.50, 0.35, 0.15]),
    "region": rng.choice(["Northeast", "South", "Midwest", "West"], n),
})
intent = 42.0
intent += (df["segment"] == "Premium") * 18
intent += (df["channel"] == "Online") * 8
intent += ((df["age"] - 35) / 3).clip(-8, 8)
intent += rng.normal(0, 10, n)
df["purchase_intent"] = intent.round().clip(0, 100).astype(int)
df.to_csv("seed.csv", index=False)

# Helpers: 202 Accepted is normal for training and generation.
# This quickstart uses a simple 2s poll loop; production clients
# should add jittered exponential backoff and a deadline appropriate
# for their workload.
def poll_job(job_id, field, deadline_seconds=600):
    deadline = time.time() + deadline_seconds
    while time.time() < deadline:
        body = requests.get(f"{API_BASE}/v1/jobs/{job_id}", headers=auth, timeout=30).json()
        if body["status"] in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"job {job_id} {body['status']}: {body}")
        if body["status"] == "completed":
            value = body.get(field)
            if value:
                return value
            raise RuntimeError(f"job {job_id} completed without {field}: {body}")
        time.sleep(2)
    raise TimeoutError(f"job {job_id} did not complete before deadline")

def wait_for_dataset(upload):
    if upload.get("status") == "ready" and upload.get("dataset_id"):
        return upload["dataset_id"]
    return poll_job(upload["job_id"], "dataset_id")

def wait_for_generation(initial):
    generation_id = initial.get("generation_id")
    if initial.get("status") == "processing" and initial.get("job_id"):
        generation_id = poll_job(initial["job_id"], "generation_id")
    if not generation_id:
        raise RuntimeError(f"generation response lacked generation_id: {initial}")
    deadline = time.time() + 600
    while time.time() < deadline:
        meta = requests.get(
            f"{API_BASE}/v1/generations/{generation_id}",
            headers=auth,
            timeout=30,
        ).json()
        if meta.get("status") in ("ready", "partial") and meta.get("artifact_url"):
            return meta
        if meta.get("status") == "failed":
            raise RuntimeError(f"generation failed: {meta}")
        time.sleep(2)
    raise TimeoutError(f"generation {generation_id} did not produce an artifact")

# 3. Upload seed + train. Returns either the dataset_id directly
#    (synchronous) or a 202 with job_id (asynchronous).
with open("seed.csv", "rb") as fh:
    upload = requests.post(
        f"{API_BASE}/v1/datasets",
        headers={**auth, "Idempotency-Key": f"dataset-{int(time.time())}"},
        files={"seed_file": ("seed.csv", fh, "text/csv")},
        data={"display_name": "Quickstart seed", "wait_seconds": "10"},
        timeout=60,
    ).json()
dataset_id = wait_for_dataset(upload)

# 4. Inspect the trained schema. Use the returned column names,
#    levels, and ranges to build conditions; do not guess from the
#    original seed headers.
schema = requests.get(
    f"{API_BASE}/v1/datasets/{dataset_id}/schema",
    headers=auth,
    timeout=30,
).json()
schema_columns = schema.get("schema", [])
schema_by_name = {col["name"]: col for col in schema_columns}

def resolve_column(*candidates):
    for name in candidates:
        if name in schema_by_name:
            return name
    normalized = {name.replace("_", ".").lower(): name for name in schema_by_name}
    for name in candidates:
        hit = normalized.get(name.replace("_", ".").lower())
        if hit:
            return hit
    raise KeyError(
        f"None of {candidates} is in the cleaned schema. Available: {list(schema_by_name)}"
    )

def require_levels(column, levels):
    available = set(schema_by_name[column].get("levels") or [])
    missing = set(levels) - available
    if missing:
        raise KeyError(
            f"Schema column {column!r} is missing levels {sorted(missing)}. "
            f"Available levels: {sorted(available)}"
        )

segment_col = resolve_column("segment")
channel_col = resolve_column("channel")
age_col = resolve_column("age")
intent_col = resolve_column("purchase_intent", "purchase.intent")
require_levels(segment_col, ["Premium", "Mainstream", "Value"])
require_levels(channel_col, ["Online", "Retail", "Club"])

# 5. Define a scenario as ordinary data frames, then convert to the
#    API condition object. Categorical target_share values are desired
#    outcome percentages, subject to feasibility jitter.
categorical_targets = pd.DataFrame([
    {"column": segment_col, "level": "Premium", "target_share": 0.55},
    {"column": segment_col, "level": "Mainstream", "target_share": 0.35},
    {"column": segment_col, "level": "Value", "target_share": 0.10},
    {"column": channel_col, "level": "Online", "target_share": 0.70},
    {"column": channel_col, "level": "Retail", "target_share": 0.20},
    {"column": channel_col, "level": "Club", "target_share": 0.10},
])
numeric_ranges = pd.DataFrame([
    {"column": age_col, "min": 25, "max": 44},
    {"column": intent_col, "min": 70, "max": 100},
])

def categorical_conditions(targets):
    return {
        column: dict(zip(group["level"], group["target_share"]))
        for column, group in targets.groupby("column", sort=False)
    }

def numeric_conditions(ranges):
    return {
        row.column: {"min": row.min, "max": row.max}
        for row in ranges.itertuples(index=False)
    }

scenario = {
    "row_count": 5000,
    "output_format": "parquet",
    "seed": 20260430,
    "wait_seconds": 20,
    "conditions": {
        "categorical": categorical_conditions(categorical_targets),
        "numeric": numeric_conditions(numeric_ranges),
    },
}

# 6. Generate the scenario-conditioned synthetic artifact.
gen = requests.post(
    f"{API_BASE}/v1/datasets/{dataset_id}/generations",
    headers={**auth, "content-type": "application/json",
             "Idempotency-Key": f"generation-{int(time.time())}"},
    json=scenario,
    timeout=60,
).json()
meta = wait_for_generation(gen)

# 7. Download the artifact. artifact_url is absolute for enterprise
#    tenants and relative (managed-mode relay) otherwise.
artifact_url = meta["artifact_url"]
if not artifact_url.startswith("http"):
    artifact_url = f"{API_BASE}{artifact_url}"
art = requests.get(artifact_url, headers=auth, timeout=120)
art.raise_for_status()
with open("synthetic.parquet", "wb") as fh:
    fh.write(art.content)

# 8. Read the result.
synth = pd.read_parquet("synthetic.parquet")
print(len(synth), "rows")
print(synth.head())
```

### R

Uses `httr2` for HTTP and `arrow` for Parquet I/O.
```r
library(httr2)
library(arrow)

AUTH0_DOMAIN <- Sys.getenv("SIMIO_AUTH0_DOMAIN")
AUTH0_AUDIENCE <- Sys.getenv("SIMIO_AUTH0_AUDIENCE")
CLIENT_ID <- Sys.getenv("SIMIO_CLIENT_ID")
CLIENT_SECRET <- Sys.getenv("SIMIO_CLIENT_SECRET")
API_BASE <- Sys.getenv("SIMIO_API_BASE")

# 1. Mint a bearer token.
token <- request(paste0("https://", AUTH0_DOMAIN, "/oauth/token")) |>
  req_method("POST") |>
  req_body_json(list(
    client_id = CLIENT_ID,
    client_secret = CLIENT_SECRET,
    audience = AUTH0_AUDIENCE,
    grant_type = "client_credentials"
  )) |>
  req_perform() |>
  resp_body_json()
TOKEN <- token$access_token
auth <- function(req) req_auth_bearer_token(req, TOKEN)

# 2. Build a toy seed dataset and write it to seed.csv.
set.seed(7); n <- 800
seed <- data.frame(
  age = sample(18:65, n, replace = TRUE),
  segment = sample(c("Value", "Mainstream", "Premium"), n, replace = TRUE,
                   prob = c(0.35, 0.45, 0.20)),
  channel = sample(c("Retail", "Online", "Club"), n, replace = TRUE,
                   prob = c(0.50, 0.35, 0.15)),
  region = sample(c("Northeast", "South", "Midwest", "West"), n, replace = TRUE)
)
intent <- 42 +
  (seed$segment == "Premium") * 18 +
  (seed$channel == "Online") * 8 +
  pmin(pmax((seed$age - 35) / 3, -8), 8) +
  rnorm(n, 0, 10)
seed$purchase_intent <- pmax(pmin(round(intent), 100), 0)
write.csv(seed, "seed.csv", row.names = FALSE)

# Helper: a 202 response is normal for async work. Poll the job
# until it is completed. Production clients should add jittered
# exponential backoff and a deadline appropriate for their workload.
poll_job <- function(job_id, field) {
  deadline <- Sys.time() + 600
  repeat {
    if (Sys.time() > deadline) stop("job ", job_id, " timed out")
    body <- request(paste0(API_BASE, "/v1/jobs/", job_id)) |>
      auth() |>
      req_perform() |>
      resp_body_json()
    if (body$status %in% c("failed", "expired", "cancelled"))
      stop("job ", job_id, " ", body$status)
    if (identical(body$status, "completed")) {
      value <- body[[field]]
      if (!is.null(value) && nzchar(value)) return(value)
      stop("job ", job_id, " completed without ", field)
    }
    Sys.sleep(2)
  }
}

wait_for_generation <- function(initial) {
  generation_id <- initial$generation_id
  if (identical(initial$status, "processing") && !is.null(initial$job_id))
    generation_id <- poll_job(initial$job_id, "generation_id")
  if (is.null(generation_id) || !nzchar(generation_id))
    stop("generation response lacked generation_id")
  deadline <- Sys.time() + 600
  repeat {
    if (Sys.time() > deadline) stop("generation ", generation_id, " timed out")
    meta <- request(paste0(API_BASE, "/v1/generations/", generation_id)) |>
      auth() |>
      req_perform() |>
      resp_body_json()
    if (meta$status %in% c("ready", "partial") &&
        !is.null(meta$artifact_url) && nzchar(meta$artifact_url))
      return(meta)
    if (identical(meta$status, "failed")) stop("generation failed")
    Sys.sleep(2)
  }
}

# 3. Upload seed + train.
upload <- request(paste0(API_BASE, "/v1/datasets")) |>
  req_method("POST") |>
  auth() |>
  req_headers(`Idempotency-Key` = paste0("dataset-", as.integer(Sys.time()))) |>
  req_body_multipart(
    seed_file = curl::form_file("seed.csv", type = "text/csv"),
    display_name = "Quickstart seed",
    wait_seconds = "10"
  ) |>
  req_perform() |>
  resp_body_json()
dataset_id <- if (identical(upload$status, "ready") &&
                  !is.null(upload$dataset_id) && nzchar(upload$dataset_id))
  upload$dataset_id else poll_job(upload$job_id, "dataset_id")

# 4. Inspect the trained schema.
schema <- request(paste0(API_BASE, "/v1/datasets/", dataset_id, "/schema")) |>
  auth() |>
  req_perform() |>
  resp_body_json()
schema_cols <- schema$schema
schema_names <- vapply(schema_cols, `[[`, character(1), "name")

resolve_column <- function(...) {
  candidates <- c(...)
  hit <- candidates[candidates %in% schema_names]
  if (length(hit)) return(hit[[1]])
  normalized <- setNames(schema_names, tolower(gsub("_", ".", schema_names)))
  for (candidate in candidates) {
    key <- tolower(gsub("_", ".", candidate))
    # Single-bracket lookup yields NA for a missing name; `[[` would
    # raise "subscript out of bounds" before the next candidate is tried.
    hit <- unname(normalized[key])
    if (!is.na(hit)) return(hit)
  }
  stop("None of ", paste(candidates, collapse = ", "),
       " is in the cleaned schema. Available: ",
       paste(schema_names, collapse = ", "))
}

require_levels <- function(column, levels) {
  col <- schema_cols[[match(column, schema_names)]]
  available <- if (is.null(col$levels)) character() else unlist(col$levels)
  missing <- setdiff(levels, available)
  if (length(missing))
    stop("Schema column ", column, " is missing levels: ",
         paste(missing, collapse = ", "))
}

segment_col <- resolve_column("segment")
channel_col <- resolve_column("channel")
age_col <- resolve_column("age")
intent_col <- resolve_column("purchase_intent", "purchase.intent")
require_levels(segment_col, c("Premium", "Mainstream", "Value"))
require_levels(channel_col, c("Online", "Retail", "Club"))

# 5. Define a scenario as ordinary data frames, then convert to the
#    API condition object. target_share values are desired outcome
#    percentages, subject to feasibility jitter.
categorical_targets <- data.frame(
  column = c(segment_col, segment_col, segment_col,
             channel_col, channel_col, channel_col),
  level = c("Premium", "Mainstream", "Value", "Online", "Retail", "Club"),
  target_share = c(0.55, 0.35, 0.10, 0.70, 0.20, 0.10)
)
numeric_ranges <- data.frame(
  column = c(age_col, intent_col),
  min = c(25, 70),
  max = c(44, 100)
)
categorical_conditions <- lapply(
  split(categorical_targets, categorical_targets$column),
  function(x) as.list(setNames(x$target_share, x$level))
)
numeric_conditions <- setNames(
  lapply(seq_len(nrow(numeric_ranges)), function(i) {
    list(min = numeric_ranges$min[[i]], max = numeric_ranges$max[[i]])
  }),
  numeric_ranges$column
)
scenario <- list(
  row_count = 5000L,
  output_format = "parquet",
  seed = 20260430L,
  wait_seconds = 20L,
  conditions = list(
    categorical = categorical_conditions,
    numeric = numeric_conditions
  )
)

# 6. Generate the scenario-conditioned synthetic artifact.
gen <- request(paste0(API_BASE, "/v1/datasets/", dataset_id, "/generations")) |>
  req_method("POST") |>
  auth() |>
  req_headers(`Idempotency-Key` = paste0("generation-", as.integer(Sys.time()))) |>
  req_body_json(scenario) |>
  req_perform() |>
  resp_body_json()
meta <- wait_for_generation(gen)

# 7. Download the artifact.
artifact_url <- if (startsWith(meta$artifact_url, "http"))
  meta$artifact_url else paste0(API_BASE, meta$artifact_url)
request(artifact_url) |>
  auth() |>
  req_perform(path = "synthetic.parquet")

# 8. Read the result.
synth <- read_parquet("synthetic.parquet")
cat(nrow(synth), "rows\n")
head(synth)
```

### Julia

Uses `HTTP.jl` + `JSON3` for HTTP, `DataFrames` + `CSV` + `StatsBase` for the seed, and `Parquet2` for the result.

```julia
using HTTP, JSON3, DataFrames, CSV, Random, StatsBase, Parquet2

AUTH0_DOMAIN = ENV["SIMIO_AUTH0_DOMAIN"]
AUTH0_AUDIENCE = ENV["SIMIO_AUTH0_AUDIENCE"]
CLIENT_ID = ENV["SIMIO_CLIENT_ID"]
CLIENT_SECRET = ENV["SIMIO_CLIENT_SECRET"]
API_BASE = ENV["SIMIO_API_BASE"]

# 1. Mint a bearer token.
token_resp = HTTP.post(
    "https://$AUTH0_DOMAIN/oauth/token",
    ["content-type" => "application/json"],
    JSON3.write(Dict(
        "client_id" => CLIENT_ID,
        "client_secret" => CLIENT_SECRET,
        "audience" => AUTH0_AUDIENCE,
        "grant_type" => "client_credentials",
    )),
)
TOKEN = JSON3.read(token_resp.body).access_token
auth() = ["Authorization" => "Bearer $TOKEN"]

# 2. Build a toy seed dataset and write it to seed.csv.
Random.seed!(7)
n = 800
df = DataFrame(
    age = rand(18:65, n),
    segment = sample(["Value", "Mainstream", "Premium"],
                     StatsBase.Weights([0.35, 0.45, 0.20]), n),
    channel = sample(["Retail", "Online", "Club"],
                     StatsBase.Weights([0.50, 0.35, 0.15]), n),
    region = rand(["Northeast", "South", "Midwest", "West"], n),
)
intent = 42.0 .+
    (df.segment .== "Premium") .* 18 .+
    (df.channel .== "Online") .* 8 .+
    clamp.((df.age .- 35) ./ 3, -8, 8) .+
    randn(n) .* 10
df.purchase_intent = clamp.(round.(Int, intent), 0, 100)
CSV.write("seed.csv", df)

# Helper: a 202 response is normal for async work. Poll the job
# until it is completed. Production clients should add jittered
# exponential backoff and a deadline appropriate for their workload.
function poll_job(job_id, field)
    deadline = time() + 600
    while true
        time() > deadline && error("job $job_id timed out")
        r = HTTP.get("$API_BASE/v1/jobs/$job_id", auth())
        body = JSON3.read(r.body)
        body.status in ("failed", "expired", "cancelled") &&
            error("job $job_id $(body.status)")
        if body.status == "completed"
            v = get(body, Symbol(field), nothing)
            v === nothing && error("job $job_id completed without $field")
            return string(v)
        end
        sleep(2)
    end
end

function wait_for_generation(initial)
    generation_id = get(initial, :generation_id, nothing)
    if get(initial, :status, "") == "processing" &&
       get(initial, :job_id, nothing) !== nothing
        generation_id = poll_job(initial.job_id, "generation_id")
    end
    generation_id === nothing && error("generation response lacked generation_id")
    deadline = time() + 600
    while true
        time() > deadline && error("generation $generation_id timed out")
        meta = JSON3.read(HTTP.get(
            "$API_BASE/v1/generations/$generation_id", auth()).body)
        if meta.status in ("ready", "partial") &&
           get(meta, :artifact_url, nothing) !== nothing
            return meta
        end
        meta.status == "failed" && error("generation failed")
        sleep(2)
    end
end

# 3. Upload seed + train.
upload = HTTP.post(
    "$API_BASE/v1/datasets",
    vcat(auth(), ["Idempotency-Key" => "dataset-$(round(Int, time()))"]),
    HTTP.Form(Dict(
        "seed_file" => HTTP.Multipart("seed.csv", open("seed.csv"), "text/csv"),
        "display_name" => "Quickstart seed",
        "wait_seconds" => "10",
    )),
)
upload_body = JSON3.read(upload.body)
dataset_id = get(upload_body, :dataset_id, nothing)
dataset_id = get(upload_body, :status, "") == "ready" && !isnothing(dataset_id) ?
    string(dataset_id) :
    poll_job(upload_body.job_id, "dataset_id")

# 4. Inspect the trained schema.
schema = JSON3.read(HTTP.get(
    "$API_BASE/v1/datasets/$dataset_id/schema", auth()).body)
schema_cols = collect(schema.schema)
schema_names = [String(col.name) for col in schema_cols]

function resolve_column(candidates...)
    # Exact match first, then a case-insensitive match that treats
    # "_" and "." as equivalent.
    for candidate in candidates
        string(candidate) in schema_names && return string(candidate)
    end
    normalized = Dict(lowercase(replace(name, "_" => ".")) => name
                      for name in schema_names)
    for candidate in candidates
        key = lowercase(replace(string(candidate), "_" => "."))
        haskey(normalized, key) && return normalized[key]
    end
    error("None of $(candidates) is in the cleaned schema. Available: $schema_names")
end

function require_levels(column, levels)
    idx = findfirst(==(column), schema_names)
    raw_levels = hasproperty(schema_cols[idx], :levels) ?
        schema_cols[idx].levels : String[]
    available = Set(string.(raw_levels))
    missing = setdiff(Set(levels), available)
    !isempty(missing) && error("Schema column $column is missing levels: $missing")
end

segment_col = resolve_column("segment")
channel_col = resolve_column("channel")
age_col = resolve_column("age")
intent_col = resolve_column("purchase_intent", "purchase.intent")
require_levels(segment_col, ["Premium", "Mainstream", "Value"])
require_levels(channel_col, ["Online", "Retail", "Club"])

# 5. Define a scenario as ordinary DataFrames, then convert to the
#    API condition object. target_share values are desired outcome
#    percentages, subject to feasibility jitter.
categorical_targets = DataFrame(
    column = [segment_col, segment_col, segment_col,
              channel_col, channel_col, channel_col],
    level = ["Premium", "Mainstream", "Value", "Online", "Retail", "Club"],
    target_share = [0.55, 0.35, 0.10, 0.70, 0.20, 0.10],
)
numeric_ranges = DataFrame(
    column = [age_col, intent_col],
    min = [25, 70],
    max = [44, 100],
)

function categorical_conditions(targets)
    Dict(
        col => Dict(row.level => row.target_share
                    for row in eachrow(targets[targets.column .== col, :]))
        for col in unique(targets.column)
    )
end

numeric_conditions = Dict(
    row.column => Dict("min" => row.min, "max" => row.max)
    for row in eachrow(numeric_ranges)
)

scenario = Dict(
    "row_count" => 5000,
    "output_format" => "parquet",
    "seed" => 20260430,
    "wait_seconds" => 20,
    "conditions" => Dict(
        "categorical" => categorical_conditions(categorical_targets),
        "numeric" => numeric_conditions,
    ),
)

# 6. Generate the scenario-conditioned synthetic artifact.
gen = HTTP.post(
    "$API_BASE/v1/datasets/$dataset_id/generations",
    vcat(auth(), [
        "content-type" => "application/json",
        "Idempotency-Key" => "generation-$(round(Int, time()))",
    ]),
    JSON3.write(scenario),
)
gen_body = JSON3.read(gen.body)
meta = wait_for_generation(gen_body)

# 7. Download the artifact.
artifact_url = startswith(meta.artifact_url, "http") ?
    String(meta.artifact_url) :
    "$API_BASE$(meta.artifact_url)"
open("synthetic.parquet", "w") do io
    write(io, HTTP.get(artifact_url, auth()).body)
end

# 8. Read the result.
synth = DataFrame(Parquet2.Dataset("synthetic.parquet"))
println(nrow(synth), " rows")
first(synth, 5)
```

### SPSS

SPSS does not have a native HTTP client, so the realistic flow is to drive the API from a small shell wrapper and then read the result with SPSS syntax. SPSS users typically already have survey data they want to use as a seed instead of synthesizing one; `seed_file` accepts CSV, Parquet, Excel, and SAV directly.

Requires `curl` and `jq` (see [jq install docs](https://jqlang.org/download/)) on the machine that runs the shell script. The shell wrapper resolves cleaned column names client-side; the API also validates submitted columns and levels against the cleaned schema and returns a 400 with a schema hint if they do not match.

```sh
# Run with bash: the SECONDS timer used below is not part of POSIX sh.

# 1. Mint a bearer token.
TOKEN=$(curl -sS -X POST "https://${SIMIO_AUTH0_DOMAIN}/oauth/token" \
  -H "content-type: application/json" \
  -d "{
    \"client_id\":\"${SIMIO_CLIENT_ID}\",
    \"client_secret\":\"${SIMIO_CLIENT_SECRET}\",
    \"audience\":\"${SIMIO_AUTH0_AUDIENCE}\",
    \"grant_type\":\"client_credentials\"
  }" | jq -r ".access_token")
AUTH="Authorization: Bearer ${TOKEN}"

# Helper: poll a 202 job until it completes. Production clients
# should add jittered exponential backoff and a deadline matched
# to their workload.
poll_job_field() {
  job_id="$1"; field="$2"; deadline=$((SECONDS + 600))
  while [ "${SECONDS}" -lt "${deadline}" ]; do
    sleep 2
    job=$(curl -sS -H "${AUTH}" "${SIMIO_API_BASE}/v1/jobs/${job_id}")
    status=$(echo "${job}" | jq -r ".status // empty")
    if [ "${status}" = "failed" ] || [ "${status}" = "expired" ] || [ "${status}" = "cancelled" ]; then
      echo "${job}" >&2; exit 1
    fi
    if [ "${status}" = "completed" ]; then
      value=$(echo "${job}" | jq -r ".${field} // empty")
      if [ -n "${value}" ]; then printf "%s" "${value}"; return 0; fi
      echo "${job}" >&2; echo "job completed without ${field}" >&2; exit 1
    fi
  done
  echo "job ${job_id} timed out" >&2; exit 1
}

wait_generation_artifact() {
  generation_id="$1"; deadline=$((SECONDS + 600))
  while [ "${SECONDS}" -lt "${deadline}" ]; do
    meta=$(curl -sS -H "${AUTH}" "${SIMIO_API_BASE}/v1/generations/${generation_id}")
    status=$(echo "${meta}" | jq -r ".status // empty")
    artifact=$(echo "${meta}" | jq -r ".artifact_url // empty")
    if { [ "${status}" = "ready" ] || [ "${status}" = "partial" ]; } && [ -n "${artifact}" ]; then
      printf "%s" "${meta}"; return 0
    fi
    if [ "${status}" = "failed" ]; then echo "${meta}" >&2; exit 1; fi
    sleep 2
  done
  echo "generation ${generation_id} timed out" >&2; exit 1
}

# 3. Upload seed + train. Use your existing SAV/CSV file here; we use
#    survey.sav as an example.
UPLOAD=$(curl -sS -X POST "${SIMIO_API_BASE}/v1/datasets" \
  -H "${AUTH}" -H "Idempotency-Key: dataset-$(date +%s)" \
  -F "seed_file=@./survey.sav;type=application/x-spss-sav" \
  -F "display_name=SPSS quickstart seed" \
  -F "wait_seconds=10")
DATASET_ID=$(echo "${UPLOAD}" | jq -r ".dataset_id // empty")
JOB_ID=$(echo "${UPLOAD}" | jq -r ".job_id // empty")
if [ -z "${DATASET_ID}" ]; then
  # The helper runs in a subshell, so propagate its failure explicitly.
  DATASET_ID=$(poll_job_field "${JOB_ID}" dataset_id) || exit 1
fi

# 4. Inspect the trained schema. Use these cleaned names in
#    conditions; do not assume original seed headers survived.
SCHEMA=$(curl -sS -H "${AUTH}" \
  "${SIMIO_API_BASE}/v1/datasets/${DATASET_ID}/schema")
echo "${SCHEMA}" | jq .

resolve_col() {
  a="$1"; b="${2:-$1}"
  echo "${SCHEMA}" | jq -r --arg a "${a}" --arg b "${b}" \
    '.schema[] | select(.name == $a or .name == $b) | .name' | head -n 1
}

SEGMENT_COL=$(resolve_col segment)
CHANNEL_COL=$(resolve_col channel)
AGE_COL=$(resolve_col age)
INTENT_COL=$(resolve_col purchase_intent purchase.intent)
if [ -z "${SEGMENT_COL}" ] || [ -z "${CHANNEL_COL}" ] || [ -z "${AGE_COL}" ] || [ -z "${INTENT_COL}" ]; then
  echo "Expected quickstart columns were not all retained after cleaning" >&2; exit 1
fi

# 5. Build the scenario request in a shell variable. Categorical
#    values are desired outcome percentages. Use output_format=csv
#    so the artifact lands in a format SPSS reads directly.
SCENARIO=$(jq -cn --arg segment "${SEGMENT_COL}" --arg channel "${CHANNEL_COL}" \
  --arg age "${AGE_COL}" --arg intent "${INTENT_COL}" '{
    row_count: 5000,
    output_format: "csv",
    seed: 20260430,
    wait_seconds: 20,
    conditions: {
      categorical: {
        ($segment): {Premium: 0.55, Mainstream: 0.35, Value: 0.10},
        ($channel): {Online: 0.70, Retail: 0.20, Club: 0.10}
      },
      numeric: {
        ($age): {min: 25, max: 44},
        ($intent): {min: 70, max: 100}
      }
    }
  }')

GEN=$(curl -sS -X POST "${SIMIO_API_BASE}/v1/datasets/${DATASET_ID}/generations" \
  -H "${AUTH}" -H "content-type: application/json" \
  -H "Idempotency-Key: generation-$(date +%s)" \
  --data "${SCENARIO}")
GENERATION_ID=$(echo "${GEN}" | jq -r ".generation_id // empty")
GEN_JOB_ID=$(echo "${GEN}" | jq -r ".job_id // empty")
GEN_STATUS=$(echo "${GEN}" | jq -r ".status // empty")
if [ "${GEN_STATUS}" = "processing" ] && [ -n "${GEN_JOB_ID}" ]; then
  # The helper runs in a subshell, so propagate its failure explicitly.
  GENERATION_ID=$(poll_job_field "${GEN_JOB_ID}" generation_id) || exit 1
fi

# 6. Download the artifact.
GEN_META=$(wait_generation_artifact "${GENERATION_ID}") || exit 1
ARTIFACT_URL=$(echo "${GEN_META}" | jq -r ".artifact_url")
case "${ARTIFACT_URL}" in
  http*) curl -sS -L "${ARTIFACT_URL}" -o synthetic.csv ;;
  *) curl -sS -L -H "${AUTH}" \
       "${SIMIO_API_BASE}${ARTIFACT_URL}" -o synthetic.csv ;;
esac
```

Then read the result in SPSS, adjusting the variable list to match your seed's column names and types:

```spss
GET DATA
  /TYPE=TXT
  /FILE='synthetic.csv'
  /ENCODING='UTF8'
  /DELCASE=LINE
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /IMPORTCASE=ALL
  /VARIABLES=
    age F8.0
    segment A20
    channel A20
    region A20
    purchase_intent F8.0.
CACHE.
EXECUTE.
```

## Conditioning Reference

- Categorical values under `conditions.categorical` are desired outcome percentages: `0.70` means a 70% target share for that level, subject to feasibility jitter.
- Numeric ranges under `conditions.numeric` are bilateral: include both `min` and `max`.
- Always derive column names and levels from `/v1/datasets/{dataset_id}/schema`.
  Do not guess from memory or prompt context.
- Tight scenarios can produce fewer rows than requested even when quota remains; usage-billing is counted on generated rows rather than requested rows.
- Send an `Idempotency-Key` header on every `POST /v1/datasets` and `POST /v1/datasets/{id}/generations` so retries are safe.

## Notes For Coding Agents

Agents should start at `/llms.txt`, use `/openapi.json` as the canonical contract, and fetch `/v1/datasets/{dataset_id}/schema` before constructing conditions. Do not infer private columns, dataset IDs, or credential values from examples. Preserve `X-Request-Id` and any response-body `request_id` in logs for support, but never log bearer tokens or client secrets.

## Client Operating Rules

- Treat `202 Accepted` as normal; poll `/v1/jobs/{job_id}`.
- Store `X-Request-Id` and any response-body `request_id` for support.
- Use `/v1/datasets/{dataset_id}/schema` before conditional generation.
- Tight scenarios can produce fewer rows than requested even when quota remains; usage-billing is counted on generated rows rather than requested rows.
- Do not put client secrets in browsers, shared notebooks, or logs.
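
The schema-first rule above can also be applied as a client-side pre-flight check before a generation request is sent. The following is a minimal Python sketch, not part of the API: `validate_conditions` is a hypothetical helper, and the schema shape it expects (a top-level `schema` list of columns with `name` and optional `levels`) is inferred from the quickstart code rather than from `/openapi.json`. Treating shares as fractions between 0 and 1 is likewise an assumption based on the `0.70` example; the server remains the authority and may reject requests this check accepts.

```python
from typing import Any


def validate_conditions(schema: dict[str, Any], conditions: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means nothing obvious is wrong.

    Hypothetical helper. Assumes schema["schema"] is a list of column
    objects with a "name" and, for categorical columns, a "levels" list.
    """
    columns = {col["name"]: col for col in schema.get("schema", [])}
    problems: list[str] = []

    for column, shares in conditions.get("categorical", {}).items():
        if column not in columns:
            problems.append(f"unknown categorical column: {column}")
            continue
        levels = set(columns[column].get("levels") or [])
        for level, share in shares.items():
            if level not in levels:
                problems.append(f"{column}: unknown level {level!r}")
            # Assumption: target shares are fractions, e.g. 0.70 for 70%.
            if not 0.0 <= share <= 1.0:
                problems.append(f"{column}.{level}: share {share} is outside 0..1")

    for column, bounds in conditions.get("numeric", {}).items():
        if column not in columns:
            problems.append(f"unknown numeric column: {column}")
            continue
        # Numeric conditions are bilateral: both bounds are required.
        if "min" not in bounds or "max" not in bounds:
            problems.append(f"{column}: numeric conditions need both min and max")
        elif bounds["min"] > bounds["max"]:
            problems.append(f"{column}: min exceeds max")

    return problems


# Toy schema and conditions for illustration only.
toy_schema = {
    "schema": [
        {"name": "segment", "levels": ["Premium", "Mainstream", "Value"]},
        {"name": "age"},
    ]
}
good = {
    "categorical": {"segment": {"Premium": 0.55, "Mainstream": 0.35, "Value": 0.10}},
    "numeric": {"age": {"min": 25, "max": 44}},
}
bad = {
    "categorical": {"segment": {"Luxury": 0.5}},
    "numeric": {"age": {"min": 25}},
}

print(validate_conditions(toy_schema, good))
print(validate_conditions(toy_schema, bad))
```

Returning a list of problems instead of raising on the first one lets an agent report every mismatch in a single pass before spending a request and an `Idempotency-Key` on a payload the API would answer with a 400.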