# Trainer

> createTrainer: required fields, dataset sources, LoRA settings, and the start / wait / cancel lifecycle.

`createTrainer` is the heart of an Arkor project. Everything Arkor knows about a fine-tuning run, from the base model down to the LoRA rank, lives on the object you pass in.

## Minimal example

```ts theme={null}
import { createTrainer } from "arkor";

export const trainer = createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
});
```

That alone is a valid trainer. Defaults are filled in for everything else.

## Required fields

| Field     | Type            | Notes                                                                       |
| --------- | --------------- | --------------------------------------------------------------------------- |
| `name`    | `string`        | Job name shown in Studio and the managed backend.                           |
| `model`   | `string`        | A model identifier the backend recognizes. See [Supported models](/models). |
| `dataset` | `DatasetSource` | See [Dataset sources](#dataset-sources).                                    |

## Dataset sources

`DatasetSource` is a discriminated union on `type`:

```ts theme={null}
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob";        url: string;  token?: string };
```

* **Hugging Face**. The most common shape. Arkor pulls the dataset by name. Use `split` to override the default split and `subset` for datasets that publish multiple subsets.
* **Blob URL**. Any HTTPS URL the backend can fetch. Pass `token` if the backend needs auth to retrieve the URL. The value is forwarded to the cloud-api and used during the dataset fetch; the exact wire format is backend-defined and not part of the SDK contract.
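
Because the union is discriminated on `type`, TypeScript narrows the available fields inside each branch. The sketch below redeclares the union locally so it runs without the SDK. `describeDataset`, the `"train"` split name, the example URL, and the token placeholder are all illustrative assumptions, not part of Arkor.

```ts
// Local copy of the union shown above, so this sketch runs without the SDK.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

// Hypothetical helper (not part of arkor): summarize a source for logging.
function describeDataset(source: DatasetSource): string {
  switch (source.type) {
    case "huggingface": {
      // In this branch only `name`, `split`, and `subset` are available.
      const subset = source.subset ? `/${source.subset}` : "";
      const split = source.split ? ` [${source.split}]` : "";
      return `huggingface:${source.name}${subset}${split}`;
    }
    case "blob":
      // Report only whether a token is set, never the token itself.
      return `blob:${source.url}${source.token ? " (authenticated)" : ""}`;
  }
}

const fromHub: DatasetSource = {
  type: "huggingface",
  name: "arkorlab/triage-demo",
  split: "train", // assumed split name, for illustration only
};

const fromBlob: DatasetSource = {
  type: "blob",
  url: "https://example.com/data/train.jsonl", // placeholder URL
  token: "<your-token>", // placeholder; load real secrets from your environment
};
```

Either value can be passed as the `dataset` field of `createTrainer`.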

## LoRA configuration

Pass `lora` to control LoRA settings. The object has four typed fields: `r` and `alpha` are required once you pass `lora`, and the other two are optional.

```ts theme={null}
lora?: {
  r: number;             // LoRA rank
  alpha: number;         // LoRA alpha
  maxLength?: number;    // Maximum sequence length
  loadIn4bit?: boolean;  // QLoRA quantization
}
```

If you omit `lora`, the backend applies its own defaults. If you set it yourself, `r: 16, alpha: 16` is a good starting point for the bundled templates.
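
As a rough sketch, a fully specified `lora` object might look like the one below. The type is redeclared locally so the snippet stands alone. `checkLora` is a hypothetical local guard rather than an Arkor API, and `maxLength: 2048` is only an illustrative value. Whether the backend performs similar validation is not covered here.

```ts
// Local mirror of the `lora` shape shown above, so the sketch type-checks on its own.
type LoraConfig = {
  r: number;
  alpha: number;
  maxLength?: number;
  loadIn4bit?: boolean;
};

// Hypothetical guard (not part of arkor): catch obviously invalid values
// before a job is submitted and GPU time is spent.
function checkLora(lora: LoraConfig): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(lora.r) || lora.r <= 0) {
    problems.push("r must be a positive integer");
  }
  if (!(lora.alpha > 0)) {
    problems.push("alpha must be a positive number");
  }
  if (
    lora.maxLength !== undefined &&
    (!Number.isInteger(lora.maxLength) || lora.maxLength <= 0)
  ) {
    problems.push("maxLength must be a positive integer");
  }
  return problems;
}

// The starting point suggested above, plus the two optional fields.
const lora: LoraConfig = {
  r: 16,
  alpha: 16,
  maxLength: 2048, // illustrative value
  loadIn4bit: true,
};
```

If `checkLora(lora)` returns an empty array, pass the object as the `lora` field of `createTrainer`.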

## Common hyperparameters

| Field             | Type     | What it does                                            |
| ----------------- | -------- | ------------------------------------------------------- |
| `maxSteps`        | `number` | Cap on training steps. Often the simplest knob to turn. |
| `numTrainEpochs`  | `number` | Alternative to `maxSteps`: number of dataset passes.    |
| `learningRate`    | `number` | Step size for the optimizer.                            |
| `batchSize`       | `number` | Per-device training batch size.                         |
| `optim`           | `string` | Optimizer name (the backend list governs valid values). |
| `lrSchedulerType` | `string` | LR schedule (linear, cosine, etc.).                     |
| `weightDecay`     | `number` | Regularization weight.                                  |

If you only set `maxSteps`, the rest stay at backend defaults. That is usually what you want for the first few runs.
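
The snippet below sketches two hyperparameter sets: a short first run capped by `maxSteps`, and a longer run expressed in epochs. The type is a local mirror of the table above. This page does not say which field wins when both `maxSteps` and `numTrainEpochs` are set, so the hypothetical `checkRunLength` helper simply treats that combination as a mistake. The numeric values are illustrative, not recommendations.

```ts
// Local mirror of the fields in the table above; not imported from the SDK.
type Hyperparameters = {
  maxSteps?: number;
  numTrainEpochs?: number;
  learningRate?: number;
  batchSize?: number;
  optim?: string;
  lrSchedulerType?: string;
  weightDecay?: number;
};

// Hypothetical pre-flight check (not part of arkor): flag configs that set
// both run-length knobs, since the precedence is not documented here.
function checkRunLength(h: Hyperparameters): string | null {
  if (h.maxSteps !== undefined && h.numTrainEpochs !== undefined) {
    return "set either maxSteps or numTrainEpochs, not both";
  }
  return null;
}

// First run: cap the steps and leave everything else at backend defaults.
const firstRun: Hyperparameters = { maxSteps: 50 };

// Later run: measure length in epochs and override a few knobs (illustrative values).
const longerRun: Hyperparameters = {
  numTrainEpochs: 2,
  learningRate: 2e-4,
  lrSchedulerType: "cosine",
};
```

Spread either object into the `createTrainer` call alongside the required fields.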

## Smoke testing with `dryRun`

```ts theme={null}
createTrainer({
  name: "smoke",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  dryRun: true,
});
```

`dryRun: true` tells the backend to run a minimal end-to-end smoke test of the trainer: the full pipeline (including the training loop) executes against a truncated dataset or a capped step count so it finishes quickly. It still uses GPU time, just much less of it. This is useful in CI or when wiring up callbacks for the first time.

## What `createTrainer` returns

```ts theme={null}
interface Trainer {
  readonly name: string;
  start(): Promise<{ jobId: string }>;
  wait(): Promise<TrainingResult>;
  cancel(): Promise<void>;
}
```

* **`start()`** submits the job to the managed backend and resolves with the assigned `jobId`. It does not wait for completion, and it does **not** dispatch any callbacks on its own.
* **`wait()`** opens the SSE event stream for the run and returns once the run finishes (or fails). All registered callbacks fire from inside `wait()`; if you call `start()` without later calling `wait()`, no callbacks ever run.
* **`cancel()`** asks the backend to stop the run. This is a best-effort request: the backend may return an error if the run is already in a terminal state (completed, failed, or already cancelled), so be prepared to catch.

`arkor start` calls `start()` and `wait()` for you (it is what Studio's "Run training" button spawns under the hood). `arkor dev` does **not** run the trainer; it only boots the Studio UI. Call `start()` and `wait()` directly only if you wire training into your own code outside the CLI.
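
If you do drive the lifecycle yourself, the flow could be sketched as follows. `TrainerLike` is a local structural copy of the interface above with `TrainingResult` left opaque. `runToCompletion` and `cancelQuietly` are hypothetical helpers, not part of the SDK. Any object with the same three methods, including the value returned by `createTrainer`, should fit.

```ts
// Structural copy of the Trainer interface above; the result type stays opaque.
interface TrainerLike<R = unknown> {
  readonly name: string;
  start(): Promise<{ jobId: string }>;
  wait(): Promise<R>;
  cancel(): Promise<void>;
}

// Hypothetical helper: submit the job, then stream events until the run ends.
async function runToCompletion<R>(
  trainer: TrainerLike<R>,
): Promise<{ jobId: string; result: R }> {
  // Resolves as soon as the job is submitted; no callbacks fire yet.
  const { jobId } = await trainer.start();
  // Callbacks are dispatched while this promise is pending.
  const result = await trainer.wait();
  return { jobId, result };
}

// Hypothetical helper: cancel() is best-effort and may reject when the run is
// already in a terminal state, so report the outcome instead of throwing.
async function cancelQuietly(trainer: TrainerLike): Promise<boolean> {
  try {
    await trainer.cancel();
    return true;
  } catch {
    return false;
  }
}
```

Keeping `start()` and `wait()` as separate awaits lets you log or persist the `jobId` before the long-running wait begins.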

## Stopping `wait()` with `AbortSignal`

```ts theme={null}
const controller = new AbortController();
const trainer = createTrainer({
  name: "cancellable",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  abortSignal: controller.signal,
});

// Later, from anywhere:
controller.abort();
```

`abortSignal` is **only** about your local `wait()`: aborting it stops the SSE event-stream fetch and any retry / backoff delays inside `wait()`. The current implementation throws on abort (the backoff `delay` rejects with `signal.reason`, and the failure handler re-throws when the signal is aborted), so `wait()` rejects rather than resolving cleanly. Wrap it in `try / catch` if you abort:

```ts theme={null}
try {
  await trainer.wait();
} catch (err) {
  if (controller.signal.aborted) {
    // expected: we asked wait() to stop
  } else {
    throw err;
  }
}
```

`abortSignal` does **not** call `cancel()` and does **not** ask the backend to stop the run; the job keeps using GPU time on the managed side. If you want to actually stop training (and the cost), call `trainer.cancel()` separately:

```ts theme={null}
controller.abort();          // local wait() rejects
await trainer.cancel();      // asks the backend to stop the job
```

Use `abortSignal` for "I no longer care about waiting on this run" (a request timed out, a parent process is exiting). Use `cancel()` for "stop the run on the backend".
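
The two mechanisms can be combined into a timeout that stops both the local wait and the backend run. The sketch below assumes `controller.signal` was passed as `abortSignal` when the trainer was created. `waitOrCancel` and the `WaitableTrainer` type are hypothetical names introduced for illustration.

```ts
// Minimal structural type: only the two methods this helper needs.
interface WaitableTrainer {
  wait(): Promise<unknown>;
  cancel(): Promise<void>;
}

// Hypothetical helper: give up waiting after `timeoutMs`, then ask the
// backend to stop the run so it no longer consumes GPU time.
async function waitOrCancel(
  trainer: WaitableTrainer,
  controller: AbortController,
  timeoutMs: number,
): Promise<"finished" | "cancelled"> {
  // Assumes controller.signal was passed to createTrainer as `abortSignal`.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await trainer.wait();
    return "finished";
  } catch (err) {
    // Anything other than our own abort is a real failure.
    if (!controller.signal.aborted) throw err;
    try {
      // Aborting only stopped the local wait; cancel() stops the job itself.
      await trainer.cancel();
    } catch {
      // Best-effort: the run may already be in a terminal state.
    }
    return "cancelled";
  } finally {
    clearTimeout(timer);
  }
}
```

Swallowing the `cancel()` error keeps the helper simple; log it instead if you need to know why cancellation was refused.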

## Reacting to events

The whole point of doing this in TypeScript is that you can hook into the run with [lifecycle callbacks](/concepts/lifecycle): `onStarted`, `onLog`, `onCheckpoint`, `onCompleted`, `onFailed`. That is the next concept to read.
