# aptdata – Full Documentation

> Consolidated reference for AI agents and language models.

---

## 1. Overview

aptdata is a declarative, extensible Python framework for building smart
data pipelines. It targets Python 3.10+ and relies on Pydantic for runtime
validation and OpenTelemetry for automatic observability.

The framework enforces a two-layer contract system:

1. **Layer 1 – Interfaces (`I*`):** Pure `@dataclass + ABC` contracts that
   declare abstract methods with no fields or implementation.
2. **Layer 2 – Base classes (`Base*`):** `@pydantic_dataclass` implementations
   that add validated fields but keep abstract methods for users to implement.
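
The two layers can be pictured with a minimal sketch. It uses only the standard library so that it runs anywhere: `IGreeter`, `BaseGreeter`, and `LoudGreeter` are hypothetical names, and `__post_init__` stands in for the validation that a pydantic dataclass would perform in aptdata.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Layer 1: an interface that declares abstract methods and no fields.
@dataclass
class IGreeter(ABC):  # hypothetical name, not part of aptdata
    @abstractmethod
    def greet(self) -> str: ...


# Layer 2: a base class that adds a validated field but stays abstract.
# aptdata uses pydantic dataclasses here; __post_init__ is a stdlib stand-in.
@dataclass
class BaseGreeter(IGreeter):
    name: str = "world"

    def __post_init__(self) -> None:
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")


# Concrete implementation supplied by the user.
@dataclass
class LoudGreeter(BaseGreeter):
    def greet(self) -> str:
        return f"HELLO, {self.name.upper()}!"


greeting = LoudGreeter("ada").greet()
```

Instantiating `BaseGreeter` directly fails with `TypeError` because `greet()` is still abstract, which is the property the base-class layer is meant to preserve.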

Three universal abstractions form the core: **Component**, **Flow**, and
**System**, plus the foundational **Dataset** type.

---

## 2. Architecture

The two contract layers map onto four core types:

- Layer 1 (Interfaces): `IDataset`, `IComponent`, `IFlow`, `ISystem` (pure `@dataclass` + `ABC`)
- Layer 2 (Base classes): `BaseDataset`, `BaseComponent`, `BaseFlow`, `BaseSystem` (`@pydantic_dataclass`)

Hierarchy: Interfaces → Base classes → your concrete implementations

### ComponentMeta & ComponentKind

Every `BaseComponent` carries a `ComponentMeta` with:
- `kind` (`ComponentKind` enum: TRANSFORM, FILTER, AGGREGATE, EXTRACT, LOAD, CUSTOM)
- `tags` (list of strings)
- `branch_on` (field name for conditional routing)
- `description` and `extra` metadata
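
As a rough picture of the metadata shape, the sketch below models the listed fields with stdlib types. The field and enum member names come from the list above; the enum values, field types, and defaults are assumptions and may differ from the real `ComponentMeta`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class ComponentKind(Enum):
    # Member names are from the docs; the string values are assumed.
    TRANSFORM = "transform"
    FILTER = "filter"
    AGGREGATE = "aggregate"
    EXTRACT = "extract"
    LOAD = "load"
    CUSTOM = "custom"


@dataclass
class ComponentMeta:
    # Types and defaults below are assumptions for illustration.
    kind: ComponentKind = ComponentKind.CUSTOM
    tags: list[str] = field(default_factory=list)
    branch_on: Optional[str] = None  # field name used for conditional routing
    description: str = ""
    extra: dict[str, Any] = field(default_factory=dict)


meta = ComponentMeta(
    kind=ComponentKind.FILTER,
    tags=["orders", "quality"],
    branch_on="is_valid",
    description="Drop rows that fail validation",
)
```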

### Flow Graph Primitives

- `FlowEdge(source_id, target_id, condition?)` — directed edge
- `FlowNode(component, flow?)` — node wrapper with back-reference
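
To illustrate how an optional edge condition could drive routing, here is a small stand-in for `FlowEdge`. Treating the condition as a callable over a record is an assumption, and `next_targets` is a hypothetical helper, not an aptdata function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Condition = Callable[[dict[str, Any]], bool]


@dataclass
class FlowEdge:
    source_id: str
    target_id: str
    condition: Optional[Condition] = None  # None means "always follow"


def next_targets(
    edges: list[FlowEdge], source_id: str, record: dict[str, Any]
) -> list[str]:
    """Return targets of edges leaving source_id whose condition accepts record."""
    return [
        edge.target_id
        for edge in edges
        if edge.source_id == source_id
        and (edge.condition is None or edge.condition(record))
    ]


edges = [
    FlowEdge("validate", "load", condition=lambda r: r["is_valid"]),
    FlowEdge("validate", "quarantine", condition=lambda r: not r["is_valid"]),
    FlowEdge("validate", "audit"),  # unconditional edge
]
valid_route = next_targets(edges, "validate", {"is_valid": True})
invalid_route = next_targets(edges, "validate", {"is_valid": False})
```

A valid record is routed to `load` and `audit`, an invalid one to `quarantine` and `audit`.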

### Plugin Registry

```python
from aptdata.plugins import registry
registry.register("my_system", MySystem)
system_cls = registry.get("my_system")
```
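
For intuition, a name-to-class registry with the same `register` and `get` calls can be sketched in a few lines. This is not aptdata's implementation: the `Registry` class, its duplicate check, and the `names()` helper are assumptions.

```python
class Registry:
    """Hypothetical name-to-class registry; not aptdata's actual implementation."""

    def __init__(self) -> None:
        self._items: dict[str, type] = {}

    def register(self, name: str, cls: type) -> None:
        # Rejecting duplicates is an assumed policy.
        if name in self._items:
            raise ValueError(f"{name!r} is already registered")
        self._items[name] = cls

    def get(self, name: str) -> type:
        try:
            return self._items[name]
        except KeyError:
            raise KeyError(f"no system registered under {name!r}") from None

    def names(self) -> list[str]:
        return sorted(self._items)


class MySystem:  # placeholder class for the example
    pass


registry = Registry()
registry.register("my_system", MySystem)
system_cls = registry.get("my_system")
```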

---

## 3. Getting Started

### Installation

```bash
git clone https://github.com/strondata/smart-data.git
cd smart-data
poetry install
```

### Verify

```bash
aptdata --help
```

### Building a System

1. Create a `BaseDataset` subclass (implement `read()`, `write()`)
2. Create a `BaseComponent` subclass (implement `validate_inputs()`, `execute()`)
3. Create a `BaseFlow` subclass (implement `add_component()`, `connect()`, `compile()`, `run()`)
4. Create a `BaseSystem` subclass (implement `register_flow()`, `run()`)
5. Register the system: `registry.register("name", SystemClass)`
6. Run: `aptdata run name`
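
Steps 1 to 4 can be sketched without the framework installed. The classes below are plain Python stand-ins that only mirror the method names listed above; in a real project they would subclass `BaseDataset`, `BaseComponent`, `BaseFlow`, and `BaseSystem`. All signatures, and the restriction to a single linear chain, are assumptions.

```python
from typing import Any, Optional

Rows = list[dict[str, Any]]


# Step 1: a dataset that keeps its rows in memory.
class MemoryDataset:
    def __init__(self, rows: Optional[Rows] = None) -> None:
        self.rows: Rows = list(rows or [])

    def read(self) -> Rows:
        return list(self.rows)

    def write(self, rows: Rows) -> None:
        self.rows = list(rows)


# Step 2: a component that doubles the "value" field of every row.
class DoubleValue:
    def validate_inputs(self, rows: Rows) -> None:
        if any("value" not in row for row in rows):
            raise ValueError("every row needs a 'value' field")

    def execute(self, rows: Rows) -> Rows:
        return [{**row, "value": row["value"] * 2} for row in rows]


# Step 3: a flow restricted to a single acyclic chain of components.
class LinearFlow:
    def __init__(self, source: MemoryDataset, sink: MemoryDataset) -> None:
        self.source, self.sink = source, sink
        self.components: dict[str, DoubleValue] = {}
        self.next_id: dict[str, str] = {}
        self.order: list[str] = []

    def add_component(self, component_id: str, component: DoubleValue) -> None:
        self.components[component_id] = component

    def connect(self, source_id: str, target_id: str) -> None:
        self.next_id[source_id] = target_id

    def compile(self) -> None:
        # Start at the first component with no incoming edge, then follow edges.
        targets = set(self.next_id.values())
        current = next(c for c in self.components if c not in targets)
        self.order = [current]
        while current in self.next_id:
            current = self.next_id[current]
            self.order.append(current)

    def run(self) -> None:
        rows = self.source.read()
        for component_id in self.order:
            component = self.components[component_id]
            component.validate_inputs(rows)
            rows = component.execute(rows)
        self.sink.write(rows)


# Step 4: a system that compiles and runs its registered flows in order.
class SimpleSystem:
    def __init__(self) -> None:
        self.flows: list[LinearFlow] = []

    def register_flow(self, flow: LinearFlow) -> None:
        self.flows.append(flow)

    def run(self) -> None:
        for flow in self.flows:
            flow.compile()
            flow.run()


source = MemoryDataset([{"value": 1}, {"value": 2}])
sink = MemoryDataset()
flow = LinearFlow(source, sink)
flow.add_component("double", DoubleValue())
flow.add_component("double_again", DoubleValue())
flow.connect("double", "double_again")
system = SimpleSystem()
system.register_flow(flow)
system.run()
```

Steps 5 and 6 need the real package: the system class is registered through `aptdata.plugins.registry` and then executed with `aptdata run`.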

---

## 4. CLI Reference

All static commands support `--json` for machine-readable JSON Lines output.
Without `--json`, commands render Rich tables, panels, and highlighted output.
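
A script that consumes `--json` output might parse it line by line, as in this sketch. It assumes one JSON object per line; the sample text and its `name` field are hypothetical, since the actual record fields are not specified here.

```python
import json
from typing import Any, Iterator


def parse_json_lines(text: str) -> Iterator[Any]:
    """Yield one decoded value per non-empty line of JSON Lines output."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical output; the real field names may differ.
sample = '{"name": "sales"}\n\n{"name": "inventory"}\n'
names = [record["name"] for record in parse_json_lines(sample)]
```

With a real command, the `text` argument would be the captured standard output, for example from `subprocess.run([...], capture_output=True, text=True).stdout`.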

### `aptdata run PIPELINE [--env ENV] [--dry-run]`

Execute a registered system.

### `aptdata monitor [--refresh SECONDS]`

Open the interactive TUI dashboard with three tabs:
- DAG View: ASCII pipeline topology
- Metrics: Resource usage and memory bar
- Agent Trace: Real-time event log

### `aptdata scaffold PROJECT_NAME [--output PATH] [--template TEMPLATE]`

Generate a project from a scaffold template (`hello-world`, `medallion`,
`rag-ingestion`, `data-quality-test`, `job-wheel`, `docker-compose-app`).

### `aptdata schema export --output PATH`

Export JSON Schema for declarative YAML configs.

### `aptdata mcp-start [--transport stdio|sse]`

Start the MCP (Model Context Protocol) server for AI agent integration.

### `aptdata system list|info|validate`

Inspect and validate registered systems in the plugin registry.

- `system list [--json]` – list all registered systems
- `system info NAME [--json]` – show system class, module, docstring
- `system validate NAME` – instantiate system and compile flows

### `aptdata plugin list|inspect|preview|load`

Manage registered reader/writer plugins.

- `plugin list [--json]` – list all readers and writers
- `plugin inspect NAME [--json]` – show constructor argument schema
- `plugin preview READER [--limit N]` – preview first N records
- `plugin load MODULE_PATH` – dynamically import a module

### `aptdata config validate|init|show|run`

Manage declarative YAML pipeline configurations.

- `config validate PATH` – parse and validate a YAML file
- `config init [--output PATH]` – generate a starter YAML template
- `config show PATH` – pretty-print YAML with syntax highlighting
- `config run PATH [--env ENV]` – parse, register, and execute

### `aptdata telemetry status|export`

Inspect OpenTelemetry configuration and export telemetry data.

- `telemetry status [--json]` – show OTel provider status
- `telemetry export [--format json]` – export collected telemetry

### `aptdata interactive`

Launch the guided, menu-driven interactive wizard. It uses `questionary` for
prompts and falls back to `typer.prompt` if `questionary` is unavailable.

---

## 5. MCP Integration (Model Context Protocol)

aptdata includes a FastMCP server enabling AI agents (Claude Desktop,
Copilot, Devin, etc.) to discover and execute pipelines.

### MCP Tools

- **`run_flow(flow_id: str)`** — Execute a registered flow/system. Returns
  `{"status": "completed"|"error", "flow_id": ..., "elapsed_seconds": ...}`.
- **`list_registered_systems()`** — Returns
  `{"systems": [...], "count": N}`.
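
The `run_flow` result shape can be reproduced with a small helper, shown here as a sketch rather than the server's code. `run_flow_result` and its `runner` callable are hypothetical, and rounding the elapsed time is an assumption.

```python
import time
from typing import Any, Callable


def run_flow_result(flow_id: str, runner: Callable[[], None]) -> dict[str, Any]:
    """Run `runner` and report the outcome in the documented result shape."""
    started = time.perf_counter()
    try:
        runner()
        status = "completed"
    except Exception:
        status = "error"
    return {
        "status": status,
        "flow_id": flow_id,
        "elapsed_seconds": round(time.perf_counter() - started, 3),
    }


def failing_runner() -> None:
    raise RuntimeError("boom")


ok = run_flow_result("daily_sales", lambda: None)
failed = run_flow_result("daily_sales", failing_runner)
```

A client can branch on `status` without parsing free-form text, which is convenient when the caller is an AI agent.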

### MCP Resources

- **`schema://datasets/{dataset_name}`** — Read metadata for a specific dataset.

### Starting the MCP Server

```bash
aptdata mcp-start                  # stdio transport (default)
aptdata mcp-start --transport sse  # SSE transport
```

### Claude Desktop Integration

Add to your Claude Desktop configuration file (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "aptdata": {
      "command": "aptdata",
      "args": ["mcp-start"]
    }
  }
}
```

---

## 6. Telemetry

`BaseComponent` subclasses are auto-instrumented via `__init_subclass__`,
which wraps `execute()` in an OpenTelemetry span carrying component metadata
attributes (`component_id`, `kind`, `tags`, `branch_on`, `description`).

Configure telemetry:

```python
from aptdata.telemetry import configure_telemetry
provider, meter = configure_telemetry(service_name="my-service")
```
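
The `__init_subclass__` technique behind the auto-instrumentation can be demonstrated without OpenTelemetry. In this sketch a plain list named `SPANS` stands in for a span exporter, and `InstrumentedBase` and `Upper` are hypothetical classes; the real `BaseComponent` records more attributes.

```python
import functools
import time
from typing import Any

SPANS: list[dict[str, Any]] = []  # stand-in for an OpenTelemetry exporter


class InstrumentedBase:
    component_id: str = "unnamed"

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        original = cls.__dict__.get("execute")
        if original is None:
            return  # this subclass does not define its own execute()

        @functools.wraps(original)
        def traced(self, *args: Any, **kw: Any) -> Any:
            started = time.perf_counter()
            try:
                return original(self, *args, **kw)
            finally:
                # Record the "span" even when execute() raises.
                SPANS.append({
                    "name": f"{cls.__name__}.execute",
                    "component_id": self.component_id,
                    "elapsed_seconds": time.perf_counter() - started,
                })

        cls.execute = traced


class Upper(InstrumentedBase):
    component_id = "upper"

    def execute(self, text: str) -> str:
        return text.upper()


result = Upper().execute("abc")
```

Because the wrapping happens when the subclass is defined, component authors write a plain `execute()` and get tracing without decorators or extra calls.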

---

## 7. Declarative Configuration

Systems can be defined via YAML and parsed with `YamlConfigParser`:

```python
from aptdata.config import YamlConfigParser
parser = YamlConfigParser()
config = parser.parse_file("pipeline.yaml")
```

Export schema: `aptdata schema export --output schema.json`

---

## 8. Workflow Layer

The `aptdata.core.workflow` module provides an alternative execution model:

- `IWorkflow` / `BaseWorkflow` — workflow with context, lifecycle hooks
  (`before_run`, `after_run`), topological sort via Kahn's algorithm
- `WorkflowNode` / `WorkflowEdge` — stdlib dataclasses with workflow reference
- Supports conditional branching via edge conditions
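
Kahn's algorithm, which the workflow layer uses for ordering, can be sketched as follows. The function name and the node and edge representation (string IDs and `(source, target)` tuples) are assumptions for illustration.

```python
from collections import deque


def topological_order(nodes: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Order nodes so every (source, target) edge points forward (Kahn's algorithm)."""
    indegree = {node: 0 for node in nodes}
    successors: dict[str, list[str]] = {node: [] for node in nodes}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1

    # Nodes with no unprocessed predecessors are ready to run.
    ready = deque(node for node in nodes if indegree[node] == 0)
    order: list[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in successors[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    # Any node left unprocessed is part of a cycle.
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


order = topological_order(
    ["extract", "clean", "enrich", "load"],
    [("extract", "clean"), ("extract", "enrich"), ("clean", "load"), ("enrich", "load")],
)
```

Here `load` is scheduled only after both `clean` and `enrich` have been processed, and a cyclic graph raises `ValueError` instead of looping forever.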

---

## 10. New Documentation Pages

The following dedicated documentation pages are available:

- **[Telemetry](telemetry.md)** — OpenTelemetry auto-instrumentation, span
  attributes, CLI commands (`telemetry status`, `telemetry export`), and
  backend integrations (Jaeger, OTLP).
- **[MCP Server](mcp.md)** — Full Model Context Protocol integration guide,
  available tools, resources, and AI agent setup instructions.
- **[Configuration](configuration.md)** — `aptdata.yaml` format, schema,
  CLI commands (`config validate`, `config init`, `config show`, `config run`),
  and environment variable substitution.
- **[Scaffold Templates](scaffold-templates.md)** — Project bootstrapping with
  `hello-world`, `medallion`, `rag-ingestion`, `data-quality-test`,
  `job-wheel`, and `docker-compose-app` templates.
- **[API – CLI Interactive](api/cli-interactive.md)** — Interactive wizard
  (`aptdata interactive`) powered by questionary.

---

## 11. Project Structure

```
aptdata/
├── __init__.py          # version
├── core/                # IDataset, IComponent, IFlow, ISystem, workflow
├── cli/                 # Typer CLI (run, monitor, scaffold, schema, mcp)
├── config/              # YAML parser, schema export
├── mcp/                 # FastMCP server (tools, resources)
├── plugins/             # System registry
├── telemetry/           # OpenTelemetry instrumentation
└── tui/                 # Textual monitoring dashboard
```

---

## 12. Links

- Source: https://github.com/strondata/smart-data
- Documentation: https://strondata.github.io/smart-data