# aptdata – Full Documentation

> Consolidated reference for AI agents and language models.

---

## 1. Overview

aptdata is a declarative, extensible Python framework for building smart data pipelines. It targets Python 3.10+ and relies on Pydantic for runtime validation and OpenTelemetry for automatic observability.

The framework enforces a two-layer contract system:

1. **Layer 1 – Interfaces (`I*`):** Pure `@dataclass + ABC` contracts that declare abstract methods with no fields or implementation.
2. **Layer 2 – Base classes (`Base*`):** `@pydantic_dataclass` implementations that add validated fields but keep abstract methods for users to implement.

Three universal abstractions form the core: **Component**, **Flow**, and **System**, plus the foundational **Dataset** type.

---

## 2. Architecture

The framework enforces a two-layer contract:

- Layer 1 (Interfaces): `IDataset`, `IComponent`, `IFlow`, `ISystem` – pure `@dataclass + ABC`
- Layer 2 (Base classes): `BaseDataset`, `BaseComponent`, `BaseFlow`, `BaseSystem` – `@pydantic_dataclass`

Hierarchy: Interfaces → Base classes → Your concrete implementations

### ComponentMeta & ComponentKind

Every `BaseComponent` carries a `ComponentMeta` with:

- `kind` (`ComponentKind` enum: TRANSFORM, FILTER, AGGREGATE, EXTRACT, LOAD, CUSTOM)
- `tags` (list of strings)
- `branch_on` (field name for conditional routing)
- `description` and `extra` metadata

### Flow Graph Primitives

- `FlowEdge(source_id, target_id, condition?)` – directed edge
- `FlowNode(component, flow?)` – node wrapper with back-reference

### Plugin Registry

```python
from aptdata.plugins import registry

# MySystem stands for your own BaseSystem subclass.
registry.register("my_system", MySystem)
system_cls = registry.get("my_system")
```

---

## 3. Getting Started

### Installation

```bash
git clone https://github.com/strondata/smart-data.git
# git clone names the checkout directory after the repository
cd smart-data
poetry install
```

### Verify

```bash
aptdata --help
```

### Building a System

1. Create a `BaseDataset` subclass (implement `read()`, `write()`)
2. Create a `BaseComponent` subclass (implement `validate_inputs()`, `execute()`)
3. Create a `BaseFlow` subclass (implement `add_component()`, `connect()`, `compile()`, `run()`)
4. Create a `BaseSystem` subclass (implement `register_flow()`, `run()`)
5. Register the system: `registry.register("name", SystemClass)`
6. Run: `aptdata run name`

---

## 4. CLI Reference

All static commands support `--json` for machine-readable JSON line output. Without `--json`, commands render Rich tables, panels, and highlighted output.

### `aptdata run PIPELINE [--env ENV] [--dry-run]`

Execute a registered system.

### `aptdata monitor [--refresh SECONDS]`

Open the interactive TUI dashboard with three tabs:

- DAG View: ASCII pipeline topology
- Metrics: Resource usage and memory bar
- Agent Trace: Real-time event log

### `aptdata scaffold PROJECT_NAME [--output PATH] [--template TEMPLATE]`

Generate a project from a scaffold template (hello-world, medallion, rag-ingestion, data-quality-test).

### `aptdata schema export --output PATH`

Export JSON Schema for declarative YAML configs.

### `aptdata mcp-start [--transport stdio|sse]`

Start the MCP (Model Context Protocol) server for AI agent integration.

### `aptdata system list|info|validate`

Inspect and validate registered systems in the plugin registry.

- `system list [--json]` – list all registered systems
- `system info NAME [--json]` – show system class, module, docstring
- `system validate NAME` – instantiate system and compile flows

### `aptdata plugin list|inspect|preview|load`

Manage registered reader/writer plugins.

- `plugin list [--json]` – list all readers and writers
- `plugin inspect NAME [--json]` – show constructor argument schema
- `plugin preview READER [--limit N]` – preview first N records
- `plugin load MODULE_PATH` – dynamically import a module

### `aptdata config validate|init|show|run`

Manage declarative YAML pipeline configurations.

- `config validate PATH` – parse and validate a YAML file
- `config init [--output PATH]` – generate a starter YAML template
- `config show PATH` – pretty-print YAML with syntax highlighting
- `config run PATH [--env ENV]` – parse, register, and execute

### `aptdata telemetry status|export`

Inspect OpenTelemetry configuration and export telemetry data.

- `telemetry status [--json]` – show OTel provider status
- `telemetry export [--format json]` – export collected telemetry

### `aptdata interactive`

Launch the guided, menu-driven interactive wizard. It uses questionary for prompts and falls back to `typer.prompt` if questionary is unavailable.

---

## 5. MCP Integration (Model Context Protocol)

aptdata includes a FastMCP server that lets AI agents (Claude Desktop, Copilot, Devin, etc.) discover and execute pipelines.

### MCP Tools

- **`run_flow(flow_id: str)`** – Execute a registered flow/system. Returns `{"status": "completed"|"error", "flow_id": ..., "elapsed_seconds": ...}`.
- **`list_registered_systems()`** – Returns `{"systems": [...], "count": N}`.

### MCP Resources

- **`schema://datasets/{dataset_name}`** – Read metadata for a specific dataset.

### Starting the MCP Server

```bash
aptdata mcp-start                   # stdio transport (default)
aptdata mcp-start --transport sse   # SSE transport
```

### Claude Desktop Integration

Add to your Claude Desktop configuration file (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "aptdata": {
      "command": "aptdata",
      "args": ["mcp-start"]
    }
  }
}
```

---

## 6. Telemetry

`BaseComponent` subclasses are auto-instrumented via `__init_subclass__`, which wraps `execute()` in an OpenTelemetry span carrying the component metadata attributes (`component_id`, `kind`, `tags`, `branch_on`, `description`).

Configure telemetry:

```python
from aptdata.telemetry import configure_telemetry

provider, meter = configure_telemetry(service_name="my-service")
```

---

## 7. Declarative Configuration

Systems can be defined via YAML and parsed with `YamlConfigParser`:

```python
from aptdata.config import YamlConfigParser

parser = YamlConfigParser()
config = parser.parse_file("pipeline.yaml")
```

Export schema: `aptdata schema export --output schema.json`

---

## 8. Workflow Layer

The `aptdata.core.workflow` module provides an alternative execution model:

- `IWorkflow` / `BaseWorkflow` – workflow with context, lifecycle hooks (`before_run`, `after_run`), topological sort via Kahn's algorithm
- `WorkflowNode` / `WorkflowEdge` – stdlib dataclasses with workflow reference
- Supports conditional branching via edge conditions

---

## 10. New Documentation Pages

The following dedicated documentation pages are available:

- **[Telemetry](telemetry.md)** – OpenTelemetry auto-instrumentation, span attributes, CLI commands (`telemetry status`, `telemetry export`), and backend integrations (Jaeger, OTLP).
- **[MCP Server](mcp.md)** – Full Model Context Protocol integration guide, available tools, resources, and AI agent setup instructions.
- **[Configuration](configuration.md)** – `aptdata.yaml` format, schema, CLI commands (`config validate`, `config init`, `config show`, `config run`), and environment variable substitution.
- **[Scaffold Templates](scaffold-templates.md)** – Project bootstrapping with `hello-world`, `medallion`, `rag-ingestion`, `data-quality-test`, `job-wheel`, and `docker-compose-app` templates.
- **[API – CLI Interactive](api/cli-interactive.md)** – Interactive wizard (`aptdata interactive`) powered by questionary.

---

## 11. Project Structure

```
aptdata/
├── __init__.py   # version
├── core/         # IDataset, IComponent, IFlow, ISystem, workflow
├── cli/          # Typer CLI (run, monitor, scaffold, schema, mcp)
├── config/       # YAML parser, schema export
├── mcp/          # FastMCP server (tools, resources)
├── plugins/      # System registry
├── telemetry/    # OpenTelemetry instrumentation
└── tui/          # Textual monitoring dashboard
```

---

## 12. Links

- Source: https://github.com/strondata/smart-data
- Documentation: https://strondata.github.io/smart-data
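
The Workflow Layer section says that `BaseWorkflow` orders its nodes with a topological sort via Kahn's algorithm. The sketch below illustrates that technique in plain Python using only the standard library. It does not use or reproduce aptdata's code: `Edge` and `topological_order` are hypothetical names chosen for this illustration, and only the `source_id` / `target_id` field names are borrowed from `FlowEdge`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for a directed edge between two node ids."""
    source_id: str
    target_id: str


def topological_order(node_ids: list[str], edges: list[Edge]) -> list[str]:
    """Return node ids in dependency order using Kahn's algorithm."""
    in_degree = {node_id: 0 for node_id in node_ids}
    successors: dict[str, list[str]] = {node_id: [] for node_id in node_ids}
    for edge in edges:
        successors[edge.source_id].append(edge.target_id)
        in_degree[edge.target_id] += 1

    # Seed the queue with nodes that have no incoming edges.
    ready = deque(n for n in node_ids if in_degree[n] == 0)
    order: list[str] = []
    while ready:
        current = ready.popleft()
        order.append(current)
        # Emitting a node "removes" its outgoing edges from the graph.
        for nxt in successors[current]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Any node left unvisited is part of a cycle.
    if len(order) != len(node_ids):
        raise ValueError("cycle detected: no valid execution order exists")
    return order


edges = [
    Edge("extract", "clean"),
    Edge("extract", "enrich"),
    Edge("clean", "load"),
    Edge("enrich", "load"),
]
print(topological_order(["extract", "clean", "enrich", "load"], edges))
# prints ['extract', 'clean', 'enrich', 'load']
```

A node is emitted only once all of its predecessors have been emitted, so `load` runs after both `clean` and `enrich`. A graph containing a cycle leaves some nodes with a non-zero in-degree, which the final length check reports as an error.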