Delta — kb/domain_model vs. in-package domain-modelling

Why two domain models exist

The in-package domain-modelling/ directory is anchored to automated generation: gen_linkml_schema.py walks every importable module under the commodity_hindcast package, finds Pydantic models, dataclasses, and Enum subclasses, and emits a LinkML YAML schema. The output is exhaustive on Python classes: it captures every attribute, required flag, range annotation, and enum value that the Pydantic type system can express. It is, however, explicitly labelled "structural only" in its own header: "Editorial information (DDD role, bounded context, identifier) belongs in human-maintained Markdown, not here."

This kb/domain_model is that human-maintained layer. It names value objects (Commodity, SeasonYear, InitDate, Region, Yield, Fold, RunDir) that are NewType aliases, integer fields, or path conventions rather than Pydantic classes, so the generator skips them entirely. It documents behavioural roles (FeatureBuilder, Detrender, Regressor, MetaModel, ReferenceYieldLoader) that are Protocol or ABC abstractions, which are also outside the generator's scope. It performs DDD aggregate analysis, draws explicit consistency-boundary decisions, identifies eleven bounded contexts with a Mermaid context map, and names the one known layering violation. The two views are deliberately complementary: schema.yaml is the authoritative machine-readable record of Pydantic field-level details; this kb model is the authoritative conceptual scaffold.
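Why a class-based generator skips NewType aliases and Protocol roles can be illustrated with a minimal sketch. This is not the code of gen_linkml_schema.py: the predicate is_schema_candidate, the FoldSpec dataclass, and the Mode enum are hypothetical, and the Pydantic BaseModel check is left out so that the example depends only on the standard library.

```python
import dataclasses
import enum
from typing import NewType, Protocol

# A value object expressed as a NewType alias (name taken from the kb model).
SeasonYear = NewType("SeasonYear", int)


class Detrender(Protocol):
    """A behavioural role: a contract, not a data class."""

    def fit(self, values: list[float]) -> None: ...


@dataclasses.dataclass
class FoldSpec:  # hypothetical dataclass, for illustration only
    label: str
    test_years: tuple[int, ...]


class Mode(enum.Enum):  # hypothetical enum, for illustration only
    HINDCAST = "hindcast"
    FORECAST = "forecast"


def is_schema_candidate(obj: object) -> bool:
    """Return True for objects a class-based discovery pass would emit."""
    if not isinstance(obj, type):
        # NewType aliases are callables, not classes, so they stop here.
        return False
    if dataclasses.is_dataclass(obj):
        return True
    # Concrete Enum subclasses qualify; the Enum base class itself does not.
    return issubclass(obj, enum.Enum) and obj is not enum.Enum


candidates = {
    name: is_schema_candidate(obj)
    for name, obj in {
        "SeasonYear": SeasonYear,
        "Detrender": Detrender,
        "FoldSpec": FoldSpec,
        "Mode": Mode,
    }.items()
}
```

Under these assumptions only FoldSpec and Mode are selected. SeasonYear is rejected because it is not a class, and Detrender because a Protocol is neither a dataclass nor an Enum.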

What the in-package domain model contains

DOMAIN_MODEL.md

DOMAIN_MODEL.md (market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL.md) is described in its own header as the "editorial domain model": ubiquitous language, bounded contexts, aggregates, invariants, the slice abstraction, and the package import DAG. It was written by hand and is intended to be maintained by hand. Its structure is:

  • §1 Ubiquitous Language — six vocabulary tables (temporal, geographic, data-shape, unit, lifecycle, model-mechanics).
  • §2 Bounded Contexts — a table of seven contexts with aggregate roots and ubiquitous language per context.
  • §3 Mermaid Domain Diagram — a full flowchart of the pipeline with inputs, lib layer, stage chain, cross-cutting tracking, and outputs. Includes a companion ER as two focused sub-diagrams.
  • §4 Markdown Domain Model — entity list, value-object list, relationship table, aggregate table, pipeline phase table, key invariants, and a validation-scenario walk-through.
  • §5 Structural schema pointer.
  • §6 Validation checklist.
  • §7 Slice abstraction — the AbstractSlice protocol, the HindcastSlice / ForecastSlice concrete shapes, and ExperimentResult.
  • §8 Package import DAG — the single-direction layer rules and the cycle audit.

Strengths: the ubiquitous language glossary and the import DAG rules are thorough and directly code-derived. The slice abstraction section (§7) is the canonical explanation of AbstractSlice, cutoff, and the training delegation pattern.

Limitations: the bounded-context table lists seven contexts; this kb model refines and extends that to eleven (splitting out Reference Data, Geo & Identifiers, and Dashboard as first-class contexts). DOMAIN_MODEL.md has no value-object entries for Commodity, SeasonYear, InitDate, Region, Yield, or Fold as domain concepts in their own right, and no DDD aggregate-boundary rationale beyond the invariant tables. The Mermaid flowchart is a pipeline execution diagram, not a context map.

DOMAIN_MODEL2.md

DOMAIN_MODEL2.md (market_insights_models/src/commodity_hindcast/domain-modelling/DOMAIN_MODEL2.md) carries the version note "Version 2.0 (rewrite for canonical structure), last updated 2026-04-29". It is structured as nine numbered sections and three appendices:

  • §1 Overview — a two-mode summary with a top-level Mermaid flowchart.
  • §2 Glossary — single alphabetised table of domain terms.
  • §3 Bounded Contexts — eight-context table with purpose, owned entities, ubiquitous language, and integration columns.
  • §4 Entity Catalogue — nine entries in dependency order (ExperimentConfig, CommodityConfig, Builder, ExperimentResult, HindcastSlice, ForecastSlice, HindcastDelivery → DeliveryRow, EditRule, Check), followed by a compact value-object table for ten further VOs.
  • §5 Relationships — canonical relationship table with cardinality.
  • §6 Aggregates & Invariants — five aggregate root entries.
  • §7 Lifecycles — five lifecycle sections (pipeline DAG, preflight gate, fold transition, hindcast-vs-forecast mode switch, EditRule fire-and-apply), each with a Mermaid diagram.
  • §8 Scenarios — four concrete walk-throughs against real function calls.
  • §9 Open Questions — known ambiguities (marketing year, selection bias correction, wheat sub-types, CalibrationResult, context boundary for conformal helpers).
  • Appendix A — Mermaid ERD attribute reference.
  • Appendix B — LinkML schema pointer.
  • Appendix C — Change log.

Strengths: the lifecycle sections (§7) and the scenarios (§8) are a substantial addition over DOMAIN_MODEL.md; they give concrete function-call sequences for the most common workflows. The open questions section (§9) names unresolved design ambiguities. The entity catalogue follows a consistent template.

Limitations: the bounded-context table lists eight contexts (vs. eleven in this kb model; Reference Data and Geo & Identifiers are absent as named contexts). Aggregate coverage is five roots vs. seven in this kb model (the protocol-level aggregate ExperimentProtocolConfig + Fold schedule and the Check list aggregate are absent). No Mermaid context-map (dependency-direction) diagram exists. No anti-corruption layer analysis. The entity catalogue gives field tables for Pydantic classes but does not name value objects (Commodity, SeasonYear, InitDate, Fold, RunDir) as conceptual domain objects.

schema.yaml

schema.yaml (market_insights_models/src/commodity_hindcast/domain-modelling/schema.yaml) is the LinkML source-of-truth for auto-generated structural documentation. It is regenerated by running regen_schema.sh (which invokes gen_linkml_schema.py) against the live package. It captures: every Pydantic BaseModel and dataclass as a LinkML class; every field as an attribute with required, range, and multivalued flags; every Enum as a LinkML enum. The source_module annotation on each class traces back to the Python module that defines it.

Authoritative for: class names, attribute names, attribute types, required vs optional, multivalued flags, enum values. If schema.yaml and any markdown doc disagree on an attribute name or type, schema.yaml is correct.
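The precedence rule above lends itself to a small automated check. The sketch below assumes schema.yaml has already been parsed into a dictionary with the usual LinkML shape (classes → class name → attributes → attribute name). The function find_stale_attributes and the sample attribute names season_year and adm_level are hypothetical.

```python
from typing import Any


def find_stale_attributes(
    schema: dict[str, Any],
    documented: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return documented attribute names that the schema does not define.

    Attributes that exist in the schema but are not documented are ignored,
    because a Markdown doc is not expected to enumerate every field.
    """
    classes = schema.get("classes") or {}
    stale: dict[str, set[str]] = {}
    for class_name, doc_attrs in documented.items():
        class_def = classes.get(class_name) or {}
        schema_attrs = set(class_def.get("attributes") or {})
        unknown = doc_attrs - schema_attrs
        if unknown:
            stale[class_name] = unknown
    return stale


# Hypothetical parsed schema fragment, for illustration only.
parsed_schema = {
    "classes": {
        "DeliveryRow": {
            "attributes": {
                "geo_identifier": {"range": "string", "required": True},
                "season_year": {"range": "integer", "required": True},
            }
        }
    }
}

# Attribute names quoted by a Markdown doc (also hypothetical).
doc_claims = {"DeliveryRow": {"geo_identifier", "adm_level"}}

stale = find_stale_attributes(parsed_schema, doc_claims)
```

Any name reported as stale is, by the rule above, an error in the Markdown doc rather than in schema.yaml.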

Not present in schema.yaml: value objects that are NewType aliases (e.g. GeoIdentifier); Protocol / ABC classes (AbstractDetrend, AbstractRegressionImpl, BuilderFn, AbstractSlice); module-level functions (apply_conformal, primary_calibration); the CalibrationResult class from models/meta_models/conformalise.py (flagged separately below).

What this kb/domain_model adds

  • Value-object entities (Commodity, SeasonYear, InitDate, Region, Yield, Fold, RunDir) | ENTITIES.md Tier 1 | These are not Python classes — they are NewType aliases, string/int/date fields, or path conventions. The LinkML generator skips them because it only discovers BaseModel, dataclass, and Enum subclasses. They are real domain objects with cardinality and constraints worth naming: SeasonYear is the primary grouping key in feature parquets, fold labels, and delivery CSVs; InitDate governs the in-season feature cutoff; GeoIdentifier / Region is the canonical join key across every artefact.

  • Behavioural roles (FeatureBuilder, Detrender, Regressor, MetaModel, ReferenceYieldLoader) | ENTITIES.md Tier 4 | Protocol and ABC abstractions are not Pydantic models; gen_linkml_schema.py does not emit them. Yet they define the extension contracts of the pipeline: new detrenders, regressors, and builders are added by subclassing or implementing these roles, not by modifying concrete classes.

  • DDD aggregate analysis with rationale | AGGREGATES.md | The in-package models give aggregate tables with invariants but not the design rationale for why boundaries were drawn where they are. AGGREGATES.md adds four "Aggregate boundary decision" sections explaining why ExperimentConfig is one aggregate rather than seven, why ExperimentResult is a lazy handle, why HindcastSlice and ForecastSlice are children rather than independent roots, and why HindcastDelivery is separate from ExperimentResult.

  • Two additional aggregates (ExperimentProtocolConfig + Fold schedule and Check list) | AGGREGATES.md | DOMAIN_MODEL2.md §6 covers five aggregate roots. This kb model identifies seven, adding the ExperimentProtocolConfig + Fold schedule aggregate (the walk-forward CV schedule, which has its own consistency rule: test_years must be non-empty and ordered) and the Check list aggregate (ephemeral per-preflight-call gate).

  • Eleven bounded contexts with a Mermaid context map and anti-corruption layer analysis | BOUNDED_CONTEXTS.md | DOMAIN_MODEL.md §2 lists seven contexts; DOMAIN_MODEL2.md §3 lists eight. This kb model identifies eleven, splitting out Reference Data, Geo & Identifiers, and Dashboard as first-class contexts with their own ubiquitous language, public surface, and boundary contracts. It also adds an explicit anti-corruption layer table naming the six translation points (raw NASS units → kg/ha, FIPS → GeoIdentifier, internal kg/ha → delivery units, WASDE marketing-year alignment, observed ERA5 + climo → forecast feature matrix, and the tracked tech-debt edge from delivery/conversions.py into stages/run_meta_models.py).

  • Three Mermaid ER diagrams organised by concern | ER_DIAGRAM.md | The in-package models have a pipeline-flow Mermaid flowchart and two companion ER diagrams in DOMAIN_MODEL.md. This kb model produces three new diagrams partitioned by concern (configuration aggregate, pipeline artefacts, behavioural roles) with field-level attribute blocks, enabling a reader to navigate the domain by structure rather than by execution order.

  • Five-tier entity organisation | ENTITIES.md | The in-package entity catalogue groups by dependency order or alphabetically. This kb model organises by tier: Tier 1 (core domain value objects), Tier 2 (configuration aggregates), Tier 3 (pipeline artefacts), Tier 4 (behavioural roles), Tier 5 (delivery and validation). Each tier has a distinct colour in diagrams and a distinct "kind" label in the top-tier table in this README.

  • Explicit NassSpec / ConformalExperiment / ADM0Row / ADM1Row / ADM2Row retractions | ENTITIES.md Notes section | The orchestrator's seed vocabulary contained three names or name groups that do not exist in code. ENTITIES.md documents these corrections with exact evidence so that future LLM passes do not reintroduce the phantom names.

  • 73 named inter-entity relationships in nine sections | RELATIONSHIPS.md | DOMAIN_MODEL2.md §5 gives a single relationship table with twelve rows. This kb model enumerates 73 relationships across configuration, pipeline, reference-data, feature-assembly, delivery, behavioural-role, cross-cutting, artefact-schema, and temporal-vocabulary sections, each with a code citation.
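The fold-schedule consistency rule mentioned in the list above (test_years must be non-empty and ordered) can be sketched as a value object that validates itself on construction. This is an illustration, not the package's implementation: the label field is hypothetical, and "ordered" is assumed here to mean strictly ascending.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fold:
    """One walk-forward CV fold (illustrative shape only)."""

    label: str  # hypothetical field
    test_years: tuple[int, ...]

    def __post_init__(self) -> None:
        if not self.test_years:
            raise ValueError(f"fold {self.label!r}: test_years must be non-empty")
        # Assumption: "ordered" means strictly ascending season years.
        pairs = zip(self.test_years, self.test_years[1:])
        if any(earlier >= later for earlier, later in pairs):
            raise ValueError(f"fold {self.label!r}: test_years must be ascending")


valid_fold = Fold(label="fold_2020", test_years=(2020, 2021))
```

Enforcing the rule in the constructor means an invalid schedule can never be held in memory, which is the practical meaning of treating the schedule as its own consistency boundary.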

Where the two models overlap and one will become stale

Pydantic class attributes are exhaustively covered by schema.yaml and DOMAIN_MODEL.md. ENTITIES.md should not try to enumerate every attribute — it should cite the LinkML doc and quote only the fields that matter for the conceptual model (identity, discriminator, key invariant-bearing fields). If schema.yaml is regenerated and a new attribute appears, ENTITIES.md need not be updated unless the attribute changes the aggregate boundary or the entity's conceptual role.

Entity catalogue entries for core Pydantic classes (ExperimentConfig, CommodityConfig, HindcastSlice, ForecastSlice, HindcastDelivery, DeliveryRow, ExperimentResult) appear in both DOMAIN_MODEL2.md §4 and ENTITIES.md Tiers 2–3. These entries will diverge over time. The maintenance rule is: schema.yaml is the source of truth for field-level structural details; DOMAIN_MODEL2.md §4 is the source of truth for narrative descriptions and code-citations; ENTITIES.md is the source of truth for DDD classification (kind, tier, aggregate membership).

Lifecycle and scenario sections in DOMAIN_MODEL2.md §7–8 may overlap with future kb pipelines/ pages written in Phase P5. When those pages are written, they should cite DOMAIN_MODEL2.md §7–8 directly and not duplicate the walk-through prose; they should extend it with artefact schema details and cross-links to entity pages.

Maintenance rule: schema.yaml is the source of truth for Pydantic field-level details; this kb is the source of truth for the conceptual model (tiers, aggregates, bounded contexts, anti-corruption layers, relationship semantics).

Where the two models disagree

The actors writing ENTITIES.md identified three cases where the orchestrator's seed vocabulary (which was informed by the in-package models) names things that do not exist in code, or mis-classifies things that do. Two further discrepancies, on CalibrationResult persistence and on ConformalConfig, follow them:

NassSpec — appears in the orchestrator's seed list and is implicitly present in DOMAIN_MODEL.md §1 (the ubiquitous language glossary lists NASS as a data source). In code, NASS yield is loaded by YieldsBuilder as a feature, not through the ReferenceYieldSpec discriminated union. The union is WasdeRefSpec | ConabFinalRefSpec | ConabLevantamentoRefSpec (lib/reference_data/loader.py:59). There is no NassSpec class anywhere in the package. The kb model is correct; the seed vocabulary was wrong.
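A closed three-member union explains why a NASS spec cannot simply be assumed to exist. The sketch below uses the three class names cited above, but everything else is hypothetical: the kind discriminator, its string values, and parse_reference_spec are illustrations. The real classes in lib/reference_data/loader.py may use a different discriminator and carry additional fields.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class WasdeRefSpec:
    kind: Literal["wasde"] = "wasde"  # hypothetical discriminator value


@dataclass(frozen=True)
class ConabFinalRefSpec:
    kind: Literal["conab_final"] = "conab_final"  # hypothetical


@dataclass(frozen=True)
class ConabLevantamentoRefSpec:
    kind: Literal["conab_levantamento"] = "conab_levantamento"  # hypothetical


ReferenceYieldSpec = Union[WasdeRefSpec, ConabFinalRefSpec, ConabLevantamentoRefSpec]

_SPEC_BY_KIND = {
    "wasde": WasdeRefSpec,
    "conab_final": ConabFinalRefSpec,
    "conab_levantamento": ConabLevantamentoRefSpec,
}


def parse_reference_spec(raw: dict) -> ReferenceYieldSpec:
    """Select the union member named by the discriminator, or fail loudly."""
    kind = raw.get("kind")
    spec_cls = _SPEC_BY_KIND.get(kind)
    if spec_cls is None:
        raise ValueError(f"unknown reference spec kind: {kind!r}")
    return spec_cls()


wasde_spec = parse_reference_spec({"kind": "wasde"})
```

In this sketch a configuration naming a "nass" kind is rejected at parse time. That matches the account above, where NASS yield enters the pipeline as a feature rather than as a reference spec.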

ConformalExperiment — the seed list mentioned a class by this name. The module models/meta_models/conformalise.py defines CalibrationResult and a module-level function apply_conformal but no ConformalExperiment class. The kb model drops this name; CalibrationResult is the correct entity.

ADM0Row / ADM1Row / ADM2Row — the orchestrator's seed list and an informal reading of DOMAIN_MODEL.md suggest three separate delivery row classes for different ADM levels. In code there is one DeliveryRow class (delivery/schemas.py:109); the ADM level is determined by the geo_identifier prefix on each row, not by the class. DOMAIN_MODEL2.md §4.7 also uses the single-class description. Both the kb model and DOMAIN_MODEL2.md are correct; the seed vocabulary's three-class framing was wrong.
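The single-class design can be sketched as one row type whose ADM level is derived from its identifier. The prefix strings below are invented for illustration, since the real geo_identifier format is not given here. The yield_value field and the adm_level property are likewise hypothetical.

```python
from dataclasses import dataclass

# Hypothetical prefix convention; the package's actual format may differ.
_ADM_LEVEL_BY_PREFIX = {"adm0:": 0, "adm1:": 1, "adm2:": 2}


@dataclass(frozen=True)
class DeliveryRow:
    """One delivery row; the same class serves every ADM level."""

    geo_identifier: str
    yield_value: float  # hypothetical field

    @property
    def adm_level(self) -> int:
        """Derive the ADM level from the identifier prefix."""
        for prefix, level in _ADM_LEVEL_BY_PREFIX.items():
            if self.geo_identifier.startswith(prefix):
                return level
        raise ValueError(f"unrecognised geo_identifier: {self.geo_identifier!r}")


rows = [
    DeliveryRow("adm0:US", 10.5),
    DeliveryRow("adm1:US-IA", 11.2),
    DeliveryRow("adm2:US-IA-153", 12.0),
]
levels = [row.adm_level for row in rows]
```

Because the level lives in the data rather than in the type, rows of all levels share one schema in this sketch.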

CalibrationResult persistence: AGGREGATES.md notes that conformal half-widths are computed by primary_calibration() and written into postprocessed/national.parquet and sidecar parquets under run_dir/conformal/, and that CalibrationResult is a transient value returned from stages/run_meta_models.py. DOMAIN_MODEL2.md §9 also lists it as an open question. ENTITIES.md Tier 3 documents the class as it appears in the source (models/meta_models/conformalise.py:111) with its persistence methods (save / load). The discrepancy is between the code definition (class has save/load methods writing to a parquet) and the AGGREGATES.md note (marks it transient). The ENTITIES.md account based on reading the source file is correct; the AGGREGATES.md note is a conservative hedge.

ConformalConfig as a separate class: ENTITIES.md documents ConformalConfig as a Tier 2 entity describing the tuple of conformal modes inside PostprocessConfig. schema.yaml does not emit a separate ConformalConfig class; the conformalise field is just a multivalued string attribute on PostprocessConfig. The two accounts are consistent: there is no Pydantic class named ConformalConfig, and ENTITIES.md uses the name as a conceptual label for the config sub-component. Both are correct; schema.yaml is more precise on the implementation.

Recommendations

Items that should eventually be added to the in-package model:

  • Document GeoIdentifier, SeasonYear, InitDate, and Fold as named value objects in DOMAIN_MODEL.md §4.2 (Value Objects table). They are real domain concepts with cardinality constraints, not mere implementation details.
  • Add Reference Data and Geo & Identifiers as named bounded contexts in DOMAIN_MODEL.md §2 and DOMAIN_MODEL2.md §3. Both are currently described inline as "lib/ helpers" without their own context entry.
  • Add the Check list aggregate to the aggregates table in DOMAIN_MODEL2.md §6.
  • Move the anti-corruption layer analysis from this kb into DOMAIN_MODEL.md §8.1 (Import DAG layer rules) once the tracked tech-debt items are resolved.

Items this kb should keep mirroring from the in-package model on each pass:

  • Every new Pydantic class appearing in schema.yaml after regen_schema.sh is run should be checked against ENTITIES.md. If the class occupies a new domain role, add a Tier 2 or Tier 3 entry; if it is a refinement of an existing config sub-component, update the relevant Key attributes section and add a note.
  • The open questions in DOMAIN_MODEL2.md §9 should be reflected in the BOUNDED_CONTEXTS.md open questions section. When a question is resolved, update both.
  • The lifecycle section of DOMAIN_MODEL2.md §7 is the authoritative description of fold transitions and mode switches; kb pipeline pages (Phase P5) should cite it directly.

Regeneration policy: after regen_schema.sh is run and schema.yaml is updated, review ENTITIES.md for any new Pydantic classes or changed attribute sets. Attribute-level changes do not require an ENTITIES.md update unless they affect the entity's aggregate membership or a named invariant. New classes should be classified by tier before being added to ENTITIES.md; if their tier is ambiguous, add a note to the AGGREGATES.md open questions.
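The review step in this policy can be narrowed by diffing the schema before and after regeneration. The sketch below assumes both versions of schema.yaml have been parsed into dictionaries with the usual LinkML classes / attributes layout. The function diff_schema_snapshots and the sample class and attribute names are hypothetical.

```python
from typing import Any


def _attribute_names(schema: dict[str, Any], class_name: str) -> set[str]:
    """Attribute names of one class, tolerating missing or empty sections."""
    class_def = (schema.get("classes") or {}).get(class_name) or {}
    return set(class_def.get("attributes") or {})


def diff_schema_snapshots(
    old: dict[str, Any],
    new: dict[str, Any],
) -> dict[str, list[str]]:
    """List classes that are new and classes whose attribute set changed."""
    old_names = set(old.get("classes") or {})
    new_names = set(new.get("classes") or {})
    changed = [
        name
        for name in sorted(old_names & new_names)
        if _attribute_names(old, name) != _attribute_names(new, name)
    ]
    return {
        "new_classes": sorted(new_names - old_names),  # classify by tier
        "changed_attribute_sets": changed,  # check aggregates and invariants
    }


# Hypothetical snapshots, for illustration only.
before = {"classes": {"PostprocessConfig": {"attributes": {"conformalise": {}}}}}
after = {
    "classes": {
        "PostprocessConfig": {"attributes": {"conformalise": {}, "smoothing": {}}},
        "ExampleNewConfig": {"attributes": {}},
    }
}

report = diff_schema_snapshots(before, after)
```

Entries under new_classes need a tier decision. Entries under changed_attribute_sets need an ENTITIES.md edit only if the change touches aggregate membership or a named invariant.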