The graph
PIPES uses a graph to represent the integrated modeling pipeline. All activities related to an integrated modeling project (e.g., the project, models, datasets, and tasks) are represented as nodes in the graph, and their relationships are represented by edges. Metadata associated with project activities is stored as properties on the nodes and edges in the graph database. PIPES uses these properties to query the graph, run validations, check status, send notifications, and so on.
Here we show an example of a PIPES project in the graph database that contains all node and edge types.
Graphs are well suited to represent entity relationships and data flows, making them perfect for data pipelines. Graphs also have the advantage of supporting unstructured querying and query evolution. This means that as the requirements of a project evolve and new types of queries need to be performed, the graph structure can easily adapt to these changes without requiring a complete overhaul of the system.
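To make the node/edge/property idea concrete, here is a minimal in-memory sketch of a property graph. It is not the PIPES implementation or its database schema: the `Node`, `Edge`, and `Graph` classes, the node IDs, and the edge names `runs` and `requires` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str  # node type, e.g. "Project" or "Model"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str  # node_id of the start node
    target: str  # node_id of the end node
    name: str    # relationship name (hypothetical values below)
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        # Refuse edges whose endpoints have not been added yet.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def neighbors(self, node_id, edge_name=None):
        # Follow outgoing edges, optionally restricted to one edge name.
        return [
            self.nodes[e.target]
            for e in self.edges
            if e.source == node_id and (edge_name is None or e.name == edge_name)
        ]

    def find(self, **props):
        # A query added later: match nodes on arbitrary properties.
        return [
            n for n in self.nodes.values()
            if all(n.properties.get(k) == v for k, v in props.items())
        ]


graph = Graph()
graph.add_node(Node("p1", "Project", {"name": "LA100"}))
graph.add_node(Node("r0", "Project Run", {"name": "LA100 Run 0"}))
graph.add_node(Node("m1", "Model", {"name": "dsgrid"}))
graph.add_edge(Edge("p1", "r0", "runs"))      # illustrative edge name
graph.add_edge(Edge("r0", "m1", "requires"))  # illustrative edge name

models = [n.properties["name"] for n in graph.neighbors("r0", "requires")]
```

Note that `find` was added without touching the stored nodes or edges, which is the kind of query evolution the paragraph above describes.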
Graph levels
The PIPES graph has two levels, which distinguish the planning phase (Level 1) from the operational activity phase (Level 2) of a project.
Level 1:
- Highest level of the graph
- Represents the planning details of a project.
- Nodes include Project, Project Run, and Model; edges capture handoff expectations and scheduling
- Initialized after a PI/data manager submits a Project Config
Level 2:
- Lower level of the graph
- Represents the operational activities of a project
- Nodes include Model Run, Dataset, and Task (QAQC, Visualization, Transformation)
- The Model Run Config initializes Level 2 activity in the graph for each Model Run
- Other configs that initialize lower-level activities include the Dataset Config, Task Creation, and Task Planning configs
Although PIPES was designed to manage integrated modeling projects from planning through execution, users can choose to work with Level 1 alone if they only need PIPES for planning.
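The two-step initialization described above can be sketched as follows. This is only an illustration under stated assumptions: the config dictionaries, their keys (`project_runs`, `models`, `model`), and the `init_level1`/`init_level2` helpers are hypothetical and do not reflect the actual PIPES config schema or API.

```python
def init_level1(project_config):
    """Create Level 1 nodes from a (hypothetical) project config dict."""
    nodes = [{"type": "Project", "name": project_config["name"], "level": 1}]
    for run in project_config["project_runs"]:
        nodes.append({"type": "Project Run", "name": run["name"], "level": 1})
        for model in run["models"]:
            nodes.append({"type": "Model", "name": model, "level": 1})
    return nodes


def init_level2(nodes, model_run_config):
    """Append a Level 2 Model Run node under an already planned Model."""
    model = model_run_config["model"]
    planned = any(n["type"] == "Model" and n["name"] == model for n in nodes)
    if not planned:
        raise ValueError(f"model {model!r} is not planned at Level 1")
    nodes.append(
        {"type": "Model Run", "name": model_run_config["name"],
         "level": 2, "model": model}
    )
    return nodes


# Hypothetical project config; only the names come from the examples in this page.
project_config = {
    "name": "LA100",
    "project_runs": [{"name": "LA100 Run 0", "models": ["dsgrid"]}],
}

graph_nodes = init_level1(project_config)
level1_count = len(graph_nodes)  # planning-only use could stop here

# Hypothetical model run config that starts Level 2 activity.
init_level2(graph_nodes, {"name": "dsgrid run 1", "model": "dsgrid"})
```

A planning-only workflow would call just the first step; operational activity is added only once a model run config arrives.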
Warning
PIPES does not currently support smart (incremental) updates. If a configuration file needs to be updated, the project's graph pipeline is regenerated. Support for updating a project pipeline in place is under development.
Node types
There are 8 node types in the PIPES graph:
Node type | Description | Level |
---|---|---|
Project | The root node for any PIPES project (e.g., “LA100”). | 1 |
Project Run | Represents different scrum-like phases of a project (e.g., “LA100 Run 0”, “LA100 Run Final”). It may have different requirements, assumptions, or scenarios from the project (e.g., it might cover a subset of the scenarios or model-years considered). | 1 |
Model | Represents different modeling activities (e.g., “dsgrid”). A descendant of a project run, part of the actual pipeline topology. | 1 |
Model Run | An instance of running a model with certain specifications. | 2 |
Dataset | An artifact that is part of the final project results or is produced/consumed by other models in the project. | 2 |
Task | Represents a task performed on one or more datasets. Task types include transformation, QAQC, and visualization. | 2 |
Modeling Team | Node representing a modeling team, which consists of users. | N/A |
User | The human participants in a project. | N/A |
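The table above can be mirrored in code as a small lookup. The sketch below is one way to do that with a Python `Enum`; the `NodeType` class and the `types_at_level` helper are hypothetical names, not part of PIPES, and `None` stands in for the "N/A" level.

```python
from enum import Enum


class NodeType(Enum):
    # Each member carries (display label, graph level); None means "N/A".
    PROJECT = ("Project", 1)
    PROJECT_RUN = ("Project Run", 1)
    MODEL = ("Model", 1)
    MODEL_RUN = ("Model Run", 2)
    DATASET = ("Dataset", 2)
    TASK = ("Task", 2)
    MODELING_TEAM = ("Modeling Team", None)
    USER = ("User", None)

    def __init__(self, label, level):
        self.label = label
        self.level = level


def types_at_level(level):
    """Return the labels of all node types that belong to the given level."""
    return [t.label for t in NodeType if t.level == level]


planning_types = types_at_level(1)
operational_types = types_at_level(2)
```

Keeping the level next to the type makes it easy to restrict a query or a validation to the planning or the operational part of the graph.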
Edge types
Edges between nodes of specific types have unique names that describe the relationship between those nodes. Although some edge type names are synonyms (e.g., generated and output), the unique names make it easier to query the graph for specific node relationships.
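As a rough illustration of why distinct edge names help, the sketch below indexes edges by name so that one relationship can be retrieved without inspecting node types. The edge names `output` and `generated` come from the text above; which node types they connect, the node identifiers, and the `index_by_name` helper are assumptions made for this example.

```python
from collections import defaultdict

# (source node, edge name, target node); the endpoints are hypothetical.
edges = [
    ("model_run:dsgrid-1", "output", "dataset:load-profiles"),
    ("task:transform-1", "generated", "dataset:load-profiles-hourly"),
    ("model_run:dsgrid-2", "output", "dataset:load-profiles-v2"),
]


def index_by_name(edge_list):
    """Group (source, target) pairs under their edge name."""
    index = defaultdict(list)
    for source, name, target in edge_list:
        index[name].append((source, target))
    return index


by_name = index_by_name(edges)

# Only datasets written by model runs, with no need to check node types.
model_run_outputs = [target for _, target in by_name["output"]]
```

If both relationships shared a single name, the same query would have to filter on the type of the source node as well.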
The handoffs
A key feature of PIPES is managing the data handoffs between models in an integrated modeling project. PIPES manages these handoffs across both the planning (Level 1) and the operations (Level 2) of a project. In general, a handoff should be specified for each dataset a model plans to provide to other models, and for each unique exchange of that dataset.
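The rule of thumb above can be sketched as an enumeration. The example assumes that a "unique exchange" means one delivery of a dataset to one consuming model; the `Handoff` class, the `plan_handoffs` helper, the dataset names, and the models `model_b` and `model_c` are hypothetical and not taken from PIPES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    from_model: str
    to_model: str
    dataset: str


def plan_handoffs(provides):
    """Expand {producer: {dataset: [consumers]}} into one Handoff per exchange."""
    handoffs = []
    for producer, datasets in provides.items():
        for dataset, consumers in datasets.items():
            for consumer in consumers:
                handoffs.append(Handoff(producer, consumer, dataset))
    return handoffs


# Hypothetical plan: dsgrid provides two datasets to two downstream models.
plan = {
    "dsgrid": {
        "load_profiles": ["model_b", "model_c"],
        "peak_demand": ["model_b"],
    }
}

handoffs = plan_handoffs(plan)
```

Here one dataset shared with two models yields two handoffs, so each exchange can carry its own expectations and schedule.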