
Data Model

clusters / dimensions
api_resources
objects
versions + blobs
latest_raw_index
latest_index
object_edges
object_facts
object_changes
ingestion_offsets
maintenance_runs

Use numeric IDs internally to reduce repeated text in large fact and edge tables.

clusters(id, name, uid, source, created_at)
api_resources(
id,
api_group,
api_version,
resource,
kind,
namespaced,
preferred_version,
storage_version,
verbs
)
object_kinds(id, api_resource_id, api_group, api_version, kind)
edge_types(id, name)
fact_keys(id, family, key, value_type)
resource_processing_profiles(
api_resource_id,
profile,
retention_class,
filter_chain,
extractor_set,
compaction_strategy,
priority,
max_event_buffer,
enabled
)

api_resources is the discovery-backed GroupVersionResource table. It is the authoritative mapping used by the collector and authorizer when they need the Kubernetes group/resource/scope tuple for list/watch and for SubjectAccessReview (SAR) and SelfSubjectAccessReview (SSAR) checks. object_kinds can remain as a compact dimension for query tables, but it must link back to api_resources.

Required fields for authorization and watch management:

api_group
api_version
resource
kind
namespaced
verbs

Optional discovery metadata:

preferred_version
storage_version
last_discovered_at
removed_at

resource_processing_profiles stores per-resource behavior. The default profile uses the generic history path. High-volume or high-value resources can use a specialized profile without changing the logical storage contract. The SQLite default backend writes a profile row during discovery and when it first stores an observation, then reports the API resource count and stored version count per profile in validation output.

Initial profiles:

generic
pod_fast_path
node_summary
event_rollup
endpointslice_topology
lease_skip_or_downsample
secret_metadata_only

Required default edge_types:

service_selects_pod
endpointslice_targets_pod
pod_owned_by_replicaset
replicaset_owned_by_deployment
pod_on_node
pod_uses_configmap
pod_uses_secret
workload_uses_pvc

Required default fact_keys:

pod_status.reason
pod_status.last_reason
pod_status.restart_count
pod_status.ready
pod_status.phase
pod_status.qos_class
pod_placement.node_assigned
pod_placement.started
pod_placement.deleted
workload_rollout.deployment_generation
workload_rollout.replicaset_hash
workload_config.image
workload_config.memory_request
workload_config.memory_limit
workload_config.cpu_request
workload_config.cpu_limit
workload_config.probe_changed
node_condition.Ready
node_condition.MemoryPressure
node_condition.DiskPressure
node_condition.PIDPressure
node_status.taint
node_status.capacity
node_status.allocatable
k8s_event.type
k8s_event.reason
k8s_event.message_fingerprint
k8s_event.message_preview
k8s_event.action
k8s_event.reporting_controller
k8s_event.reporting_instance
k8s_event.count
k8s_event.series_count
service.type
service.cluster_ip
service.load_balancer.pending
service.load_balancer.ingress_count
service.load_balancer.ingress_ip
service.load_balancer.ingress_hostname
service.deleted
endpoint.ready
endpoint.serving
endpoint.terminating
endpoint.membership
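
As an illustration of the numeric-ID convention, the sketch below interns dotted fact key names into a fact_keys table and returns a stable integer id. It assumes that the family is the part before the first dot (so service.load_balancer.pending has family service and key load_balancer.pending), that (family, key) is unique, and that value_type defaults to "string". The helper names open_db and intern_fact_key are hypothetical and not part of kube-insight.

```python
import sqlite3

# Hypothetical subset of the default fact keys listed above.
DEFAULT_FACT_KEYS = [
    "pod_status.reason",
    "pod_status.restart_count",
    "service.load_balancer.pending",
]


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a minimal fact_keys table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE fact_keys (
            id INTEGER PRIMARY KEY,
            family TEXT NOT NULL,
            key TEXT NOT NULL,
            value_type TEXT NOT NULL,
            UNIQUE (family, key)
        )
        """
    )
    return conn


def intern_fact_key(conn: sqlite3.Connection, dotted: str,
                    value_type: str = "string") -> int:
    """Return the numeric id for a dotted fact key, creating it if needed."""
    # Assumption: the family is everything before the first dot.
    family, _, key = dotted.partition(".")
    conn.execute(
        "INSERT OR IGNORE INTO fact_keys (family, key, value_type) "
        "VALUES (?, ?, ?)",
        (family, key, value_type),
    )
    row = conn.execute(
        "SELECT id FROM fact_keys WHERE family = ? AND key = ?",
        (family, key),
    ).fetchone()
    return row[0]
```

Fact rows can then carry only the integer fact_key_id, and the text of the key is stored once.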

objects is the stable identity table.

objects(
id,
cluster_id,
kind_id,
namespace,
name,
uid,
latest_version_id,
first_seen_at,
last_seen_at,
deleted_at
)

Identity rules:

  • Prefer Kubernetes metadata.uid when available.
  • Use cluster/kind/namespace/name as the human-readable key.
  • Track delete/recreate as different objects if UID changes.
  • For resources without UID, fall back to namespaced identity.
  • deleted_at is the time kube-insight observed a Kubernetes delete event for the object. It is not copied from metadata.deletionTimestamp, which records Kubernetes graceful deletion intent before the object is actually removed.
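
The identity rules above can be sketched as a single lookup-key function. This is a minimal illustration, not the project's implementation: the ObservedObject type and identity_key name are hypothetical, and the tuple layout is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ObservedObject:
    """Hypothetical in-memory view of one observed Kubernetes object."""
    cluster_id: int
    kind_id: int
    namespace: Optional[str]  # None for cluster-scoped resources
    name: str
    uid: Optional[str]        # metadata.uid, if the resource has one


def identity_key(obj: ObservedObject) -> Tuple:
    """Return the key used to find or create the objects row."""
    if obj.uid:
        # Prefer the UID: delete/recreate yields a new UID, so the
        # recreated object maps to a different objects row.
        return ("uid", obj.cluster_id, obj.uid)
    # Fallback for resources without a UID: namespaced identity.
    return ("name", obj.cluster_id, obj.kind_id, obj.namespace or "", obj.name)
```

Two observations of the same namespace and name with different UIDs produce different keys, which matches the delete/recreate rule.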

versions stores the reconstructable resource history.

versions(
id,
object_id,
seq,
observed_at,
resource_version,
generation,
doc_hash,
materialization,
strategy,
blob_ref,
parent_version_id,
raw_size,
stored_size,
replay_depth,
summary
)

materialization values:

full
reverse_delta
cdc_manifest

strategy values:

full_zstd
json_patch_zstd
cdc_zstd

blobs(
digest,
codec,
raw_size,
stored_size,
data
)

The blob layer should be content-addressed. It can later move from SQL storage to object storage without changing the logical model.
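
A minimal sketch of a content-addressed blob layer follows. It assumes SHA-256 digests with a "sha256:" prefix and uses zlib from the Python standard library as a stand-in codec; the document's strategies are zstd-based, and the digest format is not specified in the source. The function names put_blob and get_blob are hypothetical.

```python
import hashlib
import sqlite3
import zlib

BLOB_SCHEMA = """
CREATE TABLE blobs (
    digest TEXT PRIMARY KEY,
    codec TEXT NOT NULL,
    raw_size INTEGER NOT NULL,
    stored_size INTEGER NOT NULL,
    data BLOB NOT NULL
)
"""


def put_blob(conn: sqlite3.Connection, raw: bytes) -> str:
    """Store raw bytes under their content digest and return the digest."""
    # Assumption: SHA-256 over the uncompressed bytes identifies the blob.
    digest = "sha256:" + hashlib.sha256(raw).hexdigest()
    data = zlib.compress(raw)  # stand-in for a zstd codec
    # Identical content maps to the same digest, so a repeat write is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO blobs (digest, codec, raw_size, stored_size, data) "
        "VALUES (?, ?, ?, ?, ?)",
        (digest, "zlib", len(raw), len(data), data),
    )
    return digest


def get_blob(conn: sqlite3.Connection, digest: str) -> bytes:
    """Load and decompress a blob by digest."""
    row = conn.execute(
        "SELECT codec, data FROM blobs WHERE digest = ?", (digest,)
    ).fetchone()
    if row is None:
        raise KeyError(digest)
    codec, data = row
    if codec != "zlib":
        raise ValueError(f"unsupported codec: {codec}")
    return zlib.decompress(data)
```

Because callers only hold the digest (the blob_ref in versions), the same two functions could later be backed by object storage without touching the rows that reference them.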

object_facts, object_edges, and object_changes are derived from retained JSON versions. When extractor sets or resource profiles change, run kube-insight db reindex to rebuild those derived rows from versions and blobs without re-watching the cluster. The command is dry-run by default; use --yes to apply changes in small object batches.

Latest data is split into two query surfaces:

  • latest_raw_index / latest_raw_documents: latest observed sanitized cluster snapshot. This preserves runtime fields such as resourceVersion, generation, Event counters, and controller heartbeat values. Secret payload values are still redacted; key names can be retained.
  • latest_index / latest_documents: latest retained history proof. This points at the newest normalized versions row and can intentionally omit high-churn fields filtered before retained hashing.

latest_raw_index(
object_id,
cluster_id,
kind_id,
namespace,
name,
uid,
observed_at,
observation_type,
resource_version,
generation,
doc_hash,
raw_size,
doc
)

latest_index(
object_id,
cluster_id,
kind_id,
namespace,
name,
uid,
latest_version_id,
observed_at
)

Use latest_raw_documents when a human or agent needs the current observed cluster shape. Use latest_documents when the question needs the latest retained proof document. latest_index remains rebuildable from versions; latest_raw_index is overwritten by future observations and is not historical proof. Deleted objects are removed from latest_raw_index; delete history remains available through observations and retained versions.
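
To illustrate why latest_index is rebuildable, the sketch below regenerates it from objects and versions alone. It assumes that the row with the highest seq per object is the newest retained version and uses trimmed-down tables; whether deleted objects stay in latest_index is not stated in the source, so the sketch does not filter them. The name rebuild_latest_index is hypothetical.

```python
import sqlite3

LATEST_SCHEMA = """
CREATE TABLE objects (
    id INTEGER PRIMARY KEY, cluster_id INTEGER, kind_id INTEGER,
    namespace TEXT, name TEXT, uid TEXT
);
CREATE TABLE versions (
    id INTEGER PRIMARY KEY, object_id INTEGER, seq INTEGER, observed_at TEXT
);
CREATE TABLE latest_index (
    object_id INTEGER PRIMARY KEY, cluster_id INTEGER, kind_id INTEGER,
    namespace TEXT, name TEXT, uid TEXT,
    latest_version_id INTEGER, observed_at TEXT
);
"""


def rebuild_latest_index(conn: sqlite3.Connection) -> None:
    """Recreate latest_index from retained versions in one transaction."""
    with conn:
        conn.execute("DELETE FROM latest_index")
        # Assumption: the highest seq per object is the newest retained version.
        conn.execute(
            """
            INSERT INTO latest_index (
                object_id, cluster_id, kind_id, namespace, name, uid,
                latest_version_id, observed_at
            )
            SELECT o.id, o.cluster_id, o.kind_id, o.namespace, o.name, o.uid,
                   v.id, v.observed_at
            FROM objects AS o
            JOIN versions AS v ON v.object_id = o.id
            WHERE v.seq = (
                SELECT MAX(seq) FROM versions WHERE object_id = o.id
            )
            """
        )
```

No equivalent rebuild exists for latest_raw_index in this sketch, since its rows come from live observations rather than retained versions.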

object_edges stores time-valid graph edges.

object_edges(
id,
cluster_id,
edge_type,
src_id,
dst_id,
valid_from,
valid_to,
src_version_id,
dst_version_id,
confidence,
detail
)

open_edges tracks currently active edges for efficient ingestion:

open_edges(
cluster_id,
edge_type,
src_id,
dst_id,
edge_id
)

Only write edge rows when relationships change.
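
The write-on-change rule can be sketched as a reconcile step that compares the currently observed targets of one source object with open_edges. This is an illustration under assumptions: edges are reconciled per (cluster, edge type, source), timestamps are ISO strings, only a subset of object_edges columns is modeled, and the name reconcile_edges is hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

EDGE_SCHEMA = """
CREATE TABLE object_edges (
    id INTEGER PRIMARY KEY,
    cluster_id INTEGER NOT NULL,
    edge_type INTEGER NOT NULL,
    src_id INTEGER NOT NULL,
    dst_id INTEGER NOT NULL,
    valid_from TEXT NOT NULL,
    valid_to TEXT
);
CREATE TABLE open_edges (
    cluster_id INTEGER NOT NULL,
    edge_type INTEGER NOT NULL,
    src_id INTEGER NOT NULL,
    dst_id INTEGER NOT NULL,
    edge_id INTEGER NOT NULL,
    PRIMARY KEY (cluster_id, edge_type, src_id, dst_id)
);
"""


def reconcile_edges(conn: sqlite3.Connection, cluster_id: int, edge_type: int,
                    src_id: int, current_dst_ids: Iterable[int],
                    ts: str) -> Tuple[int, int]:
    """Open new edges and close vanished ones; return (opened, closed)."""
    current = set(current_dst_ids)
    open_now = dict(conn.execute(
        "SELECT dst_id, edge_id FROM open_edges "
        "WHERE cluster_id = ? AND edge_type = ? AND src_id = ?",
        (cluster_id, edge_type, src_id),
    ))
    open_dsts = set(open_now)
    opened = closed = 0
    with conn:
        for dst_id in current - open_dsts:
            cur = conn.execute(
                "INSERT INTO object_edges "
                "(cluster_id, edge_type, src_id, dst_id, valid_from) "
                "VALUES (?, ?, ?, ?, ?)",
                (cluster_id, edge_type, src_id, dst_id, ts),
            )
            conn.execute(
                "INSERT INTO open_edges VALUES (?, ?, ?, ?, ?)",
                (cluster_id, edge_type, src_id, dst_id, cur.lastrowid),
            )
            opened += 1
        for dst_id in open_dsts - current:
            conn.execute(
                "UPDATE object_edges SET valid_to = ? WHERE id = ?",
                (ts, open_now[dst_id]),
            )
            conn.execute(
                "DELETE FROM open_edges WHERE cluster_id = ? AND edge_type = ? "
                "AND src_id = ? AND dst_id = ?",
                (cluster_id, edge_type, src_id, dst_id),
            )
            closed += 1
    return opened, closed
```

Repeated observations with an unchanged target set touch no rows, so edge volume tracks relationship changes rather than watch event volume.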

object_facts stores queryable incident evidence.

object_facts(
id,
cluster_id,
ts,
object_id,
version_id,
kind_id,
namespace,
name,
node_id,
workload_id,
service_id,
fact_key_id,
fact_value,
numeric_value,
severity,
detail
)

Keep detail small. Full JSON belongs in versions.

object_changes stores small timeline entries used by the UI and investigation ranking.

object_changes(
id,
cluster_id,
ts,
object_id,
version_id,
change_family,
path,
op,
old_scalar,
new_scalar,
severity
)

This is not the full diff. It is a query aid.
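
One way such timeline entries could be derived is a recursive comparison of two retained documents that emits one entry per changed scalar. The sketch below is an assumption-heavy illustration: the path format ("/spec/replicas"), the op names (add, remove, replace), and positional list comparison are not specified in the source, and scalar_changes is a hypothetical name. A value that switches between scalar and container is only reported for its container side.

```python
from typing import Any, Iterator, Optional, Tuple

Change = Tuple[str, str, Any, Any]  # (path, op, old_scalar, new_scalar)
_MISSING = object()


def _children(node: Any) -> Optional[dict]:
    """Return child nodes keyed by string, or None for scalars."""
    if isinstance(node, dict):
        return node
    if isinstance(node, list):
        # Assumption: list items are compared by position.
        return {str(i): v for i, v in enumerate(node)}
    return None


def scalar_changes(old: Any, new: Any, path: str = "") -> Iterator[Change]:
    """Yield one entry per scalar that was added, removed, or replaced."""
    old_kids, new_kids = _children(old), _children(new)
    if old_kids is None and new_kids is None:
        if old is _MISSING:
            yield (path, "add", None, new)
        elif new is _MISSING:
            yield (path, "remove", old, None)
        elif old != new:
            yield (path, "replace", old, new)
        return
    # At least one side is a container; treat the other side as empty.
    old_kids = old_kids or {}
    new_kids = new_kids or {}
    for key in sorted(set(old_kids) | set(new_kids)):
        yield from scalar_changes(
            old_kids.get(key, _MISSING),
            new_kids.get(key, _MISSING),
            f"{path}/{key}",
        )
```

Each yielded tuple maps onto the path, op, old_scalar, and new_scalar columns; change_family and severity would come from the extractor set and are out of scope here.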

ingestion_offsets(
cluster_id,
api_resource_id,
namespace,
resource_version,
last_list_at,
last_watch_at,
last_bookmark_at,
status,
error,
updated_at
)

For cluster-scoped resources, namespace is null. For namespaced resources, namespace is null for an all-namespaces watch and set to the namespace name for a namespace-scoped watch. Offsets let the collector resume and make gap detection explicit.

The same table powers watch health:

kube-insight db resources health --errors-only
kube-insight db resources health --stale-after 5m

Health output is intended for humans, automation, and agents to decide whether an evidence answer is based on fresh, complete watch data or on a partial or stale resource stream.
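
As a rough illustration of how an ingestion_offsets row could be classified, the sketch below treats a stream as stale when its most recent list, watch, or bookmark activity is older than a threshold. The status value "error", the returned labels, and the rule itself are assumptions for illustration; the actual logic behind kube-insight db resources health is not described in the source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stream_health(status: str,
                  error: Optional[str],
                  last_list_at: Optional[datetime],
                  last_watch_at: Optional[datetime],
                  last_bookmark_at: Optional[datetime],
                  now: datetime,
                  stale_after: timedelta = timedelta(minutes=5)) -> str:
    """Classify one offset row as "error", "stale", or "fresh"."""
    # Assumption: a recorded error or an "error" status wins over freshness.
    if error or status == "error":
        return "error"
    seen = [t for t in (last_list_at, last_watch_at, last_bookmark_at)
            if t is not None]
    # No activity at all, or the newest activity is too old.
    if not seen or now - max(seen) > stale_after:
        return "stale"
    return "fresh"
```

An agent could refuse to present an evidence answer as complete when any resource stream it depends on is not "fresh".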

High-churn watch ingestion creates dead rows and index bloat in SQL backends. Maintenance policy is part of the data model, not an operational afterthought.

Track maintenance runs:

maintenance_runs(
id,
cluster_id,
backend,
task,
started_at,
finished_at,
status,
rows_scanned,
rows_changed,
bytes_before,
bytes_after,
error
)

Required tasks:

compact_versions
compact_edges
purge_retention
rebuild_derived_indexes
vacuum_or_analyze

SQLite should run incremental vacuum (when enabled), wal_checkpoint, and ANALYZE after large ingestion or purge jobs. ClickHouse should report active and inactive part footprint, compression ratio, and merge pressure during live profiles. Future OLTP metadata backends such as PostgreSQL or CockroachDB, if added, should use their native vacuum/analyze or bloat-management workflows.
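
A minimal sketch of the SQLite sequence using Python's sqlite3 module is shown below. It uses the standard SQLite pragmas auto_vacuum, incremental_vacuum, wal_checkpoint, page_count, and page_size; the function name run_sqlite_maintenance, the TRUNCATE checkpoint mode, and the returned field names are assumptions, chosen so the result could feed a maintenance_runs row.

```python
import sqlite3


def _db_bytes(conn: sqlite3.Connection) -> int:
    """Database size in bytes as page_count * page_size."""
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return pages * page_size


def run_sqlite_maintenance(conn: sqlite3.Connection) -> dict:
    """Run incremental vacuum (if enabled), a WAL checkpoint, and ANALYZE."""
    conn.commit()  # a checkpoint cannot run inside an open write transaction
    bytes_before = _db_bytes(conn)

    # auto_vacuum: 0 = none, 1 = full, 2 = incremental.
    if conn.execute("PRAGMA auto_vacuum").fetchone()[0] == 2:
        # Consume the cursor so the pragma runs to completion.
        conn.execute("PRAGMA incremental_vacuum").fetchall()

    # Returns (busy, wal_frames, checkpointed_frames); the frame counts
    # are -1 when the database is not in WAL mode.
    busy, wal_frames, checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()

    conn.execute("ANALYZE")  # refresh query planner statistics

    return {
        "task": "vacuum_or_analyze",
        "bytes_before": bytes_before,
        "bytes_after": _db_bytes(conn),
        "checkpoint_busy": busy,
        "wal_frames": wal_frames,
        "wal_checkpointed": checkpointed,
    }
```

Incremental vacuum only has an effect if auto_vacuum was set to INCREMENTAL before the first table was created or before a full VACUUM, which is why the sketch checks the mode rather than assuming it.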