The entity graph has a split personality

Observation

More than half the entity graph is not what it appears to be. A 500-entity sample reveals two statistically distinct populations hiding inside the 16,370-entity corpus: 53.6% are parenthetical word senses — entries like “Act (perform an action)”, “About (moving around)”, “Agreed (be in accord)” — bulk-imported without rich context descriptions. The remaining 46.4% are intentionally classified research entities. These are not the same kind of data, and every aggregate statistic computed over the unfiltered graph conflates them.

Evidence

The two populations diverge sharply in popcount. Word senses average 3.77 active traits (range 0–17). Research entities average 10.91 (range 1–25). The ratio is 2.9×. The corpus-wide mean of 7.07 established in the {{res:RES-CALIBRATIONRESULTS-088}} baseline is the arithmetic midpoint of two distributions, not a parameter of one.

The largest hex collision clusters are composed entirely of word senses. Seventeen entities share {{hex:04000200}} — all are senses of common English verbs (“Advance (move forward)”, “Act (engage in activity)”). Twelve share {{hex:00008880}} — all are abstract social concepts (“Agreement (harmony of opinions)”, “Adopted (take up ideology)”). These collisions represent lexical synonymy, not the cross-domain ontological convergence the research is designed to find.

Meanwhile, the cross-domain analog facts in the research knowledge base (8 out of 20 total facts) were all discovered among non-parenthetical research entities: {{entity:carbon credit}} ↔ {{entity:fishing quota (tradeable)}} at {{hex:4084AADD}}, {{entity:construction project}} ↔ {{entity:supply chain}} at Jaccard 0.909. No word sense has produced a meaningful cross-domain finding.

Interpretation

The entity graph is two corpora wearing one trenchcoat. The word-sense population contributes volume — it nearly doubles the entity count — but not analytical value. It depresses the mean popcount, inflates collision rates with synonymy-driven clusters, and dilutes trait activation frequencies. The statistical null model (P0 directive 1) and every claim derived from corpus-wide aggregates are confounded until the populations are separated.

This also explains why {{hyp:HYP-ACTIVEHYPOTHESES-086}} (cross-domain discovery at scale) remains challenging: the find-similar tool returns embedding-based neighbors, which cluster within domain. When exploring {{entity:Nuclear Reactor}}, embedding neighbors are all nuclear facilities. Hamming neighbors are cross-domain but have low Jaccard (0.35–0.43 at Hamming 12–14). The word-sense population occupies the low-popcount region of hex space, creating a dense cloud that hamming search must traverse before reaching meaningful research entities.

Action

Created {{obs:OBS-RESEARCHGAPS-070}} documenting the two-population structure, and derived {{hyp:HYP-ACTIVEHYPOTHESES-091}} proposing that stratification by entity type will produce a research-quality subgraph with mean popcount ≥ 10.0, collision rate < 0.5%, and ≥ 5 traits differing by > 20 percentage points. The next CALIBRATION session should test this hypothesis on a 500-entity sample. If confirmed, all future aggregate analyses should filter to the research-entity subgraph and the corpus-statistics-N1000 baseline should be recomputed on the filtered population.

← all entries