CLEAVER: Hierarchical Text Partitioning & Analysis

A Novel Approach to Semantic Parsing and Text Analysis

Muhammad Fusenig | 2023

Abstract

CLEAVER introduces a novel, hierarchical approach to categorizing and partitioning English text. It focuses on independent semantic value, which distinguishes it from traditional syntactic parsers, and it offers an expansive range of linguistic units for text analysis. By identifying the largest meaningful span of tokens within a text, CLEAVER's hierarchical framework provides a system of textual segmentation for use in text generation and analysis.

Background

Recent years have seen Transformer-based neural networks replace rule-based and statistical models in Natural Language Processing (NLP) and semantic parsing. Despite these advancements, there remains no comprehensive framework for semantic analysis, which makes it difficult to translate machine learning advancements into educational technologies. Traditional syntactic parsers provide limited insight into the semantic structure of text, and researchers seeking semantic analysis often construct custom partitioning systems or use specialized transformer models.

CLEAVER Hierarchical Framework

Figure 1: Visualization of CLEAVER's hierarchical text partitioning approach

Importance

CLEAVER addresses several critical limitations in current NLP approaches:

By providing a hierarchical semantic partitioning system, CLEAVER enables better analysis of text data itself, allowing language models to be specialized for specific writing tasks such as analyzing semantic patterns in student writing or argumentation.

Technical Implementation

CLEAVER draws from the principle of compositionality in Compositional Semantics, by which the aggregate meaning of a phrase is determined by the composition of its subphrases. The system includes:

The system begins partitioning at the highest level of compositional meaning (compound-complex sentences) and continues down to simple sentences, noun complexes, and eventually the word-token level.
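The top-down order described above can be pictured as a tree whose root is the largest unit and whose leaves are word tokens. The following is a minimal sketch only, not CLEAVER's implementation: the `Unit` class, the `top_down` helper, and the `TOKEN` label are hypothetical names introduced for illustration, and the example tree is built by hand rather than produced by a parser.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One node in a hypothetical CLEAVER-style partition tree."""
    label: str   # unit abbreviation such as "SS" or "SH"; "TOKEN" is an assumed leaf label
    text: str    # the span of text covered by this unit
    children: list[Unit] = field(default_factory=list)


def top_down(root: Unit):
    """Yield (depth, unit) pairs breadth-first, so larger spans come before smaller ones."""
    queue = deque([(0, root)])
    while queue:
        depth, unit = queue.popleft()
        yield depth, unit
        queue.extend((depth + 1, child) for child in unit.children)


# Hand-built partition of one simple sentence (assumed segmentation, for illustration).
tree = Unit("SS", "The dog chased the ball", [
    Unit("SH", "The dog", [
        Unit("TOKEN", "The"),
        Unit("TOKEN", "dog"),
    ]),
    Unit("SB", "chased the ball", [
        Unit("TOKEN", "chased"),
        Unit("TOKEN", "the"),
        Unit("TOKEN", "ball"),
    ]),
])

for depth, unit in top_down(tree):
    print("  " * depth + f"{unit.label}: {unit.text}")
```

A breadth-first walk is used here because it visits every unit at one level before descending to the next, which mirrors the highest-level-first order of partitioning.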

CLEAVER Semantic Idea-Units

CLEAVER Hierarchical Structure

Figure 2: Hierarchical structure of text partitioning in CLEAVER

Unit | Abbreviation | Description
Compound-Complex Sentence | CCS | Contains multiple Nominal Subjects (SH), an indefinite number of Verb-Object units (SB), and an indefinite number of Asides (A).
Compound Sentence | CS | Contains multiple Nominal Subjects (SH), an indefinite number of Verb-Object units (SB), and no Asides (A).
Complex Sentence | XS | Contains a single Nominal Subject (SH), an indefinite number of Verb-Object units (SB), and an indefinite number of Asides (A).
Aside | A | Extraneous information signified by surrounding punctuation.
Sentence | S | Composition of a Structural Phrase (SP), a Nominal Subject (SH), and an indefinite number of Verb-Object units.
Structural Phrase | SP | Adjectival, Adverbial, Prepositional, or Nominal information that precedes the Nominal Subject (SH).
Simple Sentence | SS | Composition of one Nominal Subject (SH) and an indefinite number of Verb-Object units.
Sentence Head | SH | Nominal Subject of a Simple Sentence.
Sentence Body | SB | The total span of Verb-Object units within a Simple Sentence; contains at least one Verb.

Table 1: Core semantic idea-units in CLEAVER's partitioning system (showing 9 of 30 total units)
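One way to read the sentence-level rows of Table 1 is as a decision rule over the number of Sentence Heads, the number of Asides, and the presence of a Structural Phrase. The sketch below is an interpretation, not CLEAVER's published logic: the function name `classify_sentence` and its parameters are hypothetical, and it assumes that at least one Aside is what separates CCS from CS and XS from S or SS, since the table describes the Aside count only as "indefinite".

```python
def classify_sentence(num_heads: int, num_asides: int,
                      has_structural_phrase: bool = False) -> str:
    """Map unit counts to a sentence-level abbreviation from Table 1.

    Assumption: "indefinite Asides" is treated as "one or more Asides",
    so the rows of the table do not overlap.
    """
    if num_heads < 1:
        raise ValueError("a sentence needs at least one Sentence Head (SH)")
    if num_asides < 0:
        raise ValueError("the number of Asides cannot be negative")

    if num_heads > 1:
        # Multiple Nominal Subjects: compound-complex if any Aside, else compound.
        return "CCS" if num_asides > 0 else "CS"
    if num_asides > 0:
        # Single Nominal Subject with at least one Aside.
        return "XS"
    # Single Nominal Subject, no Asides: a Structural Phrase makes it a Sentence (S).
    return "S" if has_structural_phrase else "SS"


print(classify_sentence(num_heads=2, num_asides=1))
print(classify_sentence(num_heads=1, num_asides=0, has_structural_phrase=True))
```

The rule takes counts rather than raw text because identifying the heads, asides, and structural phrases is the partitioning step itself, which this sketch does not attempt.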

Integration with Lexical-Semantic Resources

Future development of CLEAVER will include integration with cognitive and psychological word categories from established lexical resources to enhance semantic analysis capabilities:

This integration will enable CLEAVER to not only partition text according to its hierarchical semantic structure but also to estimate underlying psychological and cognitive dimensions of the text, bridging computational linguistics with cognitive psychology.

Key Conclusions

  1. CLEAVER offers a novel framework for continued research in optimizing small transformer-based models for text generation and analysis.
  2. By segmenting texts into attributable semantic categories, CLEAVER determines probabilistic relationships at the semantic level, strengthening statistical relationships between words and phrases that are not typically observed in limited training datasets.
  3. This approach increases compositional variability and prompting control in text generation by allowing models to select from semantic structures before generating text.
  4. CLEAVER is intended to counteract statistical bias within training sets and strengthen relationships between semantic categories, pushing models to "think" beyond individual words and expand the range of acceptable grammars.
  5. The system provides researchers greater control over text composition in generative models and enables specialized analysis of specific writing tasks.
  6. The integration with cognitive and psychological lexical resources will further enhance CLEAVER's capability to constrain and estimate semantic meaning, bridging computational linguistics with cognitive psychology.

Continued research in text prompting and categorization is needed to further validate CLEAVER's utility in generative text models and text analysis.