CLEAVER

Hierarchical Text Partitioning & Analysis

Cleaver Interface

Abstract

CLEAVER introduces a hierarchical approach to categorizing and partitioning English text. Where traditional syntactic parsers label grammatical structure, CLEAVER partitions text into units that carry independent semantic value, and it offers a wider range of linguistic units for analysis. By identifying the largest meaningful span of tokens within a text, CLEAVER's hierarchical framework provides a segmentation system for use in text generation and analysis.

Background

Recent years have seen Transformer-based neural networks replace rule-based and statistical models in Natural Language Processing (NLP) and semantic parsing. Despite these advances, there remains no comprehensive framework for semantic analysis, which makes it difficult to carry machine learning advances over into educational technologies. Traditional syntactic parsers provide limited insight into the semantic structure of text, so researchers seeking semantic analysis often construct custom partitioning systems or rely on specialized transformer models.

Importance

CLEAVER addresses limitations in current NLP approaches by providing a hierarchical semantic partitioning system. This enables closer analysis of the text data itself and allows language models to be specialized for specific writing tasks, such as analyzing semantic patterns in student writing or argumentation.

Technical Implementation

Python, NLTK, spaCy, NLP, Compositional Semantics, Hierarchical Processing

CLEAVER draws on the principle of compositionality from Compositional Semantics, by which the meaning of a phrase is determined by the meanings of its subphrases and the way they are combined.

The system begins partitioning at the highest level of compositional meaning (compound-complex sentences) and continues down to simple sentences, noun complexes, and eventually the word-token level.
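The top-down descent described above can be illustrated with a minimal sketch. It is not CLEAVER's implementation: the three levels (`sentence`, `clause`, `token`), the `Node` class, and the regular expressions are all hypothetical stand-ins chosen so the example runs with the standard library alone. A real system would presumably detect boundaries with a parser such as spaCy rather than with punctuation patterns.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # hypothetical level name, not a CLEAVER unit
    text: str                       # the span of text this node covers
    children: list = field(default_factory=list)


# Levels ordered from the largest span to the smallest. Each regex is a crude
# stand-in for a real boundary detector and is an assumption of this sketch.
LEVELS = [
    ("sentence", re.compile(r"(?<=[.!?])\s+")),
    ("clause", re.compile(r"\s*[,;]\s*(?:(?:and|but|or)\s+)?")),
    ("token", re.compile(r"\s+")),
]


def partition(text, depth=0, label="text"):
    """Recursively split `text`, one level of LEVELS per recursion step."""
    node = Node(label, text)
    if depth < len(LEVELS):
        child_label, splitter = LEVELS[depth]
        for part in splitter.split(text):
            if part:  # skip empty strings left by leading or trailing matches
                node.children.append(partition(part, depth + 1, child_label))
    return node


def outline(node, indent=0):
    """Render the tree as indented lines for inspection."""
    lines = ["  " * indent + f"{node.label}: {node.text}"]
    for child in node.children:
        lines.extend(outline(child, indent + 1))
    return lines


tree = partition("The dog barked, and the cat ran. Birds sang.")
print("\n".join(outline(tree)))
```

The point of the sketch is the control flow: each span is split only by the boundary detector for its own level, and the resulting parts are handed to the next level down until word tokens are reached.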

CLEAVER Semantic Idea-Units

Cleaver Algorithm Diagram

Hierarchical structure of text partitioning in Cleaver

| Attribute | Abbreviation | Description |
|---|---|---|
| Compound-Complex Sentence | CCS | Contains multiple Nominal Subjects (SH), an indefinite number of Verb-Object units (SB), and an indefinite number of Asides (A). |
| Compound Sentence | CS | Contains multiple Nominal Subjects (SH), an indefinite number of Verb-Object units (SB), and no Asides (A). |
| Complex Sentence | XS | Contains a single Nominal Subject (SH), an indefinite number of Verb-Object units (SB), and an indefinite number of Asides (A). |
| Aside | A | Extraneous information signified by surrounding punctuation. |
| Sentence | S | Composition of a Structural Phrase, a Nominal Subject (SH), and an indefinite number of Verb-Object units. |
| Structural Phrase | SP | Adjectival, Adverbial, Prepositional, or Nominal information that precedes the Nominal Subject (SH). |
| Simple Sentence | SS | Composition of one Nominal Subject (SH) and an indefinite number of Verb-Object units. |
| Sentence Head | SH | Nominal Subject of a Simple Sentence. |
| Sentence Body | SB | The total span of Verb-Object units within a Simple Sentence, containing at least one Verb. |
| Appendage | A | An individual Verb-Object unit within a Sentence Body, in reference to the Sentence Head. |
| Appendage Head | AH | The Verb, or set of actions, conducted by the Nominal Subject (SH). |
| Appendage Body | AB | The object of the set of actions specified in the Appendage Head. |
| Subordinating Clause | SC | A Clause introduced by a Subordinating Conjunction. |
| Compound-Complex Noun | CCN | Two or more Nouns joined together by a Linking Phrase and a Preposition. |
| Compound Noun | CN | Two or more Nouns joined together by a Linking Phrase. |
| Complex Noun | XN | Two or more Nouns joined together by a Preposition. |
| Noun | N | Composition of a Qualifier, an Adjectival Phrase, and a Simple Noun. |
| Modified Noun | MN | An Adjectival Phrase linked to a Simple Noun. |
| Simple Noun | SN | An individual Pronoun, Plural Noun, or Singular Noun. |
| Compound-Complex Verb | CCV | Two or more Verbs joined by a Linking Phrase, at least one of which consists of two or more non-auxiliary verbs. |
| Compound Verb | CV | Two or more Verbs joined by a Linking Phrase. |
| Complex Verb | XV | Two or more sequential non-auxiliary verbs. |
| Verb | V | A descriptive action or event that is modified by an auxiliary and an adverb. |
| Modified Verb | MV | An event or action modified by an Adverb. |
| Simple Verb | SV | An individual event or action not modified by an auxiliary or adverb. |
| Auxiliary Verb | AV | Modifies a verb and provides additional semantic and grammatical meaning. |
| Prepositional Phrase | PP | A descriptive phrase beginning with a Preposition. |
| Adjective Phrase | AP | One or more adjectives that modify a Singular Noun. |
| Qualifier | Q | Possessive Pronoun or Determiner that comes before a Noun. |
| Linking Phrase | LP | Punctuation or Conjunction that links two or more distinct objects. |

Complete corpus of 30 semantic idea-units in CLEAVER's partitioning system
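One way to hold these idea-units in code is as a small registry that records which units each unit is composed of. The sketch below is an assumption about representation, not CLEAVER's own data model: the `IdeaUnit` class, the `UNITS` registry, and the `expand` helper are hypothetical, and only the Simple Sentence branch of the table is encoded. Units are keyed by full name because the table assigns the abbreviation A to both Aside and Appendage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdeaUnit:
    name: str                  # full name as given in the table
    abbreviation: str          # abbreviation as given in the table
    parts: tuple = ()          # names of the units this unit is composed of


# Only the Simple Sentence branch is encoded here, following the table's
# descriptions; the remaining units would be added the same way.
UNITS = {
    unit.name: unit
    for unit in [
        IdeaUnit("Simple Sentence", "SS", ("Sentence Head", "Sentence Body")),
        IdeaUnit("Sentence Head", "SH"),
        IdeaUnit("Sentence Body", "SB", ("Appendage",)),
        IdeaUnit("Appendage", "A", ("Appendage Head", "Appendage Body")),
        IdeaUnit("Appendage Head", "AH"),
        IdeaUnit("Appendage Body", "AB"),
    ]
}


def expand(name, depth=0):
    """Return indented outline lines for a unit and, recursively, its parts."""
    unit = UNITS[name]
    lines = ["  " * depth + f"{unit.name} ({unit.abbreviation})"]
    for part in unit.parts:
        lines.extend(expand(part, depth + 1))
    return lines


print("\n".join(expand("Simple Sentence")))
```

Storing composition as names rather than nested objects keeps the registry flat, so a unit that appears under several parents is defined once.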

Integration with Lexical-Semantic Resources

Future development of CLEAVER will integrate cognitive and psychological word categories from established lexical resources to enhance its semantic analysis capabilities.

This integration will enable CLEAVER to not only partition text according to its hierarchical semantic structure but also to estimate underlying psychological and cognitive dimensions of the text, bridging computational linguistics with cognitive psychology.
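As a rough illustration of what estimating such dimensions could look like, the sketch below counts how often words from a category lexicon occur in a text. Everything in it is assumed for illustration: `TOY_LEXICON` is a hand-made five-word dictionary, not an excerpt from any established resource, and `category_profile` is a hypothetical helper. In the planned integration, the counts would presumably be computed per partitioned unit rather than over the whole text.

```python
import re
from collections import Counter

# Hypothetical word-to-category mapping, invented for this example only.
TOY_LEXICON = {
    "think": "cognition",
    "know": "cognition",
    "because": "causation",
    "happy": "positive_emotion",
    "afraid": "negative_emotion",
}


def category_profile(text):
    """Return each category's share of all word tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON)
    return {category: n / len(tokens) for category, n in counts.items()}


profile = category_profile("I think she is happy because I know her")
for category, share in sorted(profile.items()):
    print(f"{category}: {share:.2f}")
```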

Key Conclusions

  1. CLEAVER offers a novel framework for continued research in optimizing small transformer-based models for text generation and analysis.
  2. By segmenting texts into attributable semantic categories, CLEAVER allows probabilistic relationships to be estimated at the semantic level, strengthening statistical relationships between words and phrases that are rarely seen together in limited training datasets.
  3. This approach increases compositional variability and prompting control in text generation by allowing models to select from semantic structures before generating text.
  4. CLEAVER is intended to reduce statistical bias within training sets and strengthen relationships between semantic categories, pushing models to "think" beyond individual words and expand the grammars they accept.
  5. The system provides researchers greater control over text composition in generative models and enables specialized analysis of specific writing tasks.
  6. The integration with cognitive and psychological lexical resources will further enhance CLEAVER's capability to constrain and estimate semantic meaning, bridging computational linguistics with cognitive psychology.
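The idea of probabilistic relationships at the semantic level can be made concrete with a small sketch that estimates how likely one category label is to follow another. This is an assumed formulation (a first-order transition model over label sequences), not a method the project describes, and the label sequences below are invented examples that reuse abbreviations from the table.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next label | current label) from labelled sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        current: {
            following: n / sum(followers.values())
            for following, n in followers.items()
        }
        for current, followers in counts.items()
    }


# Hypothetical label sequences, one per partitioned sentence.
labelled = [
    ["SP", "SH", "SB"],
    ["SH", "SB"],
    ["SH", "SC"],
]
probabilities = transition_probabilities(labelled)
print(probabilities["SH"])
```

Because there are far fewer category labels than distinct words, each transition is observed more often in a small dataset than any individual word pair would be, which is the intuition behind conclusion 2.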

Further research in text prompting and categorization is needed to validate CLEAVER's utility in text generation and analysis.