Use this URL to cite or link to this record in EThOS:
Title: Grammar-based preprocessing for PPM compression and classification
Author: Aljehane, Nojood Obaidallah M.
ISNI:       0000 0004 7432 2117
Awarding Body: Bangor University
Current Institution: Bangor University
Date of Award: 2018
Availability of Full Text:
Access from EThOS:
Access from Institution:
The aim of this study is to investigate the efficiency of novel methods using context-free grammars and Prediction by Partial Matching (PPM) in order to build and evaluate the quality of compression models for text files such as English, Arabic, Persian,Welsh and Chinese. A further aim is then to apply these models to the problem of the classification of text to see how well they perform at this application. We apply grammar-based pre-processing prior to using the PPM compression algorithm. The methods achieve significantly better compression for different natural language texts compared with other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then substitutes these sequences using the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to compression. We describe further improvements using a two-pass scheme where grammar-based preprocessing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression when compared with other compression algorithms, including a grammar-based approach known as the Sequitur algorithm. Despite the advances of the PPM method in predicting upcoming symbols or words in the English language, more research is required to devise better compression methods for other languages, such as Arabic due to, for example, the rich morphological nature of Arabic text, in which a single word can take many different forms. In this dissertation, we propose a new method that achieves the best compression rates not only for Arabic language text but also for other languages that use Arabic script in their writing systems, such as Persian. Our word-based method (GRW-PPM) constructs a contextfree grammar for the text; this grammar is then encoded using PPM to achieve excellent compression rates. Finally, we investigate the classification of genre in English and Arabic text by using our new character-based text compression scheme (GRB-PPM). Experimental results on a parallel Arabic and English corpus show that our new method is very effective compared with traditional compression-based classification methods. We have also confirmed that good compression leads to good classification.
Supervisor: Teahan, William Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Computer Science