Title: Corpus and sentiment analysis
Author: Cheng, Tai Wai David
ISNI:       0000 0001 3542 4354
Awarding Body: University of Surrey
Current Institution: University of Surrey
Date of Award: 2007
Information extraction/retrieval has been of interest to researchers since the early 1960s. A series of conferences and competitions held by DARPA/NIST since the late 1980s has resulted in the analysis of news reports and government reports in English and other languages, notably Chinese and Arabic. A number of methods have been developed for analysing 'free' natural language texts. Furthermore, a number of systems for understanding messages have been developed, focusing on named entity extraction and on templates for dealing with certain kinds of news. The templates were handcrafted, and a lot of ad hoc knowledge went into the creation of such systems. Seven of these systems have been reviewed. Although IE systems built for different tasks often differ from each other, the core elements are shared by nearly every extraction system. Some of these core elements, such as the parser and the part-of-speech (POS) tagger, are tuned for optimal performance in a specific domain or on text with pre-defined structures. The extensive use of gazetteers and manually crafted grammar rules further limits the portability of existing IE systems across languages and domains. The goal of this thesis is to develop an algorithm that can be used to extract information unambiguously from free texts, in our case from financial news text, and from arbitrary domains. We believe that corpus linguistics and statistical techniques are more appropriate and efficient for this task than approaches that rely on machine learning, POS taggers, parsers, and so on, which are tuned to work for a predefined domain. Based on this belief, a framework that uses corpus linguistics and statistical techniques to extract information as unambiguously as possible from arbitrary domains was developed.
A contrastive evaluation has been carried out not only in the domains of financial texts and movie reviews, but also with multilingual texts (Chinese and English). The results are encouraging. Our preliminary evaluation, based on the correlation between a time series of positive (negative) sentiment word (phrase) counts and a time series of indices produced by stock exchanges (Financial Times Stock Exchange, Dow Jones Industrial Average, Nasdaq, S&P 500, Hang Seng Index, Shanghai Index, and Shenzhen Index), showed that when the positive (negative) sentiment series correlates with the stock exchange index, the negative (positive) series shows a smaller degree of correlation and in many cases a degree of anti-correlation. Any interpretation of our result requires a careful, econometrically well-grounded analysis of the financial time series; this is beyond the scope of this work.
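The kind of evaluation described above, correlating daily sentiment word counts with a stock index, can be illustrated with a minimal sketch. This is not the thesis's implementation: the word lists `POSITIVE` and `NEGATIVE`, the helper names `count_sentiment` and `pearson`, the choice of the Pearson coefficient, and all sample texts and index values are assumptions invented purely for illustration.

```python
from math import sqrt

# Hypothetical sentiment word lists; the thesis derives its own from corpora.
POSITIVE = {"rise", "gain", "strong"}
NEGATIVE = {"fall", "loss", "weak"}


def count_sentiment(tokens, lexicon):
    """Count the tokens that appear in the given sentiment lexicon."""
    return sum(1 for token in tokens if token.lower() in lexicon)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented toy data: one news snippet and one index value per day.
    daily_news = [
        "shares rise on strong earnings",
        "markets fall amid weak demand and loss warnings",
        "stocks gain as outlook stays strong",
        "profits rise and gain on strong rally despite one loss",
    ]
    index_values = [100.0, 97.5, 99.0, 102.0]

    tokens_per_day = [text.split() for text in daily_news]
    pos_series = [count_sentiment(t, POSITIVE) for t in tokens_per_day]
    neg_series = [count_sentiment(t, NEGATIVE) for t in tokens_per_day]

    print("positive vs index:", pearson(pos_series, index_values))
    print("negative vs index:", pearson(neg_series, index_values))
```

With real data, the two series would need to be aligned by trading day first, and, as the abstract notes, any interpretation of the resulting coefficients would require proper econometric analysis.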
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available