A Generic Framework that produces structured information from semi-structured data

Mr. Eldon Marks, Mr. Nicholas Glasgow, Mr. Roger Nurse
Information extraction is a task required in the day-to-day operations of many businesses, particularly for data analysis and data storage. Many techniques aim to extract information perfectly regardless of domain. Previous research showed the popularity of rule-based techniques for implementations capable of working across domains. However, it also revealed that these techniques often require sophisticated rules, which we expected would be difficult for regular users to specify. This study presents a generic framework (GIEF) that combines a dictionary-based rule learning technique with Natural Language Processing tools to extract information from semi-structured documents. The framework exploits the document’s structure and requires actual words and simple tags as a seed list when specifying rules. An evaluation of the framework revealed that it has the potential for success and may perform in other domains as well as it did in this context.
information extraction, seed word list, dictionary, domain-adaptive, natural language processing