PARser for Content Extraction and Logical Structure
What's PARCELS?
Web documents that look similar often use different HTML tags to achieve their layout effect.
These tags often make it difficult for a machine to find text or images of interest.
PARCELS is a backend system, written in Java, designed to distinguish the different components of a website and parse them into a logical structure.
This logical structure is independent of the design/style of any website.
Each component in the structure will be given a tag relevant to the domain under which it is classified.
For example, under the News Articles domain, some of the tags will be:
Title of article
Date/Time of article
Country where the news occurred
Images supporting contents of articles
Links supporting contents of articles
Main keywords of articles
Main content of articles
Supporting content of articles
Links to related articles
From the logical structure, the system will be able to infer the relations between components and extract the relevant fields of interest, e.g. main content, supporting links, and so on.
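As an illustration only, extracting fields of interest from a tagged logical structure might look like the Java sketch below. All names here (`NewsTag`, `Component`, `LogicalStructure`, `extract`) are hypothetical and are not taken from the PARCELS code base. The sketch assumes each component carries exactly one tag from the News Articles domain, and it does not attempt the classification step itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tag set mirroring the News Articles tags listed above.
enum NewsTag {
    TITLE, DATE_TIME, COUNTRY, IMAGE, SUPPORTING_LINK,
    KEYWORDS, MAIN_CONTENT, SUPPORTING_CONTENT, RELATED_LINK
}

// One classified component of a page: a domain tag plus the text it covers.
final class Component {
    final NewsTag tag;
    final String text;

    Component(NewsTag tag, String text) {
        this.tag = tag;
        this.text = text;
    }
}

// Assumed container for the logical structure: an ordered list of tagged components.
final class LogicalStructure {
    private final List<Component> components = new ArrayList<>();

    void add(NewsTag tag, String text) {
        components.add(new Component(tag, text));
    }

    // Returns the text of every component carrying the given tag, in page order.
    List<String> extract(NewsTag tag) {
        List<String> result = new ArrayList<>();
        for (Component c : components) {
            if (c.tag == tag) {
                result.add(c.text);
            }
        }
        return result;
    }
}

class ParcelsSketchDemo {
    public static void main(String[] args) {
        // Placeholder data; a real system would fill this from parsed HTML.
        LogicalStructure page = new LogicalStructure();
        page.add(NewsTag.TITLE, "Sample headline");
        page.add(NewsTag.MAIN_CONTENT, "Body text of the article.");
        page.add(NewsTag.SUPPORTING_LINK, "http://example.com/background");

        System.out.println(page.extract(NewsTag.MAIN_CONTENT));
        System.out.println(page.extract(NewsTag.SUPPORTING_LINK));
    }
}
```

Because the lookup depends only on the tag, the same `extract` call would work for any page in the domain, regardless of which HTML tags produced its layout.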
A GUI will also be provided to demonstrate the usefulness of PARCELS.