PARCELS
 
PARser for Content Extraction and Logical Structure

 
What's PARCELS?  
 
Web documents that look similar often use different HTML tags to achieve the same layout. This variation makes it difficult for a machine to locate the text or images of interest.
 
PARCELS is a backend system, written in Java, designed to distinguish the different components of a web site and parse it into a logical structure. This logical structure is independent of the design or style of any particular website.
 
Each component in the structure is given a tag relevant to the domain under which it is classified.
 
For example, under the News Articles domain, some of the tags are:

  • Title of article
  • Date/Time of article
  • Reporter Name
  • Source Station
  • Country where the news occurred
  • Images supporting contents of articles
  • Links supporting contents of articles
  • Main keywords of articles
  • Main content of articles
  • Supporting content of articles
  • Links to related articles
  • Newsletter
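
The tag set above could be modelled as a simple enumeration attached to each page component. The following is only a minimal sketch under that assumption: the names `NewsTag`, `TaggedComponent` and `NewsTagDemo` are hypothetical and are not taken from the PARCELS source, and the sample texts are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tag set for the News Articles domain, mirroring the list above.
enum NewsTag {
    TITLE, DATE_TIME, REPORTER_NAME, SOURCE_STATION, COUNTRY,
    SUPPORTING_IMAGES, SUPPORTING_LINKS, MAIN_KEYWORDS,
    MAIN_CONTENT, SUPPORTING_CONTENT, RELATED_LINKS, NEWSLETTER
}

// One component of a page after classification: a tag plus the text it covers.
final class TaggedComponent {
    final NewsTag tag;
    final String text;

    TaggedComponent(NewsTag tag, String text) {
        this.tag = tag;
        this.text = text;
    }
}

class NewsTagDemo {
    // Builds a tiny, made-up logical structure for illustration.
    static List<TaggedComponent> sample() {
        List<TaggedComponent> components = new ArrayList<>();
        components.add(new TaggedComponent(NewsTag.TITLE, "Example headline"));
        components.add(new TaggedComponent(NewsTag.REPORTER_NAME, "A. Reporter"));
        components.add(new TaggedComponent(NewsTag.MAIN_CONTENT, "Body of the article."));
        return components;
    }

    public static void main(String[] args) {
        // Print each component with its tag, in document order.
        for (TaggedComponent c : sample()) {
            System.out.println(c.tag + ": " + c.text);
        }
    }
}
```

Because the tags describe the role of a component rather than its markup, two pages with different HTML could map to the same sequence of `NewsTag` values.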
     

     
    Click on the above images for a detailed demonstration.
     
    From the logical structure, the system will be able to infer the relations between components and extract the relevant fields of interest, e.g. main content and supporting links.
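
    Extracting fields of interest from a tagged structure can be pictured as grouping components by tag and keeping only the requested tags. This is a rough sketch, not the PARCELS implementation: the class `FieldExtractionDemo`, its methods, the string tag names and the sample data are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldExtractionDemo {
    // Groups {tag, text} pairs by tag, preserving document order.
    static Map<String, List<String>> group(String[][] components) {
        Map<String, List<String>> byTag = new LinkedHashMap<>();
        for (String[] c : components) {
            List<String> texts = byTag.get(c[0]);
            if (texts == null) {
                texts = new ArrayList<>();
                byTag.put(c[0], texts);
            }
            texts.add(c[1]);
        }
        return byTag;
    }

    // Keeps only the fields of interest, in the order they were requested.
    static Map<String, List<String>> extract(Map<String, List<String>> byTag,
                                             List<String> wanted) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String tag : wanted) {
            if (byTag.containsKey(tag)) {
                out.put(tag, byTag.get(tag));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up page components: {tag, text}.
        String[][] page = {
            {"main content", "Body of the article."},
            {"newsletter", "Subscribe to our newsletter."},
            {"supporting links", "http://example.com/a"},
            {"supporting links", "http://example.com/b"}
        };
        Map<String, List<String>> fields =
            extract(group(page), Arrays.asList("main content", "supporting links"));
        System.out.println(fields);
    }
}
```

    The newsletter component is dropped here simply because it was not requested; inferring relations between components would need more than this tag lookup.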
     
    A GUI will also be provided to demonstrate the usefulness of PARCELS.


     

     
    National University of Singapore, School of Computing