How to extract useful content from an HTML page

Dev - June 10, 2015, by Mathieu WEBER

Link: http://gelembjuk.com/index.php?option=com_content&view=article&id=55:how-to-extract-useful-content-from-html-page&catid=39:text-mining&Itemid=57

For any HTML page it only makes sense to look at the body of the document, that is, everything inside the <body> ... </body> tags. So the top-level node of the tree is the body tag. My algorithm for extracting the useful data is as follows (a code sketch follows the list):

1. Get the root node (the body tag) and pass it to the recursive procedure that begins at step 2.
2. Calculate the length of the clear text in the node.
3. Get the list of subnodes of the node.
4. If there is only one subnode, go to step 2 to process that subnode.
5. For each subnode, calculate the length of its clear text.
6. Calculate the coefficient (K) of distribution of the text between the subnodes.
7. If K is low (< 5%), the text is distributed uniformly between the subnodes and the current node is exactly the main content of the page.
8. If K is not low (>= 5%), choose the subnode with the longest clear text and apply the same procedure to it (go to step 2). This is the recursive call of the content-extracting function.
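The procedure can be sketched roughly like this. This is not the author's tool, just a minimal illustration of the steps above, assuming HTML::TreeBuilder / HTML::Element nodes; text_length and element_children are small helpers introduced here, and coefficient_k is the distribution coefficient sketched in the section on K below.

    use strict;
    use warnings;

    # Helpers over HTML::TreeBuilder / HTML::Element nodes.
    sub text_length {
        my ($node) = @_;
        return length($node->as_text);        # as_text returns the text with all tags removed
    }

    sub element_children {
        my ($node) = @_;
        return grep { ref $_ } $node->content_list;   # keep element nodes, drop bare text chunks
    }

    my $K_THRESHOLD = 5;    # stop descending when K < 5%

    sub find_main_content {
        my ($node) = @_;
        my @children = element_children($node);

        # No subnodes: nothing deeper to inspect.
        return $node unless @children;

        # Exactly one subnode: descend into it (back to step 2).
        return find_main_content($children[0]) if @children == 1;

        # Several subnodes: measure how the text is distributed among them.
        my @lengths = map { text_length($_) } @children;
        my $k = coefficient_k(@lengths);      # sketched in the section on K below

        # Low K: the text is spread uniformly, the current node is the main content.
        return $node if $k < $K_THRESHOLD;

        # High K: recurse into the subnode with the longest clear text.
        my $best = 0;
        for my $i (1 .. $#children) {
            $best = $i if $lengths[$i] > $lengths[$best];
        }
        return find_main_content($children[$best]);
    }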

There are different ways to calculate the coefficient of distribution of the text, and different values can be used as the signal to stop. I used 5%.

Let's look at how this algorithm works with the example above. First it processes the body node. The body has one subnode, a table, so the procedure simply moves to that subnode. The table has two subnodes (two tr tags); one of them (the second) contains much more text than the first, so the coefficient of distribution of the text is high there. The next node to process is therefore the second tr. Going deeper, this tr has two subnodes (two td tags); the second contains much more text and is used in the next step. The td tag (the current node) has a few subnodes (several p tags). In this case the coefficient of distribution of the text is low, because part of the text sits in each p tag. So at this step the procedure stops and returns the HTML code of this td as the main (useful) content of the page.
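To make the walkthrough concrete, here is a small page with the same shape as the example (two table rows, the second row holding the article in several p tags) fed through the sketch above. The markup is a reconstruction of the structure described, not the article's original example.

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # A page shaped like the example: body > table > 2 tr,
    # second tr > 2 td, second td > several p tags with the article text.
    my $html = q{
      <html><body><table>
        <tr><td>menu</td><td>login</td></tr>
        <tr>
          <td>sidebar links</td>
          <td>
            <p>First paragraph of the article text, long enough to matter.</p>
            <p>Second paragraph of the article text, roughly the same size.</p>
            <p>Third paragraph of the article text, roughly the same size.</p>
          </td>
        </tr>
      </table></body></html>
    };

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $body = $tree->look_down(_tag => 'body');

    my $main = find_main_content($body);   # from the sketch above
    print $main->as_HTML, "\n";            # prints the second td with the p tags
    $tree->delete;                         # free the parsed tree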

Calculating the coefficient (K).

Let L(i) be the length of the text in subnode number i, for i = 1..n, where n is the number of subnodes of the node. FTL is the length of all the text in the node (just the text left after removing the HTML tags). LV is the linear variance of the list L(i), i = 1..n. And finally, K = 100 * LV / FTL.

I have created a simple tool that demonstrates how this works. It is written in Perl: Getting useful content of the Web page.
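A minimal sketch of the K calculation, assuming "linear variance" means the mean absolute deviation of the subnode text lengths from their mean (so LV stays in the same units, characters, as FTL), and approximating FTL by the sum of the subnode lengths:

    use strict;
    use warnings;
    use List::Util qw(sum);

    sub coefficient_k {
        my (@lengths) = @_;
        my $n   = scalar @lengths;
        my $ftl = sum(@lengths) || 0;        # FTL, approximated by the sum of subnode text lengths
        return 0 if $n == 0 || $ftl == 0;    # nothing to distribute

        my $mean = $ftl / $n;
        my $lv   = sum(map { abs($_ - $mean) } @lengths) / $n;   # LV as mean absolute deviation

        return 100 * $lv / $ftl;             # K = 100 * LV / FTL
    }

    # Three paragraphs of almost equal size -> small K (content found);
    # a short menu cell next to a long article cell -> large K (keep descending).
    printf "%.1f%%\n", coefficient_k(60, 62, 58);   # ~0.7%
    printf "%.1f%%\n", coefficient_k(13, 190);      # ~43.6%

Under this reading, equal-sized subnodes give K = 0, while one subnode holding nearly all of the text pushes K well above the threshold, which matches the behaviour the walkthrough describes.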