Extracting data (Parsing Rules)


Advanced mode: Text Block structures



The Text Block structure lets you single out a piece of source text for further processing. Any rules you place inside it are applied only within the text that the block delimits. Here is a text block definition:

Text Block Begin Text to start block: <tr. Text to end block: </tr>
   Get Text between <b> and </b> ...
Text Block End

It singles out the first table row (tr) on the page, that is, the text between <tr> and </tr>. The Get Text rule is applied within this text block, so it harvests the first bold text within the first table row found. After Text Block End, the page position is placed after the text that ends the block, in this case after "</tr>".
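The behaviour described above can be imitated outside HH with a few lines of Python. This is only a sketch of the idea, not HH's actual implementation: the function names find_text_block and get_text_between and the sample page are made up for illustration, and it assumes the block includes its start and end text.

```python
def find_text_block(page, start_text, end_text, pos=0):
    """Return (block, new_pos) for the first block at or after pos.

    Assumption: the block includes its start and end text, and the new
    position is placed right after the end text. Returns (None, pos)
    when no complete block is found.
    """
    begin = page.find(start_text, pos)
    if begin == -1:
        return None, pos
    finish = page.find(end_text, begin + len(start_text))
    if finish == -1:
        return None, pos
    new_pos = finish + len(end_text)
    return page[begin:new_pos], new_pos


def get_text_between(text, open_text, close_text):
    """Return the first piece of text between open_text and close_text."""
    start = text.find(open_text)
    if start == -1:
        return None
    start += len(open_text)
    stop = text.find(close_text, start)
    if stop == -1:
        return None
    return text[start:stop]


# Hypothetical sample page: one bold title outside the table, two rows inside.
page = (
    "<b>Title</b><table>"
    "<tr><td><b>France</b></td><td>Paris</td></tr>"
    "<tr><td><b>Spain</b></td></tr>"
    "</table>"
)

# Applied to the whole page, the rule finds the bold text outside the table.
bold_on_page = get_text_between(page, "<b>", "</b>")

# Applied inside the text block, the rule only sees the first table row.
block, pos = find_text_block(page, "<tr", "</tr>")
first_bold = get_text_between(block, "<b>", "</b>")
```

Here the same rule returns the page title when run on the whole page, but the bold text of the first row when run inside the block, and pos ends up just after the first "</tr>".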

We can make things more interesting and extract the first table cell from every table row on the page. We would use:

Repeat Begin
   Text Block Begin Text to start block: <tr. Text to end block: </tr>
      Get Text between <td*> and </td> ...
   Text Block End
Repeat End

Now HH singles out each occurrence of a table row (the text between <tr> and </tr>) in turn, and within that text it finds the first table cell (the text between <td*> and </td>). Please open "example16.hhp" and give it a run. You will see that it harvests only the country name.
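To make the effect of the Repeat plus Text Block combination concrete, here is a rough Python equivalent. It is a sketch under assumptions, not what HH does internally: the regular expressions stand in for the start/end texts and for the "<td*>" wildcard, and the sample page and the name first_cell_per_row are invented for the example.

```python
import re

# Stand-in for the text block: everything from "<tr" up to the next "</tr>".
ROW = re.compile(r"<tr.*?</tr>", re.DOTALL)
# Stand-in for "Get Text between <td*> and </td>".
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)


def first_cell_per_row(page):
    """Collect the first table cell of every table row on the page."""
    results = []
    for row in ROW.finditer(page):          # Repeat: one text block per row
        cell = CELL.search(row.group(0))    # the rule only sees this row
        if cell:
            results.append(cell.group(1).strip())
    return results


# Hypothetical sample page with a country and a capital in each row.
page = """<table>
<tr><td valign=top>France</td><td>Paris</td></tr>
<tr><td>Spain</td><td>Madrid</td></tr>
</table>"""

countries = first_cell_per_row(page)
print(countries)
```

Because the cell search is restarted inside each row, only the first cell of every row (the country) is collected.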

There is an important difference between a script with and a script without such a text block. If you were to remove the Text Block definition from the above script:

Repeat Begin
   Get Text between <td*> and </td> ...
Repeat End

and run this in HH (go ahead, remove the text block rules from the script!), it would harvest every table cell (td). You simply get all data on the page between <td> and </td>, not just the country (the first TD within each TR).
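The same difference can be shown in a small Python sketch. Again this only mimics the idea and makes assumptions: the regular expression stands in for "Get Text between <td*> and </td>", and the sample page and the name all_cells are hypothetical.

```python
import re

# Stand-in for "Get Text between <td*> and </td>", repeated over the whole page.
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)


def all_cells(page):
    """Collect every table cell on the page, with no text block around the rule."""
    return [match.group(1).strip() for match in CELL.finditer(page)]


# Hypothetical sample page with a country and a capital in each row.
page = """<table>
<tr><td valign=top>France</td><td>Paris</td></tr>
<tr><td>Spain</td><td>Madrid</td></tr>
</table>"""

cells = all_cells(page)
print(cells)
```

Without a row-sized block to restrict the search, the capitals are harvested along with the countries.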

Note the use of the "*" wildcard in "<td*>", which lets you match "<td>" as well as "<td valign=top>".
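One way to picture the wildcard is as a translation into a regular expression. The sketch below assumes that "*" stands for any run of characters up to the next literal part of the pattern; HH's exact matching rules may differ, and the helper name wildcard_to_regex is made up for this example.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern with "*" wildcards into a compiled regex.

    Assumption: "*" matches any characters (as few as possible) up to
    the next literal part of the pattern.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*?".join(literal_parts), re.DOTALL)


td_open = wildcard_to_regex("<td*>")

plain = td_open.match("<td>France</td>")
with_attr = td_open.match("<td valign=top>France</td>")
other_tag = td_open.match("<th>Country</th>")

print(plain.group(0))       # the plain opening tag
print(with_attr.group(0))   # the opening tag including its attribute
print(other_tag)            # no match for a different tag
```

Both forms of the opening tag match the one pattern, while a different tag such as "<th>" does not.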

 

Text blocks cannot be nested, so choose the start and end text of each block carefully.
