Extracting data (Parsing Rules)


General Tips and Tricks


The use of the asterisk (*)

You can use the * wildcard in Normal Mode (Get Text, Get Pictures rules) as well as in Advanced Mode (Get Text, Skip To Text, Get Text Until, Text Block rules). An asterisk matches zero or more characters. For example, <a href=*> matches <a href="link.html">, with the * matching "link.html".

You can also use more than one asterisk. For example, <a href=*>*</a> matches <a href="link.html">My Link</a>, with the first * matching "link.html" and the second one matching My Link.
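To get a feel for how the asterisk behaves, the matching can be imitated outside the program. The following Python sketch is only an approximation: it assumes each * matches the shortest possible run of characters, and the helper names wildcard_to_regex and match_wildcards are made up for this illustration. They are not part of the program.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern containing * wildcards into a compiled regex.

    Assumption: each * matches the shortest run of zero or more
    characters. The real program may resolve ambiguity differently.
    """
    # Escape the literal pieces so characters like '.' are taken literally,
    # then join them with a non-greedy capture group for every asterisk.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("(.*?)".join(literal_parts), re.DOTALL)

def match_wildcards(pattern, text):
    """Return what each * matched, or None if the pattern is not found."""
    found = wildcard_to_regex(pattern).search(text)
    return list(found.groups()) if found else None

one = match_wildcards('<a href=*>', '<a href="link.html">')
two = match_wildcards('<a href=*>*</a>', '<a href="link.html">My Link</a>')
print(one)
print(two)
```

The first call returns one item (the quoted file name) and the second returns two items, one per asterisk.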

The asterisk is used frequently in the Get Text rule. Suppose you have the following bit of HTML:
<td valign=top><font color=blue>first text to extract</font></td>
<td valign=left><b>second text to extract</b></td>

and you want to extract "first text to extract" and "second text to extract". You can achieve this using the following rule:
Get Text between <td *><*> and </*></td>
This will extract the text regardless of what is put in the td definition and regardless of the font or b definitions (or any other definition, for that matter).
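The effect of this rule can be approximated in Python to check what it would pick up from a page. This is a rough sketch, not the program's own implementation: it assumes each * matches as few characters as possible, and the function name get_text_between is invented for the example.

```python
import re

def get_text_between(start, end, html):
    """Imitate a 'Get Text between START and END' rule with * wildcards.

    Assumption: every * matches the shortest run of characters.
    Returns the text found between each START/END pair, in page order.
    """
    def to_regex(fragment):
        # Literal pieces are escaped; each * becomes a non-greedy wildcard.
        return ".*?".join(re.escape(part) for part in fragment.split("*"))

    # Only the text between the two delimiters is captured.
    rule = re.compile(to_regex(start) + "(.*?)" + to_regex(end), re.DOTALL)
    return rule.findall(html)

html = (
    "<td valign=top><font color=blue>first text to extract</font></td>\n"
    "<td valign=left><b>second text to extract</b></td>"
)
extracted = get_text_between("<td *><*>", "</*></td>", html)
for text in extracted:
    print(text)
```

Both cells are matched by the same rule, even though one uses a font tag and the other a b tag.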

Debugging

If you want to watch a slow-motion replay of how a script harvests while you are building it, go to General Settings and mark the option Indicate current row while harvesting. Rules will now be highlighted as they execute, making it easier to see what happens. Make sure you unmark the option before a real harvest, as it slows harvesting considerably.

