Extracting data (Parsing Rules)


Normal mode


A parsing rule is applied from the beginning of the page and, depending on the rule, may be repeated a number of times on that page. Each subsequent rule is also applied from the beginning of the page. In effect, you search through the page for one kind of information first (all names, for example), and then for another (all titles, for example).

Let us do a simple harvest (see "example1.hhp" in the HH Example directory). In Data Source (Web page by next button), enter "http://www.happyharvester.com/example1/". Set Parsing rules to Normal mode and click the Add rule button. The rule will read "Get Text", which is precisely what we need.
Now click the Preview button in Data Source. The page loads and you can see its source code on the right. Search the source for "Afghanistan". We want to extract this text from the page. To do this, choose the text that comes directly before and after "Afghanistan" and let HH use it to find the country text. Make the Get Text rule read as follows: "Get Text between <td valign="top"> and </td>. Find 1 instance. Column name for results: Country".
Next, press Start Harvesting. You will see that "Afghanistan" is put in the column called "Country".
Now, change "Find 1 instance" to "Find 1000 instances". This directs HH to harvest at most 1000 results, or as many as it can find if there are fewer. Press Start Harvesting again. The Country column is now filled with all the countries on this web page.
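
To make the idea behind "Get Text between ... and ..." concrete, here is a minimal Python sketch of the same kind of search. It is not HH's actual implementation. The function name get_text_between and the sample page source are assumptions made for illustration, and only "Afghanistan" is taken from the example page.

```python
def get_text_between(page, before, after, max_instances=1):
    """Collect up to max_instances pieces of text found between two markers."""
    results = []
    pos = 0  # the search starts at the beginning of the page
    while len(results) < max_instances:
        start = page.find(before, pos)
        if start == -1:
            break  # no further "before" marker on the page
        start += len(before)
        end = page.find(after, start)
        if end == -1:
            break  # "before" marker without a closing "after" marker
        results.append(page[start:end])
        pos = end + len(after)  # continue searching after this match
    return results


# Made-up page source (assumption); only "Afghanistan" comes from the example.
SAMPLE_PAGE = (
    '<tr><td valign="top">Afghanistan</td><td class="normal">n/a</td></tr>'
    '<tr><td valign="top">Country B</td><td class="normal">n/a</td></tr>'
)

first = get_text_between(SAMPLE_PAGE, '<td valign="top">', '</td>', 1)
everything = get_text_between(SAMPLE_PAGE, '<td valign="top">', '</td>', 1000)
print(first)       # ['Afghanistan']
print(everything)  # ['Afghanistan', 'Country B']
```

With a limit of 1 the search stops after the first match. With a limit of 1000 it simply runs out of matches, which mirrors "at most 1000 results or as many as it can find".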

Next, let us try to harvest population information from this web page as well. Add an extra rule and make it read "Get Text between <td class="normal"> and </td>. Find 1000 instances. Column name for results: Population". Click Start Harvesting. Voila! HH first harvests the countries, and then harvests the population information for you.
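
The way two Normal mode rules each scan the page from the beginning and fill their own column can be sketched in Python as follows. This is only an illustration, not how HH works internally. The names apply_rule and harvest, the rule tuples and the sample page (including its placeholder population values) are assumptions.

```python
import re


def apply_rule(page, before, after, max_instances):
    """Return up to max_instances texts found between `before` and `after`."""
    # Escape the markers so they are matched literally, not as regex syntax.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return re.findall(pattern, page, flags=re.DOTALL)[:max_instances]


def harvest(page, rules):
    """Apply every rule to the whole page and store its results per column."""
    return {
        column: apply_rule(page, before, after, limit)
        for before, after, limit, column in rules
    }


# Made-up page source (assumption); the population values are placeholders.
SAMPLE_PAGE = (
    '<tr><td valign="top">Afghanistan</td>'
    '<td class="normal">(population A)</td></tr>'
    '<tr><td valign="top">Country B</td>'
    '<td class="normal">(population B)</td></tr>'
)

# (text before, text after, number of instances, column name)
RULES = [
    ('<td valign="top">', '</td>', 1000, "Country"),
    ('<td class="normal">', '</td>', 1000, "Population"),
]

columns = harvest(SAMPLE_PAGE, RULES)
for name, values in columns.items():
    print(name, values)
```

Because each rule searches the full page independently, the Country rule finishes before the Population rule begins, which is the order described above.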
 
