Extracting data (Parsing Rules)
Normal mode
A parsing rule is applied from the
beginning of the page and, depending on
the rule, repeated a number of times on
that page. Each subsequent rule is also
applied from the beginning of the page.
Effectively, you search through the page
for a first piece of information (all
names for example), and then for another
one (all titles for example).
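
The behaviour described above can be sketched in Python. This is only an illustration of the idea, not HH's actual implementation: the function name `get_text_between`, the sample page and its markers are all hypothetical.

```python
def get_text_between(page, before, after, max_instances):
    """Collect the text found between `before` and `after` markers."""
    results = []
    pos = 0  # every rule starts at the beginning of the page
    while len(results) < max_instances:
        start = page.find(before, pos)
        if start == -1:
            break  # no further opening marker
        start += len(before)
        end = page.find(after, start)
        if end == -1:
            break  # opening marker without a closing one
        results.append(page[start:end])
        pos = end + len(after)  # continue after this match
    return results


# Hypothetical page holding two kinds of information
page = "<b>Ann</b><i>Editor</i><b>Bob</b><i>Author</i>"

# Each rule scans the whole page again, independently of the other
names = get_text_between(page, "<b>", "</b>", 10)
titles = get_text_between(page, "<i>", "</i>", 10)
print(names)   # ['Ann', 'Bob']
print(titles)  # ['Editor', 'Author']
```

Because the second call starts again at position 0, the order in which the two pieces of information appear on the page does not matter.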
Let us do a simple harvest (see the HH
Example directory, "example1.hhp"). In
Data Source (Web page by next button)
enter "http://www.happyharvester.com/example1/".
Set Parsing rules to Normal
mode and click the Add rule
button. The rule will read "Get Text",
which is precisely what we need.
Now click the Preview button in Data
Source. The page loads and you can see
the source code to the right. Search the
text for "Afghanistan". We want to
extract this text from the page. The way
to do this is to choose text before and
after "Afghanistan" and let HH use this
to find the country text. Make the Get
Text rule read as follows: "Get Text
between <td valign="top"> and </td>.
Find 1 instance. Column name for
results: Country".
Next, press Start Harvesting. You
will see that "Afghanistan" is put in
the column called "Country".
Now, change "Find 1 instance" to "Find
1000 instances". This will direct HH to
harvest at most 1000 results or as many
as it can find. Press Start
Harvesting again. The Country column
is now filled with all the countries on
this web page.
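
The effect of the instance count can be illustrated with a short Python sketch. It is an approximation of the behaviour described here, not HH's own code: the `harvest` function and the sample markup are made up for the illustration, and the real example page will contain more than these three rows.

```python
import re
from itertools import islice


def harvest(page, before, after, limit):
    """Return at most `limit` texts found between `before` and `after`."""
    # Non-greedy match so each result stops at the nearest closing marker
    pattern = re.escape(before) + "(.*?)" + re.escape(after)
    matches = re.finditer(pattern, page, flags=re.DOTALL)
    # islice stops at the limit or when the matches run out
    return [m.group(1) for m in islice(matches, limit)]


# Hypothetical stand-in for the example page
page = (
    '<tr><td valign="top">Afghanistan</td></tr>'
    '<tr><td valign="top">Albania</td></tr>'
    '<tr><td valign="top">Algeria</td></tr>'
)
before, after = '<td valign="top">', "</td>"

one = harvest(page, before, after, 1)
many = harvest(page, before, after, 1000)
print(one)   # ['Afghanistan']
print(many)  # ['Afghanistan', 'Albania', 'Algeria']
```

The limit is an upper bound: asking for 1000 instances on a page with three matches simply returns three.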
Next, let us try to harvest population
information as well from this web page.
Add an extra rule and make it read "Get
Text between <td class="normal"> and
</td>. Find 1000 instances. Column name
for results: Population". Click Start
Harvesting. Voila! HH first harvests
the country, and then harvests the
population information for you.
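
How two rules fill two columns can be sketched as follows. This is a rough Python model under stated assumptions: the helper `get_all_between`, the sample markup and the population figures (placeholders, not real data) are invented for the illustration, and lining the columns up row by row is an assumption about how the results are displayed.

```python
from itertools import zip_longest


def get_all_between(page, before, after):
    """Return every text that sits between `before` and `after`."""
    pieces = page.split(before)[1:]  # text following each opening marker
    return [p.split(after, 1)[0] for p in pieces if after in p]


# Hypothetical stand-in for the example page; the numbers are placeholders
page = (
    '<tr><td valign="top">Afghanistan</td><td class="normal">1000</td></tr>'
    '<tr><td valign="top">Albania</td><td class="normal">2000</td></tr>'
)

# (column name, text before, text after, maximum number of instances)
rules = [
    ("Country", '<td valign="top">', "</td>", 1000),
    ("Population", '<td class="normal">', "</td>", 1000),
]

# Rules run in order, each one over the whole page
columns = {
    name: get_all_between(page, before, after)[:limit]
    for name, before, after, limit in rules
}

# Line the columns up row by row; shorter columns are padded with ""
rows = list(zip_longest(*columns.values(), fillvalue=""))
print(columns)
print(rows)
```

Note that the columns are harvested independently, so they only line up correctly when every row of the page contains both pieces of information.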