Extracting data (Parsing Rules)


Advanced mode

Introduction


The first parsing rule is applied from the beginning of the page. Depending on the rule, HH remembers the position in the page where the rule finished, and each following rule is applied from that position. This enables you to do complex data extraction. We will start with a simple example.

In HH, open the profile "example13.hhp" from the examples directory. Do a preview to get an idea of the page source. You will see one Get Text rule, similar to the one we saw in Normal Mode. Then press Start Harvesting. You will see one country name appear. Why only one? In Normal Mode, HH produces all strings found between [text before target] and [text after target]. In Advanced Mode, it extracts only one string and remembers the position after [text after target]. If you tell HH to repeat this action, it will harvest all countries. Open profile "example14.hhp" to do this. Press Start Harvesting again and you will see all countries.
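The difference between a single Get Text rule and a repeated one can be sketched in Python. This is only an illustration of the idea, not HH's actual implementation: the function names get_text and get_all, the sample page, and the choice to leave the position unchanged when nothing is found are all assumptions made for this example.

```python
def get_text(page, before, after, pos=0):
    """Extract one string between `before` and `after`, searching from `pos`.

    Returns (text, new_pos), where new_pos is just past `after`.
    Returns (None, pos) when no further match exists (an assumption).
    """
    start = page.find(before, pos)
    if start == -1:
        return None, pos
    start += len(before)
    end = page.find(after, start)
    if end == -1:
        return None, pos
    return page[start:end], end + len(after)


def get_all(page, before, after, pos=0):
    """Repeat get_text until it fails, like a rule placed inside a repeat."""
    results = []
    while True:
        text, new_pos = get_text(page, before, after, pos)
        if text is None:
            break
        results.append(text)
        pos = new_pos  # continue from the remembered position
    return results


# Made-up page source, standing in for the example profile's page.
page = "<li>Albania</li><li>Austria</li><li>Belgium</li>"

first, pos = get_text(page, "<li>", "</li>")   # one rule, one result
countries = get_all(page, "<li>", "</li>")     # repeated rule, all results
print(first)
print(countries)
```

A single call corresponds to what "example13.hhp" does, while the loop corresponds to the repeat in "example14.hhp".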

Now suppose we want to harvest all countries except those starting with an A. We can use the Skip To Text rule to accomplish this. Open profile "example15.hhp". The first rule skips to "Austria". The text position is now placed after this country. The following rule in the repeat statement will continue from this position, so all countries starting with an A are skipped. Skip To Text is one way to control which information you harvest and which you do not. For example, use Skip To Text "<body" to start harvesting from the BODY part of the HTML, skipping the HEAD. Or use Skip To Text "<table" twice to skip the first two tables and go to the third. This way you can start harvesting the third table only. Mechanisms of this kind are what make Advanced Mode extremely powerful, and there is more to come.
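As a rough sketch of how Skip To Text combines with a repeated Get Text rule, the following Python example moves a position past a given string and then extracts from there. It is not HH's own code: the function names skip_to_text and get_text, the sample page, and the behaviour when the text is not found are assumptions made for illustration.

```python
def skip_to_text(page, text, pos=0):
    """Return the position just after the next occurrence of `text`.

    If `text` is not found, the position is left unchanged (an assumption).
    """
    idx = page.find(text, pos)
    if idx == -1:
        return pos
    return idx + len(text)


def get_text(page, before, after, pos=0):
    """Extract one string between `before` and `after`, searching from `pos`."""
    start = page.find(before, pos)
    if start == -1:
        return None, pos
    start += len(before)
    end = page.find(after, start)
    if end == -1:
        return None, pos
    return page[start:end], end + len(after)


# Made-up page source with two countries starting with an A.
page = "<ul><li>Albania</li><li>Austria</li><li>Belgium</li><li>Chile</li></ul>"

# Rule 1: Skip To Text "Austria" places the position after that country.
pos = skip_to_text(page, "Austria")

# Rule 2: repeated Get Text, continuing from the remembered position.
harvested = []
while True:
    text, pos = get_text(page, "<li>", "</li>", pos)
    if text is None:
        break
    harvested.append(text)

print(harvested)  # ['Belgium', 'Chile']
```

Because the search always starts at the remembered position, everything before "Austria" is never examined by the second rule.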
 
