Extracting data (Parsing Rules)
 

Parsing rules in Advanced mode

Rules


Skip To Text [skip target].

Skip To Text advances the text position forward until it encounters [skip target]. The text position is then placed immediately after the skip target.
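
The behaviour described above can be sketched in Python. This is only an illustration, not the tool's own code: the function name skip_to_text and the sample page are hypothetical, and leaving the position unchanged when the target is missing is an assumption.

```python
def skip_to_text(source: str, position: int, skip_target: str) -> int:
    """Return the text position just after the next occurrence of skip_target."""
    found = source.find(skip_target, position)
    if found == -1:
        # Assumption: if the target is not found, the position does not move.
        return position
    return found + len(skip_target)


# Hypothetical page source used only for this example.
page = "<html><body><h1>Prices</h1><td>12.50</td></body></html>"
pos = skip_to_text(page, 0, "<h1>Prices</h1>")
remaining = page[pos:]  # text that later rules would work on
```

After the call, the remaining text starts at the first <td> cell, so a following rule starts reading from there.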

Get Text between [text before target] and [text after target]. Column name for result: [column name].

Get Text harvests all strings in the web page source that lie between [text before target] and [text after target]. Each result is placed on a new row in the same column.
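
As a rough sketch of this rule with literal (non-wildcard) markers, the following Python function collects every string found between the two markers. The function name and the sample page are hypothetical; the real tool may differ, for example in how it treats overlapping matches.

```python
def get_text_between(source: str, before: str, after: str) -> list[str]:
    """Collect every string that sits between `before` and `after`."""
    results = []
    position = 0
    while True:
        start = source.find(before, position)
        if start == -1:
            break
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            break
        results.append(source[start:end])  # one result per row
        position = end + len(after)
    return results


# Hypothetical page source used only for this example.
page = '<td class="price">12.50</td><td class="price">8.00</td>'
rows = get_text_between(page, '<td class="price">', "</td>")
```

Each element of rows corresponds to one row in the results column.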

Get Text Until read (store) text until [text after target] is found. Column name for result: [column name].

Get Text Until reads text from the current position forward until it encounters [text after target]. The text that was read is stored in the column [column name]. This is useful when, after a Skip To Text rule, you want to start extracting data directly from the current text position. Get Text cannot do this, since it requires a [text before target].
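
A minimal Python sketch of Get Text Until, used after skipping to a label. The function name and sample text are hypothetical. Moving the position past [text after target], and reading to the end of the page when the target is missing, are assumptions rather than documented behaviour.

```python
def get_text_until(source: str, position: int, after: str) -> tuple[str, int]:
    """Read from `position` up to `after`; return the text and the new position."""
    end = source.find(after, position)
    if end == -1:
        # Assumption: with no end marker, read to the end of the source.
        return source[position:], len(source)
    # Assumption: the position moves past the end marker.
    return source[position:end], end + len(after)


# Hypothetical page source used only for this example.
page = "Total: 42 EUR<br>"
start = page.find("Total: ") + len("Total: ")  # what Skip To Text would do
text, new_pos = get_text_until(page, start, "<br>")
```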

Get Next URL. Column name for URL: [column name]. Column name for Description: [column name].

Get Next URL finds the next URL (href definition) from the current page position. It stores the URL itself (the href part) in the URL column and the description of the link (the part between <a href=...> and </a>) in the Description column. URLs are automatically completed to full path URLs (i.e. http://www.server.com/img/img001.jpg instead of ../img001.jpg).
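
The URL completion described above can be illustrated with Python's standard urllib.parse.urljoin. This is a sketch only: the function name, the regular expression, and the base URL of the page are assumptions made for the example, and a regular expression is a simplification of real HTML parsing.

```python
import re
from urllib.parse import urljoin

# Simplified pattern for <a ... href=...>description</a>; an assumption, not the tool's parser.
LINK = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def get_next_url(source: str, position: int, base_url: str):
    """Return (full URL, description, new position) for the next link, or None."""
    match = LINK.search(source, position)
    if match is None:
        return None
    full_url = urljoin(base_url, match.group(1))  # complete relative URLs
    description = match.group(2).strip()
    return full_url, description, match.end()


# Hypothetical page and base URL used only for this example.
page = '<p>See <a href="../img001.jpg">Photo 1</a></p>'
base = "http://www.server.com/img/gallery/index.html"
result = get_next_url(page, 0, base)
```

Here the relative link ../img001.jpg is completed against the page's own address, and the link text becomes the description.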
Get Next Picture of type [picture type]. Save files in folder [folder name]. Column name for results: [column name].

Get All Pictures downloads all pictures in <IMG> tags and saves them in [folder name] on your hard disk. You can save all images or only those of the type specified in [picture type]. The resulting file names are stored in column [column name].
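
The selection by [picture type] can be sketched as follows. The example only collects the picture URLs and does not download anything. The function name, the regular expression, and matching the type against the file extension are assumptions for illustration.

```python
import re
from urllib.parse import urljoin, urlparse

# Simplified pattern for the src attribute of <IMG> tags (an assumption).
IMG_SRC = re.compile(r'<img\s[^>]*?src\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def picture_urls(source: str, base_url: str, picture_type: str = "ALL") -> list[str]:
    """List full URLs of pictures, optionally limited to one file type."""
    wanted = picture_type.upper()
    urls = []
    for match in IMG_SRC.finditer(source):
        url = urljoin(base_url, match.group(1))
        path = urlparse(url).path
        # Assumption: the picture type is the file extension, compared case-insensitively.
        extension = path.rsplit(".", 1)[-1].upper() if "." in path else ""
        if wanted == "ALL" or extension == wanted:
            urls.append(url)
    return urls


# Hypothetical page and base URL used only for this example.
page = '<IMG src="a.jpg"><img alt="x" src="/icons/b.png">'
base = "http://www.server.com/img/index.html"
jpg_only = picture_urls(page, base, "JPG")
everything = picture_urls(page, base, "ALL")
```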

Text Block: refer to the page explaining the Text Block here.
Repeat: refer to the page explaining the Repeat structures here.
Download Next Web Link: refer to the page explaining this rule here.
Save URL. Column name for result: [column name].

Simply saves the current page's URL. Useful when you want to keep track of which page the data was extracted from.
Show Message with content [message text].

Pops up a message box with either the Text Position, the Last Harvested Text or the Last Text Block content. Primarily used for debugging a script: you can check whether the Text Position advances in a repeat structure, or inspect the contents of a text block.
 

Variables

skip target: the text directly preceding the point you want to skip to.
text before target: the text directly preceding the thing you want to find. Use text that is unique to this result (e.g. <td> is not a good string, make it more specific). You can use * to represent any string in between. In this fashion "<a href=*>*</a>" matches all hyperlinks in a document. To harvest all links in a web page for example, use Get Text between <a href=*> and </a>.
text after target: the text directly after the target you are trying to find.
column name: the name for the results column; it appears above the results. Use the same name to put results in the same column.
picture type: use ALL if you want all picture types, or JPG or any other one for just that type.
folder name: a directory on your hard drive.
message text: what the message box shows; either Text Position, Last Harvested Text or Last Text Block content.
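
The * wildcard described for [text before target] can be approximated in Python by translating the pattern into a regular expression. This is a sketch under assumptions: the function names are hypothetical, and treating * as "the shortest possible run of any characters" is a guess at the tool's matching behaviour.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Turn a pattern with * wildcards into a regular expression string."""
    # Escape the literal pieces; each * becomes a shortest-match "any string".
    return ".*?".join(re.escape(part) for part in pattern.split("*"))


def get_text_between_wildcard(source: str, before: str, after: str) -> list[str]:
    """Collect all strings between `before` and `after`, both of which may contain *."""
    regex = re.compile(
        wildcard_to_regex(before) + "(.*?)" + wildcard_to_regex(after),
        re.DOTALL,
    )
    return [match.group(1) for match in regex.finditer(source)]


# Hypothetical page source used only for this example.
page = '<a href="a.html">First</a> and <a href="b.html" class="x">Second</a>'
links = get_text_between_wildcard(page, "<a href=*>", "</a>")
```

With the markers <a href=*> and </a>, the result holds the description of every hyperlink on the page.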

