Extracting data (Parsing Rules)


Parsing rules in Normal mode


Rules


Get Text between [text before target] and [text after target]. Find [Number of instances] instance(s). Column name for results: [column name].

Get Text harvests every string in the web page source that lies between [text before target] and [text after target]. Each result is placed on a new row in the same column.
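As a rough illustration of what this rule does, the matching can be sketched in Python. This is not the tool's own implementation; the function name get_text_between and its parameters are hypothetical, and the sketch ignores the * wildcard.

```python
def get_text_between(source, before, after, max_instances=10000):
    """Return up to max_instances strings found between `before` and `after`."""
    results = []
    pos = 0
    while len(results) < max_instances:
        start = source.find(before, pos)
        if start == -1:
            break  # no further occurrence of the text before the target
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            break  # opening text found, but no closing text after it
        results.append(source[start:end])
        pos = end + len(after)  # continue searching after this match
    return results


page = "<li><b>Apple</b></li><li><b>Pear</b></li><li><b>Plum</b></li>"
rows = get_text_between(page, "<b>", "</b>", max_instances=2)
print(rows)  # ['Apple', 'Pear']
```

Each element of the returned list corresponds to one row in the results column.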
Get Table data from HTML table. Use table number [Table Number].

Gets all data from a particular HTML table. For instance, if [Table Number] is 2, it harvests the second HTML table in the source, extracting all data from that table and placing it in the appropriate columns and rows. This function is fast and easy to use. Often you do not even need to inspect the source for the table number: start at number 1, hit Harvest, and see if the data rolls in. If it doesn't, try table number 2, and so on until you get the data you are after.
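The idea of picking the n-th table can be sketched with Python's standard html.parser module. This is only an approximation under stated assumptions: the names TableGrabber and get_table are hypothetical, tables are counted in the order their opening tags appear, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the rows of the n-th <table> (1-based) in an HTML source."""

    def __init__(self, table_number):
        super().__init__()
        self.table_number = table_number
        self.tables_seen = 0
        self.in_target = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables_seen += 1
            self.in_target = self.tables_seen == self.table_number
        elif self.in_target and tag == "tr":
            self.rows.append([])  # start a new result row
        elif self.in_target and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")  # start a new cell in the current row
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_target = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_target and self.in_cell:
            self.rows[-1][-1] += data


def get_table(source, table_number):
    grabber = TableGrabber(table_number)
    grabber.feed(source)
    grabber.close()
    return grabber.rows


page = (
    "<table><tr><td>menu</td></tr></table>"
    "<table><tr><th>Name</th><th>Price</th></tr>"
    "<tr><td>Tea</td><td>2.50</td></tr></table>"
)
print(get_table(page, 2))  # [['Name', 'Price'], ['Tea', '2.50']]
```

Calling get_table with 1, then 2, and so on mirrors the trial-and-error approach described above: an empty or unwanted result means the next table number is worth trying.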

Get Pictures between [text before target] and [text after target] of type [picture type]. Save files in folder [folder name]. Column name for results: [column name].

Get Pictures downloads all pictures in <IMG> tags and saves them in [folder name] on your hard disk. It only scans the text between [text before target] and [text after target]. You can save all images or just those of the type specified in [picture type]. The resulting file name is stored in column [column name].
Get All Pictures of type [picture type]. Save files in folder [folder name]. Column name for results: [column name].

Get All Pictures downloads all pictures in <IMG> tags and saves them in [folder name] on your hard disk. You can save all images or just those of the type specified in [picture type]. The resulting file name is stored in column [column name].
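The selection step of the picture rules (finding <IMG> tags and filtering by type) might look roughly like the following Python sketch. It does not download anything, and it rests on assumptions: the names ImgCollector and find_pictures are hypothetical, and the picture type is compared against the file extension exactly, so JPG would not match a .jpeg file here. The tool itself may behave differently.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <IMG SRC=...> works too
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_pictures(source, picture_type="ALL"):
    collector = ImgCollector()
    collector.feed(source)
    collector.close()
    if picture_type.upper() == "ALL":
        return collector.sources
    wanted = "." + picture_type.lower()
    return [
        src
        for src in collector.sources
        # compare only the extension of the URL path, ignoring any query string
        if posixpath.splitext(urlparse(src).path)[1].lower() == wanted
    ]


page = '<img src="a.jpg"><IMG SRC="b.png"><img src="c.JPG?x=1">'
print(find_pictures(page, "JPG"))  # ['a.jpg', 'c.JPG?x=1']
```

To restrict the scan to part of the page, as Get Pictures does, the same function could be applied to the substring between the two target texts instead of the whole source.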

Save url. Column name for results: [column name].

Records the URL of the web page that is currently being harvested. The URL is stored in column [column name]. This gives you a reference showing which results were harvested from which web page (URL).

Save source file in folder [folder name]. Column name for results: [column name].

Saves the complete web page source in a file on your hard disk. The file name is stored in column [column name]. Pages are named page0.html to page[n].html.
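The page0.html to page[n].html naming scheme, combined with the Save url rule, can be sketched as below. This is an illustration only: the function save_page and the column names "url" and "source file" are hypothetical, and UTF-8 is assumed as the file encoding.

```python
import os
import tempfile


def save_page(url, source, folder_name, page_index):
    """Write the page source to page<index>.html and return one results row."""
    os.makedirs(folder_name, exist_ok=True)
    file_name = f"page{page_index}.html"
    with open(os.path.join(folder_name, file_name), "w", encoding="utf-8") as f:
        f.write(source)
    # one row pairing the harvested URL with the saved file name
    return {"url": url, "source file": file_name}


# Use a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as folder:
    pages = [
        ("http://example.com/a", "<html>A</html>"),
        ("http://example.com/b", "<html>B</html>"),
    ]
    rows = [save_page(u, s, folder, i) for i, (u, s) in enumerate(pages)]
    saved_files = sorted(os.listdir(folder))

print(rows[1])  # {'url': 'http://example.com/b', 'source file': 'page1.html'}
```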


Variables

text before target: the text directly preceding the thing you want to find. Use text that is unique to this result (e.g. <td> alone is not a good string; make it more specific). You can use * to represent any string in between. For example, "<a href=*>*</a>" matches all hyperlinks in a document, so to harvest the text of all links in a web page, use Get Text between <a href=*> and </a>.
text after target: the text directly after the target you are trying to find.
column name: the name for the results column; it appears above the results. Use the same name to put results in the same column.
Number of instances: the maximum number of results you want to find. Just put a large number if you want to find as many as possible (e.g. 10000).
picture type: use ALL if you want all picture types, or JPG or any other one for just that type.
folder name: a directory on your hard drive.
Table Number: tells the function to harvest data from the n-th table in the HTML source, where n is the Table Number.
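One way to picture how the * wildcard in the target texts could work is to translate each text into a regular expression. The sketch below assumes that * matches the shortest possible string, which may differ from the tool's actual matching; the function names are hypothetical.

```python
import re


def wildcard_to_regex(text):
    """Escape the literal parts of `text` and turn each * into a lazy match."""
    return ".*?".join(re.escape(part) for part in text.split("*"))


def get_text_between_wild(source, before, after, max_instances=10000):
    pattern = re.compile(
        wildcard_to_regex(before) + "(.*?)" + wildcard_to_regex(after),
        re.DOTALL,  # let matches span line breaks in the page source
    )
    return [m.group(1) for m in pattern.finditer(source)][:max_instances]


page = '<a href="/one">First</a> and <a href="/two">Second</a>'
print(get_text_between_wild(page, "<a href=*>", "</a>"))  # ['First', 'Second']
```

Here the Number of instances variable corresponds to max_instances: a large value such as 10000 simply returns everything that matches.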

