IE_ExtractTag access violation module ntdll

Post by **Optimus** » Thu Mar 11, 2010 7:29 am

Hi,

I have a script (v11.1) that trawls through a WebSphere-developed site
and extracts data from an HTML table. The script has been running fine
for about a year, but in the last 24 hours it has started failing with
this error:

Access Violation at xxx in module ntdll.dll. Read of address
yyy.

I have also tried it on a different machine (both Windows XP SP3) with the
latest version v11.1.22 and found the same behaviour. I'm guessing
something has changed in the HTML input that is causing the script to barf.

Using the debugger, the error seems to occur when using IE_ExtractTag. I have wrapped this function in a subroutine as follows, where
Get_Cell_Text_Var_1 is the cell number I wish to capture:

SRT>Get_Cell_Text
  IE_ExtractTag>%IE[0]%,,TD,Get_Cell_Text_Var_1,0,cell_text,r
END>Get_Cell_Text

Four cells from the WebSphere-generated HTML table are included
below. The first and second TD text values are captured correctly.
The attempt to capture the third cell actually returns the value from
the second cell. The call to get the fourth cell results in the
access violation.

12/03/10 09:00:00

131091

Flyer - 1Pp Or 2Pp

CLK3 offset colour between 1,000 and 10,000

Interestingly, I've also looked at the HTML table with WebRecorder and
the tag extraction tool, and they seem to identify the TD values
unambiguously.

Any ideas?

Thanks.

Post by **Optimus** » Thu Mar 11, 2010 11:38 pm

Can anyone confirm if there is a string or buffer limit when working with IE_ExtractTag? If so, is there a way to increase it?

Looking at the second table cell from my previous post, and comparing it with a similar instance from about a year ago, I calculated the new cell is 5 characters longer.

New cell value: 849 characters (923 if TD tags included)
Old cell value: 844 characters (918 if TD tags included)

I have no idea if this is the cause of the memory violation.

Post by **Marcus Tettmar** » Fri Mar 12, 2010 8:50 am

Set the buffer size like this:

Let>cell_text_SIZE=1024
IE_ExtractTag>%IE[0]%,,TD,Get_Cell_Text_Var_1,0,cell_text,r

That will set the buffer size to 1024 characters.

I can't see your HTML source properly as it has messed up the forum. Could you try again, this time placing it inside [code] .... [/code] tags (use the Code button) and disable HTML in the post.

Post by **Optimus** » Fri Mar 12, 2010 1:16 pm

Sample of HTML input posted again, as requested.

<TD name="ColumnData" valign="top" width="200" align="Center"><SPAN name="RFQCloseDate" class="outputData">12/03/10 09:00:00</SPAN><SPAN name="RFQCloseDate_ValidationError" class="ValidationErrorText"></SPAN></TD>

<TD name="ColumnData" valign="top" width="200" align="Center"><a name="RFQNumber" class="outputData" href="RFQ_Details_Method" onclick="var _f=(this.form || _bst_locateForm_Supplier_00215SQM_00515Model_005151274bd36ce6_005151842());var _els = _f.elements; _els["RfqNo"].value = "131091"; _f.action = "/wps/myportal/streamsolsportal/!ut/p/c1/04_SB8K8xLLM9MSSzPy8xBz9CP0os_ggZx9HCydDRwMLM1MXAyMXA8sgIydHY3cnY6B8JJK8QZi_uYGRqYGTi0GQs7GXkwkB3eEg-_DrB8kb4ACOBvp-Hvm5qfoFuREGWSaOigDuPdir/dl2/d1/L0lDU0NTSUpKZ2tLQ2xFQSEvb01vUUFBSVFKQUFNWXhpbE1RWndYQk00L1lCSkp3NDU0NTAtNUY0a3N0eWp3LzdfNVJPUUFCMUEwTzhHMDAyQlFQTDJGUDA0SjcvYUE0WUYxMDUvYmZfYWN0aW9uL19nZW5fY2FsbF8xX1JGUV9EZXRhaWxzX01ldGhvZA!!/#7_5ROQAB1A0O8G002BQPL2FP04J7";_f.target = "_self";if (!_f.onsubmit || _f.onsubmit()) _f.submit();return false">131091</a><SPAN name="RFQNumber_ValidationError" class="ValidationErrorText"></SPAN></TD>

<TD name="ColumnData" valign="top" width="200" align="Center"><SPAN name="RFQTitle" class="outputData">Flyer - 1Pp Or 2Pp</SPAN><SPAN name="RFQTitle_ValidationError" class="ValidationErrorText"></SPAN></TD>

<TD name="ColumnData" valign="top" width="200" align="Center"><SPAN name="ProductTypeDesc" class="outputData">CLK3 offset colour between 1,000 and 10,000</SPAN><SPAN name="ProductTypeDesc_ValidationError" class="ValidationErrorText"></SPAN></TD>

Post by **Marcus Tettmar** » Fri Mar 12, 2010 1:44 pm

When I click on each item with the tag extractor I get SPAN elements returned. I think you should be using SPAN rather than TD. Each TD contains a SPAN anyway and has no text of its own, because the text is within the SPAN. So either you use SPAN, or you set IE_ExtractTag to return all HTML and then parse it yourself. You may as well go with SPAN.

The last "cell" in the table gives me:

Let>SPAN5_SIZE=4098
IE_ExtractTag>%IE[0]%,,SPAN,5,0,SPAN5,r
MidStr>r_6,1,r,SPAN5

Span 0 contains: 12/03/10 09:00:00
Span 1 is empty
Span 2 contains: link 131091
Span 3 contains: Flyer - 1Pp Or 2Pp
Span 4 is empty
Span 5 contains: CLK3 offset colour between 1,000 and 10,000

Post by **Optimus** » Fri Mar 12, 2010 2:41 pm

I agree that when you use the tag extractor and click on each item you get the SPAN element returned, except for the second cell. I can't test this right now, but the link in the second cell is NOT within a SPAN (I can't remember what it returns, probably TD). However, if you click in the white space around each item (but within its cell) you get a TD element. From a coding perspective, this seemed less obscure. Iterating through the SPAN elements should work, but it means you have to filter out the (hidden) empty SPAN elements used for validation, which means extra coding.

I could rework my code to parse the SPAN elements, but how can I be sure I won't run into the same memory problem? I just don't feel I've nailed the cause of the problem yet. Remember, I've been successfully parsing on TD for nearly a year now. IE_ExtractTag is essentially a black box and I cannot see what it is doing.

Starting to ramble ... I'm going to go away now and have a think of what has been put forward.

Thanks.

Post by **Optimus** » Wed Mar 17, 2010 3:24 am

My memory error mysteriously disappeared at some stage, but I was still not able to read past the first link. I guess you take what you get when dealing with a third-party site.

Following from Marcus' suggestion to use SPAN, I modified my code to extract the link (A) when trying to parse the second cell only.

Subsequent attempts to read later links in the table simply returned the first link text value.

In addition to the size assignment Marcus suggested earlier, he also said to set the result variable to nothing (an empty value) before each call. This fixed the parsing problem and I've been running my script successfully for more than 24 hours now.

SRT>Get_Link_Text
  Let>link_text_SIZE=4098
  Let>link_text=
  IE_ExtractTag>%IE[0]%,,A,Get_Link_Text_Var_1,0,link_text,r
END>Get_Link_Text
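
For anyone who would rather go the SPAN route discussed earlier in the thread, the same two precautions (set the _SIZE variable and clear the result variable before every call) could be combined with a loop that skips the empty validation SPANs. This is only a rough sketch and has not been run against the site: span_count, span_text, span_len and k are made-up names, the count of 6 simply matches the six SPANs listed above, and it assumes the IE session in %IE[0]% already exists as in the earlier snippets.

```
// Sketch only. span_count is a hypothetical upper bound (6 SPANs were listed above).
Let>span_count=6
Let>k=0
Repeat>k
  // Set the buffer size and clear the result variable before every call
  Let>span_text_SIZE=4098
  Let>span_text=
  IE_ExtractTag>%IE[0]%,,SPAN,k,0,span_text,r
  // Skip the empty SPANs that are only there for validation messages
  Length>span_text,span_len
  If>span_len>0
    MessageModal>SPAN %k%: %span_text%
  Endif
  Let>k=k+1
Until>k=span_count
```

Clearing span_text inside the loop is the important part: without it, a call that returns nothing could leave the previous value in place, which may be what made later reads appear to return the earlier text.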

Thanks to Marcus again.
