Extract html from this text ?

travel88
Newbie
Posts: 15
Joined: Sun May 20, 2007 11:42 am

Post by travel88 » Sun Dec 28, 2008 1:22 pm

Hi,
Can you help me write some code, preferably VBScript, to extract just the URL (the "html" part) from the text below? The surrounding text may vary, but the URL pattern is always the same.

Code:

: xxx  x xx x x  x x x x x  x  x x xxxx xxxxx  http://images.google.com/imgres?imgurl=http://jquery.com/demo/thickbox/images/plant4.jpg&imgrefurl=http://jquery.com/demo/thickbox/&usg=__-9nKVuEbmoL4CJiJMB9lzrsES3o=&h=480&w=640&sz=207&hl=en&start=7&um=1&tbnid=eWicC5rSDLtyCM:&tbnh=103&tbnw=137&prev=/images%3Fq%3Dimages%26um%3D1%26hl%3Den%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26sa%3DN)
xxxx xxxxx
xxxxxx
The pattern starts with http:// and ends with the bracket ).

Thanks for helping

Bob Hansen
Automation Wizard
Posts: 2475
Joined: Tue Sep 24, 2002 3:47 am
Location: Salem, New Hampshire, US

Post by Bob Hansen » Sun Dec 28, 2008 3:26 pm

Use the VBScript RegExp object.

Good examples from Marcus at this link: http://www.mjtnet.com/forum/viewtopic.p ... ight=regex
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!

Post by travel88 » Mon Dec 29, 2008 3:18 pm

Sorry, I tried my level best but cannot get it right. I hope you can help.
Thanks

Post by Bob Hansen » Mon Dec 29, 2008 11:10 pm

Please post your script so it can be reviewed and edited.

Post by travel88 » Thu Jan 01, 2009 5:17 am

Code:

//A VBScript Function to search a string for a regex pattern
//returns a list of matches separated by semicolons
VBSTART
Function regExSearch(patrn,str)
  Set regEx = New RegExp ' Create regular expression.
  regEx.Pattern = patrn ' Set pattern.
  regEx.IgnoreCase = True ' Make case insensitive. Default=False
  Set matches = RegEx.Execute(str)
  List = ""
  For each match in matches
    List = List & match.value & ";"
  Next
  regExSearch = Mid(List,1,Len(List)-1)
End Function
VBEND

//Read the file contents into a variable
ReadFile>C:\MSG1.txt,FileData

//replace CRLF chars with VBScript equivalents
StringReplace>FileData,CR," & vbCR & ",FileData
StringReplace>FileData,LF," & vbLF & ",FileData
//Double quote any quotes for VBScript
StringReplace>FileData,","",FileData

//Perform the regex search
VBEval>regExSearch("REGEX_PATTERN","%FileData%"),URLList

//We now have a semicolon delimited list of URLs.  We could explode this into an array:
Separate>URLList,;,URLS
If>URLS_COUNT>0
  Let>k=1
  Repeat>k
    Let>ThisURL=URLS_%k%
    MessageModal>ThisURL
    //we could write it to a file:
    WriteLn>C:\result.txt,result,ThisURL
    Let>k=k+1
  Until>k,URLS_COUNT
Endif

This always fails with a VBScript runtime error.

Thanks

Post by travel88 » Fri Jan 09, 2009 11:56 pm

Any help here?

Thanks

Me_again
Automation Wizard
Posts: 1101
Joined: Fri Jan 07, 2005 5:55 pm
Location: Somewhere else on the planet

Post by Me_again » Sat Jan 10, 2009 12:14 am

VBEval>regExSearch("REGEX_PATTERN","%FileData%"),URLList

I believe it's choking on the VBEval because you have not defined "REGEX_PATTERN".
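
One plausible reading of the runtime error, sketched here in Python rather than VBScript as a quick check: the literal text REGEX_PATTERN matches nothing in the input, so the list of matches is empty, and the function then calls Mid(List,1,Len(List)-1) with a length of -1, which VBScript typically rejects with an "Invalid procedure call or argument" error. The sample string below is shortened from the first post and is only an illustration, and Python's re module stands in for the VBScript RegExp object.

```python
import re

# Shortened, illustrative version of the text from the first post (an assumption,
# not the full original URL).
sample = ": xxx x xx http://images.google.com/imgres?imgurl=http://jquery.com/demo/thickbox/images/plant4.jpg&hl=en) xxxx"

# The script passes the literal placeholder text as the pattern.
matches = [m.group(0) for m in re.finditer("REGEX_PATTERN", sample, re.IGNORECASE)]

# Mirror the VBScript loop: append each match followed by ";".
joined = "".join(match + ";" for match in matches)

# This is the length the VBScript function would hand to Mid().
mid_length = len(joined) - 1

print(matches)     # the placeholder matches nothing
print(mid_length)  # negative when there are no matches
```

If this reading is right, the error goes away once a real pattern that matches the input is supplied, although guarding the Mid call against an empty list would make the function more robust.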

Post by travel88 » Sat Jan 10, 2009 12:24 am

I need help defining the pattern.

The URL in the example above has to be extracted, starting at http:// and ending at the closing bracket ")".

Thanks very much for helping me with this, as I have been scratching my head over it for a long time... :?

Post by travel88 » Thu Jan 15, 2009 5:22 pm

Any help?

Marcus Tettmar
Site Admin
Posts: 7380
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK

Post by Marcus Tettmar » Thu Jan 15, 2009 5:35 pm

Detecting URLs in a block of text can never be 100% reliable. Regex Guru Jan Goyvaerts has a good post on this here.

But we can use Jan's suggested regex:

Code:

VBSTART
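Function regExSearch(patrn,str)
  Set regEx = New RegExp ' Create regular expression.
  regEx.Pattern = patrn ' Set pattern.
  regEx.IgnoreCase = True ' Make case insensitive. Default=False
  regEx.Global = True ' Return all matches, not just the first. Default=False
  Set matches = regEx.Execute(str)
  List = ""
  For Each match In matches
    List = List & match.Value & ";"
  Next
  ' Strip the trailing ";" only when something matched; a negative length would raise a runtime error
  If Len(List) > 0 Then List = Left(List, Len(List) - 1)
  regExSearch = List
End Function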
VBEND

Let>FileData=: xxx  x xx x x  x x x x x  x  x x xxxx xxxxx  http://images.google.com/imgres?imgurl=http://jquery.com/demo/thickbox/images/plant4.jpg&imgrefurl=http://jquery.com/demo/thickbox/&usg=__-9nKVuEbmoL4CJiJMB9lzrsES3o=&h=480&w=640&sz=207&hl=en&start=7&um=1&tbnid=eWicC5rSDLtyCM:&tbnh=103&tbnw=137&prev=/images%3Fq%3Dimages%26um%3D1%26hl%3Den%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26sa%3DN) xxxx

//for VBScript double quote quotes and replace hard line breaks with vbCRLF
StringReplace>FileData,","",FileData
StringReplace>FileData,CRLF," & vbCRLF & ",FileData

VBEval>regExSearch("\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]","%FileData%"),URL
MessageModal>URL
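
As a quick cross-check of what this pattern returns for the sample text, here is a minimal sketch in Python rather than VBScript. It assumes Python's re module treats this particular pattern the same way the VBScript RegExp object does; the names JAN_PATTERN, SAMPLE and find_urls are made up for the illustration.

```python
import re

# The pattern from the post above, unchanged.
JAN_PATTERN = r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]"

# The sample text from the thread, split across lines only for readability.
SAMPLE = (
    ": xxx  x xx x x  x x x x x  x  x x xxxx xxxxx  "
    "http://images.google.com/imgres?imgurl=http://jquery.com/demo/thickbox/images/plant4.jpg"
    "&imgrefurl=http://jquery.com/demo/thickbox/&usg=__-9nKVuEbmoL4CJiJMB9lzrsES3o="
    "&h=480&w=640&sz=207&hl=en&start=7&um=1&tbnid=eWicC5rSDLtyCM:&tbnh=103&tbnw=137"
    "&prev=/images%3Fq%3Dimages%26um%3D1%26hl%3Den%26client%3Dfirefox-a"
    "%26rls%3Dorg.mozilla:en-US:official%26sa%3DN) xxxx"
)


def find_urls(text):
    """Return every full match of JAN_PATTERN in text, ignoring case."""
    return [m.group(0) for m in re.finditer(JAN_PATTERN, text, re.IGNORECASE)]


urls = find_urls(SAMPLE)
for url in urls:
    print(url)
```

In this sketch the nested http://jquery.com/... addresses are not reported separately, because ":" and "/" are both in the character class and the first match consumes them. The closing ")" is not in either character class, so it is left out of the match.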
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar

Post by travel88 » Sat Jan 17, 2009 3:29 am

Thank you. It works perfectly. :D

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Post by jpuziano » Sat Jan 17, 2009 5:16 am

Hi Marcus,

Thanks for the code example above and the link... Jan is indeed a Regex guru and makes great products... I even admire the way his website/s are laid out... especially the Version History pages that give a good level of detail about every Improvement, Bug Fix or New Feature.

FYI - the code block in your post is not showing a vertical scroll bar so I can't scroll to see all the code... I have to CTRL-A copy all the lines and paste them elsewhere just to see them. Horizontal scrollbar is there, just not the vertical one.

Browser here is IE7 on XP SP3 with all the latest Windows updates, up-to-date as of 10 minutes ago (a bunch of security updates were installed).

Thanks again and take care
Last edited by jpuziano on Fri Feb 20, 2009 5:50 am, edited 3 times in total.
jpuziano

Note: If anyone else on the planet would find the following useful...
[Open] PlayWav command that plays from embedded script data
...then please add your thoughts/support at the above post - :-)

Post by jpuziano » Sat Jan 17, 2009 5:40 am

Hi travel88,

When you said:
  • The pattern starts with http:// and ends with the bracket ).
Did you mean that the ) is actually part of the URL?

I tried the above code and for me, it clips the ) off the end when it returns the URL... however that may not be important to you because your URL seems to bring up the same web page whether you include the ) at the end or leave it off... at least that's what it did for me.

Take care
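
If the closing bracket does need to be kept, one possible approach is a pattern that encodes the rule from the first post literally: start at http://, take everything that is neither whitespace nor ")", then require the ")". The following is a rough sketch in Python rather than VBScript; BRACKET_PATTERN and the shortened sample text are made up for the illustration, and it would not cope with URLs that themselves contain a ")".

```python
import re

# Literal reading of "starts with http:// and ends with the bracket )".
BRACKET_PATTERN = r"http://[^\s)]*\)"

# Hypothetical, shortened sample; not the full URL from the thread.
text = "xxx xxxx http://jquery.com/demo/thickbox/images/plant4.jpg&hl=en) xxxx"

match = re.search(BRACKET_PATTERN, text, re.IGNORECASE)

# Keep the bracket, or strip it afterwards if only the address is wanted.
with_bracket = match.group(0) if match else ""
without_bracket = with_bracket.rstrip(")")

print(with_bracket)
print(without_bracket)
```

The same pattern text should be usable as the first argument of regExSearch, on the assumption that the VBScript RegExp object handles \s and negated character classes the same way.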