Web Articles Scraping

Post by **Semper** » Sat Oct 26, 2013 10:39 am

Hi,

Would someone be kind enough to point me in the right direction, or just give me a hint?
How would I scrape all the articles from a site like http://www.jugantor.com (which has no RSS feeds) for later translation?

I've used MS for a lot of things but never for web work, so I'm a little confused.

I started with this but got stuck:

Code:

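// Assumed pause in seconds; "delay" was not defined in the original snippet
Let>delay=2
IECreate>IE[0]
IENavigate>%IE[0]%,http://www.jugantor.com/,r
IEWait>%IE[0]%,r
Wait>delay
// Collect the page's DIV tags into the divArr array
IEGetTags>www.jugantor.com,DIV,O,divArr
// Start from a clean output file
DeleteFile>C:\temp\jugantor.txt
If>divArr_count>0
  Let>k=0
  Repeat>k
    Let>k=k+1
    Let>this_div_html=divArr_%k%
    // Dump each DIV so its contents can be inspected
    WriteLn>C:\temp\jugantor.txt,resvar,%this_div_html%
  Until>k=divArr_count
Endif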

The WriteLn was just to see what I'm dealing with.
Or maybe there is a whole different approach that I'm not aware of.

Thanks.

Post by **Marcus Tettmar** » Mon Nov 11, 2013 2:25 pm

You are on the right lines.

This will help you identify the elements:
http://www.mjtnet.com/blog/2012/10/02/f ... web-sites/

Often I will use the extracttag function(s) to extract a container element and then use RegEx for the rest.
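
To make the "container first, RegEx second" idea concrete, here is a rough sketch. It is written in Python rather than Macro Scheduler, purely to illustrate the two-step logic, and everything in it is an assumption: the sample markup, the `articles` container id, and the helper names `extract_container` and `extract_links` are hypothetical and not taken from the actual site.

```python
import re

# Hypothetical page markup; the real site's structure will differ.
SAMPLE_HTML = """
<div id="header"><a href="/about">About</a></div>
<div id="articles">
  <a href="/news/1">First headline</a>
  <a href="/news/2">Second headline</a>
</div>
"""


def extract_container(html, container_id):
    """Return the inner HTML of the first <div> with the given id, or ''.

    The non-greedy match stops at the first </div>, so this is only
    safe when the container holds no nested <div> elements.
    """
    pattern = r'<div[^>]*\bid="' + re.escape(container_id) + r'"[^>]*>(.*?)</div>'
    match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else ""


def extract_links(container_html):
    """Return (href, text) pairs for each <a> element in the container."""
    link_pattern = r'<a[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>'
    return re.findall(link_pattern, container_html, re.DOTALL | re.IGNORECASE)


# Step 1: narrow the page down to one container.
# Step 2: run a small regex over that container only.
articles = extract_links(extract_container(SAMPLE_HTML, "articles"))
for href, title in articles:
    print(href, title)
```

Narrowing to the container first keeps the second pattern short and stops it from picking up navigation or footer links elsewhere on the page.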