URL scraping

Technical support and scripting issues

Moderators: Dorian (MJT support), JRL

Semper
Junior Coder
Posts: 30
Joined: Mon Feb 25, 2008 3:28 pm

URL scraping

Post by Semper » Tue Oct 01, 2013 7:37 am

Hi,

does anyone know how one can scrape the URLs from a Google query?
Can't seem to find an answer :?
Any help appreciated.

User avatar
CyberCitizen
Automation Wizard
Posts: 721
Joined: Sun Jun 20, 2004 7:06 am
Location: Adelaide, South Australia

Post by CyberCitizen » Tue Oct 01, 2013 2:07 pm

Not in a position to test at the moment, but couldn't you do an HTTPRequest, save the results to a file or even keep them as a VAR, and then use RegEx Easy Patterns [URL] to return the links?

Or are you talking about scraping the data from each page of the search results once you have the links?
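For illustration, here's the same HTTPRequest-plus-RegEx idea sketched in Python (the regex below is a simplified stand-in for the EasyPatterns [URL] token, not its exact pattern):

```python
import re

def extract_urls(html):
    # Match http/https URLs, stopping at whitespace, quotes or angle brackets
    return re.findall(r'https?://[^\s"\'<>]+', html)

# Sample fragment standing in for the fetched results page
page = '<a href="http://example.com/page1">one</a> <a href="https://example.org/">two</a>'
urls = extract_urls(page)
# urls -> ['http://example.com/page1', 'https://example.org/']
```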
FIREFIGHTER

User avatar
Dorian (MJT support)
Automation Wizard
Posts: 1389
Joined: Sun Nov 03, 2002 3:19 am
Contact:

Post by Dorian (MJT support) » Wed Oct 02, 2013 11:53 pm

Cyber's solution is the best one: using HTTPRequest and RegEx. The challenge with Google is that the layout sometimes changes, causing the RegEx to fail. Plus it litters the results with paid ads and Wikipedia links.

But it is certainly possible. We just finished writing a custom script for someone which did exactly this.
Yes, we have a Custom Scripting Service. Message me or go here

Semper
Junior Coder
Posts: 30
Joined: Mon Feb 25, 2008 3:28 pm

Post by Semper » Thu Oct 03, 2013 8:10 am

Thank you for your replies.

I've tried HTTPRequest with several RegEx patterns.
I almost always end up with 5-20 URLs, but they are all http://www.google.

This is my latest regex

Code: Select all

Let>HTTP_SSL=1
HTTPRequest>https://www.google.co.uk/#q=filetype:swf+dentist+london,,GET,,res
Let>Pat=[paste from http://alanstorm.com/url_regex_explained]
RegEx>Pat,res,0,match,num,,,
mdl>match
The whole pattern can't be shown here but it's from http://alanstorm.com/url_regex_explained

I look at the results in the debugger.

Can't figure this out :(

User avatar
Marcus Tettmar
Site Admin
Posts: 7395
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK
Contact:

Post by Marcus Tettmar » Thu Oct 03, 2013 1:48 pm

Why try and scrape the Google front end? It was designed for a human to look at, has all kinds of dynamic stuff in it, adverts and all sorts.

Instead, you should use Google's simple Ajax API, which returns plain results with far less clutter.

Try this example:

Code: Select all

//location of output file
Let>out_file=%SCRIPT_DIR%\google_results.txt
//specify the search term:
Let>theSearchTerm=Nord Keyboards
//specify how many results you want:
Let>theQuantityWanted=20

//replace spaces in search term with + symbol
StringReplace>theSearchTerm,SPACE,+,theSearchTerm

DeleteFile>out_file

Let>start=1
Let>numresults=0
Label>get_results
  //Uses Google's Ajax API - Simpler For Parsing - returns 8 at a time (use start=)
  Let>url=http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%theSearchTerm%&rsz=large&start=%start%
  HTTPRequest>url,,GET,,htmlRES

  //pull out all the URLs
  RegEx>(?<="url":").*?(?="),htmlRES,0,URLs,numURLS,0

  //if nothing came back, stop rather than loop forever
  If>numURLS=0,done

  //loop through all the URLs
  Let>sres=0
  Repeat>sres
    Let>sres=sres+1
    Let>this_url=URLs_%sres%
    If>numresults<theQuantityWanted
      WriteLn>out_file,wlres,this_url
    Endif
    Let>numresults=numresults+1
  Until>sres=numURLS

  If>numresults<theQuantityWanted
    Let>start=start+8
    Goto>get_results
  Endif

Label>done
ExecuteFile>out_file
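For anyone working outside Macro Scheduler: the Ajax API returns JSON, so a real JSON parser avoids regex entirely. A minimal Python sketch, assuming the documented responseData/results layout (the sample string below is a trimmed, hypothetical response; the live API carries more fields per result):

```python
import json

# Trimmed stand-in for what the Ajax Search API returns
sample = '''{"responseData": {"results": [
  {"url": "http://example.com/a", "titleNoFormatting": "A"},
  {"url": "http://example.com/b", "titleNoFormatting": "B"}
]}}'''

data = json.loads(sample)
# Walk the parsed structure instead of regexing the raw text
urls = [r["url"] for r in data["responseData"]["results"]]
# urls -> ['http://example.com/a', 'http://example.com/b']
```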
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar

Did you know we are now offering affordable monthly subscriptions for Macro Scheduler Standard?

Semper
Junior Coder
Posts: 30
Joined: Mon Feb 25, 2008 3:28 pm

Post by Semper » Thu Oct 03, 2013 6:21 pm

Thanks Marcus!

Just what I need :D
