Hi,
does anyone know how one can scrape the URLs from a Google query?
Can't seem to find an answer.
Any help appreciated.
URL scraping
- CyberCitizen
- Automation Wizard
- Posts: 721
- Joined: Sun Jun 20, 2004 7:06 am
- Location: Adelaide, South Australia
Not in a position to test at the moment, but couldn't you do an HTTPRequest, save the results to a file (or even keep them as a VAR), and then use the RegEx Easy Pattern [URL] to return the links?
Or are you talking about once you have the results of the links it scrapes the data from each page of the search results.
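Not Macro Scheduler, but the idea above (fetch the page, then regex out anything that looks like a link) can be sketched in Python; the pattern here is a deliberately crude stand-in for the Easy Patterns [URL] matcher, not its exact definition:

```python
import re
from urllib.request import urlopen

# crude stand-in for an Easy Patterns [URL] match:
# "http" or "https", then "://", then everything up to
# whitespace, a quote, or an angle bracket
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")

def extract_urls(html):
    """Return every http(s) URL found in a chunk of HTML."""
    return URL_PATTERN.findall(html)

def scrape_urls(page_url):
    """Fetch a page and extract the URLs from it."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    return extract_urls(html)
```

For example, `extract_urls('<a href="https://example.com/a">x</a>')` returns `['https://example.com/a']`.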
FIREFIGHTER
- Dorian (MJT support)
- Automation Wizard
- Posts: 1389
- Joined: Sun Nov 03, 2002 3:19 am
- Contact:
Cyber's solution is the best one: using HTTPRequest and RegEx. The challenge with Google is that the layout sometimes changes, causing the RegEx to fail. Plus it litters the results with paid ads and Wikipedia links.
But it is certainly possible. We just finished writing a custom script for someone which did exactly this.
Yes, we have a Custom Scripting Service. Message me or go here
Thank you for your replies.
I've tried HTTPRequest with several RegEx patterns.
I almost always end up with 5-20 URLs, but they are all http://www.google.
This is my latest regex
Code: Select all
Let>HTTP_SSL=1
HTTPRequest>https://www.google.co.uk/#q=filetype:swf+dentist+london,,GET,,res
Let>Pat=[paste from http://alanstorm.com/url_regex_explained]
RegEx>Pat,res,0,match,num,,,
MDL>match
The whole pattern can't be shown here, but it's from http://alanstorm.com/url_regex_explained
I look for the results in the debugger.
Can't figure this out.
- Marcus Tettmar
- Site Admin
- Posts: 7395
- Joined: Thu Sep 19, 2002 3:00 pm
- Location: Dorset, UK
- Contact:
Why try and scrape the Google front end? It was designed for a human to look at, has all kinds of dynamic stuff in it, adverts and all sorts.
Instead you should use Google's simple API which returns pure results with less clutter.
Try this example:
Code: Select all
//location of output file
Let>out_file=%SCRIPT_DIR%\google_results.txt
//specify the search term:
Let>theSearchTerm=Nord Keyboards
//specify how many results you want:
Let>theQuantityWanted=20
//replace spaces in search term with + symbol
StringReplace>theSearchTerm,SPACE,+,theSearchTerm
DeleteFile>out_file
Let>start=1
Let>numresults=0
Label>get_results
//Uses Google's Ajax API - Simpler For Parsing - returns 8 at a time (use start=)
Let>url=http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%theSearchTerm%&rsz=large&start=%start%
HTTPRequest>url,,GET,,htmlRES
//pull out all the URLs
RegEx>(?<="url":").*?(?="),htmlRes,0,URLs,numURLS,0
//loop through all the URLs
Let>sres=0
Repeat>sres
Let>sres=sres+1
Let>this_url=URLS_%sres%
If>numresults<theQuantityWanted
WriteLn>out_file,wlres,this_url
Endif
Let>numresults=numresults+1
Until>sres=numURLS
If>numresults<theQuantityWanted
Let>start=start+8
Goto>get_results
Endif
ExecuteFile>out_file
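As an aside for anyone porting this loop to another language: since the AJAX Search API returns JSON, a JSON parser can take the "url" fields directly instead of regexing for them. A hedged Python sketch, assuming the response has the historical responseData.results[].url shape (that API has since been deprecated, so treat the field names as assumptions and the sample body as fabricated):

```python
import json

def urls_from_response(body):
    """Pull the result URLs out of an AJAX Search API JSON response.

    Assumes the historical shape {"responseData": {"results": [{"url": ...}]}};
    this is a parsing sketch, not a call against a live endpoint.
    """
    data = json.loads(body)
    results = data.get("responseData", {}).get("results", [])
    return [r["url"] for r in results]

# fabricated sample body matching the assumed shape
sample = '{"responseData": {"results": [{"url": "http://example.com/1"}, {"url": "http://example.com/2"}]}}'
print(urls_from_response(sample))  # -> ['http://example.com/1', 'http://example.com/2']
```

The upside over the regex approach is that a layout change in surrounding markup can't break it; only a change to the JSON field names would.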
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar
Did you know we are now offering affordable monthly subscriptions for Macro Scheduler Standard?