<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mensfashionforless.com/</loc>
    <priority>1.000</priority>
  </url>
  ...
  <url>
    <loc>http://www.mensfashionforless.com/black-jacket.html</loc>
    <priority>0.5000</priority>
  </url>
</urlset>
<loc> is the tag used to indicate a URL, and those are the URLs I'd like to spider.
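From the sample sitemap above, for example, the two URLs to spider would be:

http://www.mensfashionforless.com/
http://www.mensfashionforless.com/black-jacket.html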
To achieve this, I first extract all the URLs from the sitemap, then issue an HTTP request to each one. Keep in mind that I don't need to see the content at all; I just need to issue the request so that the server receives it and does whatever it's supposed to do. A good use case: my server caches webpages on demand, so I use this crawler to make the server cache every webpage listed in sitemap.xml. Later, when someone visits my website, they'll see those pages more quickly.

Here is the Unix shell script!
# spider.sh: use awk to get URLs from an XML sitemap
# and use wget to spider every one of them

# read URLs from stdin, one per line, and spider each of them
ff() {
    while read line1; do
        wget --spider "$line1"
    done
}

# print whatever sits between <loc> and </loc>, then pipe the URLs to ff
awk '{if(match($0,"<loc>")) {sub(/<\/loc>.*$/,"",$0); sub(/<loc>/,"",$0); print $0}}' sitemap.xml | ff

The above script should run successfully in Bourne-compatible shells such as sh, ksh, and bash. (It will not run in C shell, which uses a different function syntax.) If it doesn't work for you, let me know!
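If you want to sanity-check the extraction stage before spidering anything, you can run the awk command on its own; it should print one URL per line:

awk '{if(match($0,"<loc>")) {sub(/<\/loc>.*$/,"",$0); sub(/<loc>/,"",$0); print $0}}' sitemap.xml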
In the script above I use 'awk' to extract the URLs and 'wget' to spider each of them without downloading the contents (that's what the --spider option does). Save it as 'spider.sh', run 'chmod 700 spider.sh', and then run './spider.sh' to spider your sitemap.xml!
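If your sitemap is large, spidering the URLs one at a time can be slow. One possible variation, assuming your xargs supports the -P flag for running jobs in parallel (GNU and BSD xargs both do): let xargs fan the URLs out to several wget processes at once. The count of 4 below is just an illustrative choice, not something the script requires:

# same awk extraction as above, but up to 4 wget processes run in parallel
awk '{if(match($0,"<loc>")) {sub(/<\/loc>.*$/,"",$0); sub(/<loc>/,"",$0); print $0}}' sitemap.xml | xargs -n 1 -P 4 wget --spider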
If you have any questions please let me know and I will do my best to help you!