<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mensfashionforless.com/</loc>
    <priority>1.000</priority>
  </url>
  ...
  <url>
    <loc>http://www.mensfashionforless.com/black-jacket.html</loc>
    <priority>0.5000</priority>
  </url>
</urlset>
<loc> is the tag used to indicate a URL, and the URLs inside those tags are the ones I'd like to spider.
To achieve this I first extract all the URLs, then issue an HTTP request to each one, one by one. Keep in mind I don't need to see the content at all; I just need to issue the request so that the server receives it and does whatever it's supposed to do. A good use case: my server caches webpages on demand, so I use this crawler to make the server cache every page listed in sitemap.xml, and later, when someone visits my website, the pages load more quickly.
Here is the Unix shell script!
#!/bin/sh
# spider.sh: use awk to get URLs from an XML sitemap
# and use wget to spider every one of them

# ff reads URLs from stdin, one per line, and issues a request to each
# with wget; --spider means the content is not downloaded
ff()
{
    while read -r line1; do
        wget --spider "$line1"
    done
}

# for every line containing <loc>, strip the <loc> and </loc> tags and print the bare URL
awk '{if(match($0,"<loc>")) {sub(/<\/loc>.*$/,"",$0); sub(/<loc>/,"",$0); print $0}}' sitemap.xml | ff
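By the way, if you want to see what the extraction step produces on its own, you can run just the awk line against a sitemap like the sample above; assuming only those two <loc> entries are present, the output is one URL per line:

http://www.mensfashionforless.com/
http://www.mensfashionforless.com/black-jacket.html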
In the script above I use 'awk' to extract the URLs and 'wget' to spider each of them without downloading the contents (that's what the --spider option does). Save it as 'spider.sh', run 'chmod 700 spider.sh' to make it executable, and then run './spider.sh' to spider your sitemap.xml!
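Put together, a typical session (assuming sitemap.xml is in the directory you run the script from) is just:

chmod 700 spider.sh
./spider.sh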
If you have any questions please let me know and I will do my best to help you!