Sep 19, 2016


How Does CloudFlare CDN Treat Bots And Crawlers? Detailed Examples Provided.

Do you use CloudFlare? I have been using it for several years now and I am very satisfied. However, there's one problem: how does CloudFlare treat bots and crawlers? For example, how does CloudFlare treat Googlebot and Bingbot? Does CloudFlare treat bots differently? Does it treat bots just like normal traffic from web browsers?

CloudFlare is a free global CDN, DNS, DDoS protection & web security provider that can speed up and protect websites.

I asked CloudFlare this question, but they said that how CloudFlare treats bots is a secret they cannot share with me. So to answer these questions I had to run a thorough experiment myself. Read on for my findings.

Why do I need to know?

First, you may be curious why I need to know this. The reason is simple: I'd like to block bots and crawlers I don't like by returning a 403 to them. But if I do so, will it affect normal web traffic? Will it affect normal visitors to my website? If so, how?

Since CloudFlare is unwilling to tell me, I had to find out for myself. Let's assume my origin server returns a 403 HTTP status code to the proximic crawler and a 200 to Googlebot, and that, for the purposes of this discussion, "the URL" refers to:

http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
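Before testing through CloudFlare, it helps to confirm what the origin itself returns, independent of any CDN caching. One way (this is just a sketch; 203.0.113.10 is a placeholder for my origin server's real IP) is to make curl connect straight to the origin while keeping the Host name intact:

# Connect directly to the origin server, bypassing CloudFlare, while simulating proximic
curl --resolve www.chtoen.com:80:203.0.113.10 -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA

With the proximic user-agent this should come back 403 straight from the origin, and with the Googlebot user-agent it should come back 200, which is the setup assumed throughout this post.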

The following are the scenarios I tested, using the "curl" command to simulate bots by sending their user-agent strings.

The commands in the following sections were run back to back; in other words, they were executed in the same session, in the order shown.
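Every test below follows the same pattern: -I makes curl issue a HEAD request and print only the response headers, and -A sets the User-Agent string so the request looks like it came from the bot being simulated.

# Generic shape of every test command in this post
curl -I -A "<user-agent string of the simulated bot>" "<the URL>"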

Does CF hit the origin server for requests from bots?

Let's pretend I am the proximic bot and hit the URL.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 200 OK
Date: Tue, 20 Sep 2016 02:16:04 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Set-Cookie: __cfduid=d854007058e1046d861a854c6554499901474337764; expires=Wed, 20-Sep-17 02:16:04 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
Cache-Control: public, max-age=604800
Expires: Tue, 27 Sep 2016 02:16:04 GMT
X-Powered-By: PHP/5.5.9-1ubuntu4.19
X-Page-Speed: 1.11.33.2-0
CF-Cache-Status: HIT
Server: cloudflare-nginx
CF-RAY: 2e51c5f1603c0823-SIN

Now I purge the CloudFlare cache for the URL.
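The purge can be done from the CloudFlare dashboard; it can also be scripted against CloudFlare's v4 API. The command below is only a sketch, and the zone ID, e-mail address and API key are placeholders:

# Sketch: purge a single URL from the CloudFlare cache via the v4 API (all credentials below are placeholders)
curl -X POST "https://api.cloudflare.com/client/v4/zones/YOUR_ZONE_ID/purge_cache" -H "X-Auth-Email: you@example.com" -H "X-Auth-Key: YOUR_API_KEY" -H "Content-Type: application/json" --data '{"files":["http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA"]}'

With the cache purged, I issue the same request as proximic again: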

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 403 Forbidden
Date: Tue, 20 Sep 2016 02:17:06 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: __cfduid=d2112befe96b909c3890c43acbef4442d1474337826; expires=Wed, 20-Sep-17 02:17:06 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
CF-Cache-Status: MISS
Server: cloudflare-nginx
CF-RAY: 2e51c776f510310e-SIN

As you can see, the answer is YES: once the cache is purged, CF-Cache-Status is MISS, the request goes to my origin server, and the 403 my origin returns to proximic comes back. CF does hit the origin server for HTTP requests from bots.

Does CloudFlare cache 403 responses?

As mentioned earlier, the following command was run immediately after the previous one.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 403 Forbidden
Date: Tue, 20 Sep 2016 02:17:11 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: __cfduid=d7aca34979bcefffdfbecf672dde7e1711474337831; expires=Wed, 20-Sep-17 02:17:11 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
CF-Cache-Status: HIT
Server: cloudflare-nginx
CF-RAY: 2e51c794e15f31da-SIN

Googlebot gets a 403 in this situation! And CF-Cache-Status is HIT, so CF does cache the 403 response and serves it to a different bot!

That raises the question: will a web browser get a 403 if a visitor hits the URL now? Read on to find the answer.

However, the cached 403 expires in about one minute, as you can see in the following:

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 200 OK
Date: Tue, 20 Sep 2016 02:19:01 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Set-Cookie: __cfduid=db4a58c37d86229ca0aef5cf49ff437c01474337941; expires=Wed, 20-Sep-17 02:19:01 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
Cache-Control: public, max-age=604800
Expires: Tue, 27 Sep 2016 02:19:01 GMT
X-Powered-By: PHP/5.5.9-1ubuntu4.19
X-Page-Speed: 1.11.33.2-0
CF-Cache-Status: EXPIRED
Server: cloudflare-nginx
CF-RAY: 2e51ca47642d31a4-SIN

Since Googlebot's request refreshed the cache with a 200, proximic now gets a 200 too.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 200 OK
Date: Tue, 20 Sep 2016 02:19:27 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Set-Cookie: __cfduid=d683f15c31e6a60e6e1f79d7f67effcbb1474337967; expires=Wed, 20-Sep-17 02:19:27 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
Cache-Control: public, max-age=604800
Expires: Tue, 27 Sep 2016 02:19:27 GMT
X-Powered-By: PHP/5.5.9-1ubuntu4.19
X-Page-Speed: 1.11.33.2-0
CF-Cache-Status: HIT
Server: cloudflare-nginx
CF-RAY: 2e51cae9040831aa-SIN

Since the cached response is a 200, CF keeps it according to my Cache-Control header, which specifies 7 days (max-age=604800 seconds). Let's wait a few minutes and hit the URL with the proximic user-agent string again to prove this.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 200 OK
Date: Tue, 20 Sep 2016 02:27:16 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Set-Cookie: __cfduid=d4e6c70d1a6580312cd0551b7bbd590081474338436; expires=Wed, 20-Sep-17 02:27:16 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
Cache-Control: public, max-age=604800
Expires: Tue, 27 Sep 2016 02:27:16 GMT
X-Powered-By: PHP/5.5.9-1ubuntu4.19
X-Page-Speed: 1.11.33.2-0
CF-Cache-Status: HIT
Server: cloudflare-nginx
CF-RAY: 2e51d65c80bb3084-SIN

By comparing the Date header of each run, you can see that when a bot gets a 403, CF caches the 403 response for about one minute and serves it to other bots. When a bot gets a 200, CF caches it per the response's Cache-Control headers and serves that to other bots as well.

I say "for other bots" above because so far we have only been simulating HTTP requests from bots. The claim may not extend beyond that, but it's an educated guess.

An important conclusion we can draw is that you cannot block any bots by having your origin server send a 403, because you may cause good bots to get the cached 403 as well, even if only for a window of one minute at most.

Does CF treat bots differently from normal traffic?

Now let's test the case where a web browser hits the URL immediately after a bot gets a 403. Will the web browser get a 403 too? If so, that's a disaster.

Let's purge the CF cache again.

Then let's hit the URL with the proximic user-agent string.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 403 Forbidden
Date: Tue, 20 Sep 2016 02:31:16 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: __cfduid=dfd859835e59964beb57f59e3f11d43bb1474338676; expires=Wed, 20-Sep-17 02:31:16 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
CF-Cache-Status: MISS
Server: cloudflare-nginx
CF-RAY: 2e51dc35077b3204-SIN

Now let's hit the URL with the Chrome web browser. The result: my origin server receives the request and Chrome gets a 200, meaning CF went to the origin even though the one minute was not yet up! Now CF has the webpage in its cache.

This means CF treats traffic from bots differently and separately from traffic from web browsers!

This is good to know, because even though I return a 403 to some bots, I can rest assured that CF will never return a 403 to human visitors.

Don't believe this is the case? Read on.

Verifying that CF treats human visitors and bots differently

Let's be proximic again.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 403 Forbidden
Date: Tue, 20 Sep 2016 02:32:40 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: __cfduid=d943da911a8a92d73c9ccb606811b4f821474338760; expires=Wed, 20-Sep-17 02:32:40 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
CF-Cache-Status: EXPIRED
Server: cloudflare-nginx
CF-RAY: 2e51de46251d0823-SIN

Funnily enough, proximic gets a 403 even though CF has this webpage in its cache. My origin server is hit and returns the 403.

I use Chrome to hit the URL again by pressing F5, and I get CF-Cache-Status: HIT in the HTTP response headers. The origin server is indeed not hit.

Let's be proximic again now.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 403 Forbidden
Date: Tue, 20 Sep 2016 02:34:11 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: __cfduid=d64a84ba6df35d2f8cdf76ccd840cd94d1474338851; expires=Wed, 20-Sep-17 02:34:11 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
CF-Cache-Status: EXPIRED
Server: cloudflare-nginx
CF-RAY: 2e51e07f41d53174-SIN

My origin server is hit and returns 403.

Now I open the URL in Chrome and Firefox, and both show CF-Cache-Status: HIT in the response headers! The origin server is not hit.

It means CF does treat bots and human traffic differently and separately!

Just to be sure, let's be Googlebot and hit the URL again.

ubuntu@daltrac:~$ curl -I -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 200 OK
Date: Tue, 20 Sep 2016 02:35:36 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Set-Cookie: __cfduid=d4ad578ec6957d13740a15ad911d9fc441474338936; expires=Wed, 20-Sep-17 02:35:36 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
Cache-Control: public, max-age=604800
Expires: Tue, 27 Sep 2016 02:35:36 GMT
X-Powered-By: PHP/5.5.9-1ubuntu4.19
X-Page-Speed: 1.11.33.2-0
CF-Cache-Status: EXPIRED
Server: cloudflare-nginx
CF-RAY: 2e51e291f676320a-SIN

The origin server is hit. It's good that CF gives my origin server a chance to handle traffic from bots and crawlers itself.

Actually, ignore the above. I was unknowingly testing against two different CF data centers, Los Angeles and Singapore (note the LAX and SIN suffixes in the CF-RAY headers), and that explains the caching behavior I observed.
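An easy way to tell which CloudFlare data center answered a given request is to look at the suffix of the CF-RAY response header (SIN = Singapore, LAX = Los Angeles in the transcripts above). For example:

# Print only the CF-RAY header; its last token names the CloudFlare data center that served the request
curl -sI -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA | grep -i "^CF-RAY"

If two test machines print different suffixes, their requests are hitting different caches and the results cannot be compared directly.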

I have now retested everything through a single CF data center, and the following is the correct behavior.

W:\>curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 403 Forbidden
Date: Tue, 20 Sep 2016 05:03:43 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: __cfduid=d5885600e42b92ab87ef6ed46997384a21474347823; expires=Wed, 20-Sep-17 05:03:43 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
CF-Cache-Status: EXPIRED
Server: cloudflare-nginx
CF-RAY: 2e52bb87410c2258-LAX

Then I use Chrome to go to the URL and I get a 403! Oh my god, this is not what I want!

It means CF caches the 403 for one minute and serves it to everyone, even though my intent is for my origin server to return a 403 only to undesired bots. The good thing is that CF only caches 403 responses for about one minute. When the minute is up, I refresh Chrome and get a 200, which is what I want.
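To watch that window close, a small shell loop like the one below (just a sketch; the desktop Chrome user-agent string is arbitrary) polls the URL every 15 seconds and prints the status line, Date and CF-Cache-Status. The 403 should flip back to a 200 within roughly a minute:

# Poll the URL with a browser-like user-agent and watch the cached 403 expire
URL="http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA"
UA="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
for i in 1 2 3 4 5 6; do
  curl -sI -A "$UA" "$URL" | grep -iE "^(HTTP/|Date:|CF-Cache-Status:)"
  echo "----"
  sleep 15
done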

Then I issue the same curl command again after a few minutes:

W:\>curl -I -A "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)" http://www.chtoen.com/%E6%8B%9B%E7%94%9F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA
HTTP/1.1 200 OK
Date: Tue, 20 Sep 2016 05:10:37 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Set-Cookie: __cfduid=dc30dad9556d605a7e0cc92b7ac3bf70d1474348237; expires=Wed, 20-Sep-17 05:10:37 GMT; path=/; domain=.chtoen.com; HttpOnly
Vary: Accept-Encoding
Cache-Control: public, max-age=604800
Expires: Tue, 27 Sep 2016 05:10:37 GMT
X-Powered-By: PHP/5.5.9-1ubuntu4.19
X-Page-Speed: 1.11.33.2-0
CF-Cache-Status: HIT
Server: cloudflare-nginx
CF-RAY: 2e52c5a530f853e4-LAX

And it's a HIT, because the 200 response, unlike a 403, is cached by CF for much longer than one minute (per my Cache-Control header).

Therefore, the conclusion is that my origin server should NOT return a 403 to anybody, because CF will cache that 403 for about one minute and, undesirably, serve it to everyone.

Questions? Let me know!
Please leave a comment here!
One Minute Information - by Michael Wen