
Facebook externalhit_uatext robot lowercasing urls

Tags:

url

facebook

I'm working on a site that has mixed-case URLs, similar to YouTube. We generate IDs on the server, and I chose base 62 (digits plus lowercase and uppercase letters) so they would be shorter. So the URLs might be something like example.com/user/123AbCaBc. The Facebook robot seems to be hitting my site regularly with an all-lowercase version, example.com/user/123abcabc. This causes a 404 error, as the all-lowercase ID isn't in the database.
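To illustrate why the lowercased URL cannot match, here is a minimal sketch of a base 62 encoder and decoder. The alphabet ordering and the function names `toBase62` and `fromBase62` are assumptions made for this example, not the scheme the site actually uses.

```javascript
// Hypothetical alphabet: digits, then uppercase, then lowercase (62 symbols).
const ALPHABET =
  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Encode a non-negative safe integer as a base 62 string.
function toBase62(n) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError('expected a non-negative safe integer');
  }
  if (n === 0) return ALPHABET[0];
  let out = '';
  while (n > 0) {
    out = ALPHABET[n % 62] + out; // prepend the least significant digit
    n = Math.floor(n / 62);
  }
  return out;
}

// Decode a base 62 string back to a number; case is significant.
function fromBase62(s) {
  let n = 0;
  for (const ch of s) {
    const digit = ALPHABET.indexOf(ch);
    if (digit === -1) throw new RangeError('invalid character: ' + ch);
    n = n * 62 + digit;
  }
  return n;
}
```

Because uppercase and lowercase letters are distinct digits, `fromBase62('123AbCaBc')` and `fromBase62('123abcabc')` decode to different numbers, so a lowercased URL points at a different (most likely nonexistent) record.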

According to the logs, there aren't other user agents creating 404s, so this is for sure a robot and not a human. Here's the user agent I'm seeing:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

This happens about once every 4 minutes. I'm not currently logging non-404 hits, so I'm not sure if there are others to the non-lowercase version.

The server tech here is Node.js / MongoDB, but I don't see how that is relevant to the issue at hand.

Is there something I can do to fix this on the Facebook side? Is there a real problem here, or should I just squelch these log errors? Has anyone else had a similar problem?

Will Shaver asked Oct 19 '22 20:10



1 Answer

It's possible that your Node web server application (are you using Express?) currently doesn't support byte ranges. The Facebook crawler apparently falls back to lowercasing the URL in that case, as described here:

  • https://mail.habari.co.tz/pipermail/linux/2013-June/000180.html

Have a look at

  • http://derickbailey.com/2014/04/28/check-http-byte-range-request-header-with-nodejs-and-expressjs/
  • http://www.codeproject.com/Articles/813480/HTTP-Partial-Content-In-Node-js

on how to fix this.

Tobi answered Oct 22 '22 22:10
