There are a great many crawlers on the web. Some help a site get indexed, such as Baiduspider; others neither obey the robots rules nor bring the site any traffic, yet still put load on the server. They will not be named here.
【How to block crawlers in Nginx】
if ($http_user_agent ~* "Scrapy|Baiduspider|Curl|HttpClient|Bytespider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 403;
}
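For context, this if block lives in the server (or location) context of the site's configuration. A minimal sketch of the placement, where the server_name, document root, and the shortened user-agent list are placeholders, not values from this article:

```nginx
server {
    listen 80;
    server_name example.com;   # placeholder

    # Deny requests whose User-Agent matches the blacklist.
    # List shortened here for readability; "^$" also catches empty User-Agents.
    if ($http_user_agent ~* "Scrapy|Bytespider|AhrefsBot|^$") {
        return 403;
    }

    location / {
        root /var/www/html;    # placeholder
    }
}
```

After editing, reload with nginx -s reload (after nginx -t to check the syntax).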
To send such requests to another page instead, simply replace return 403 with the corresponding address:
if ($http_user_agent ~* "Scrapy|Baiduspider|Curl|HttpClient|Bytespider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 301 https://yoursite.com;
}
To block visitors arriving from specific referrers:
if ($http_referer ~ "baidu\.com|google\.net|bing\.com") { return 403; }
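Nginx also ships a valid_referers directive for referrer checks, which takes the inverse, whitelist approach: list the referrers you accept and reject everything else. A sketch under the assumption that direct visits (no Referer header) should stay allowed; example.com is a placeholder:

```nginx
# "none" allows requests without a Referer header,
# "blocked" allows Referers stripped by proxies/firewalls,
# "server_names" allows this server's own names.
valid_referers none blocked server_names *.example.com;

if ($invalid_referer) {
    return 403;
}
```

Whether the blacklist regex above or this whitelist fits better depends on whether you can enumerate the referrers you want to keep.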
To allow only GET, HEAD, and POST requests:
# forbid any request method other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
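Nginx offers a built-in alternative for this, the limit_except directive inside a location block. One caveat: allowing GET implicitly allows HEAD as well, so the sketch below matches the rule above:

```nginx
location / {
    # Allow only GET (HEAD is implied) and POST; everything else gets 403.
    limit_except GET POST {
        deny all;
    }
}
```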
【Blocking crawlers in Apache】
With the mod_rewrite module confirmed enabled, add the following to the .htaccess file or the relevant .conf file:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" [NC]
RewriteRule . - [R=403,L]

Note the pattern must be quoted because entries such as Indy Library contain spaces.
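If mod_rewrite is not available, Apache 2.4 can achieve the same effect with mod_setenvif plus Require directives. A sketch using AhrefsBot as a stand-in for the full blacklist; the directory path is a placeholder:

```apache
# Tag matching requests with the bad_bot environment variable,
# then deny any request that carries it.
SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot

<Directory "/var/www/html">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Directory>
```

Repeat the SetEnvIfNoCase line (or extend its regex) for each crawler you want to tag.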