I'll start with the cheapest and most brute-force options; no padding, let's get straight to it.

Method 1: Host your domain's DNS on Cloudflare and block AI crawlers with one click

If you can't reach Cloudflare, you'll have to sort out a VPN yourself.
(For domestic domains this barely affects access speed; some people assume a domestic DNS provider is faster, but in practice the speeds are about the same.)
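On Cloudflare the switch itself lives in the dashboard (Security → Bots, the setting that blocks AI crawlers and scrapers). If you'd rather have an explicit rule you can audit, a WAF custom rule with a Block action does the same job. A minimal sketch of the rule expression, matching a few well-known AI crawler UAs (the list is illustrative, extend it as you see fit; contains is case-sensitive, so wrap the field in lower() for looser matching):

  (http.user_agent contains "GPTBot")
  or (http.user_agent contains "ClaudeBot")
  or (http.user_agent contains "Amazonbot")
  or (http.user_agent contains "PetalBot")
  or (http.user_agent contains "CCBot")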

Method 2: Block AI crawlers with the 宝塔 (BT Panel) firewall (I'm running a cracked copy of BT, so I can't say whether the free edition has this setting). The names below go into the firewall's User-Agent blacklist; a sketch of the equivalent nginx rule follows the list.

  Amazonbot
  ClaudeBot
  PetalBot
  GPTBot
  Ahrefs
  Semrush
  Imagesift
  Teoma
  ia_archiver
  twiceler
  MSNBot
  Scrubby
  Robozilla
  Gigabot
  yahoo-mmcrawler
  yahoo-blogs/v3.9
  psbot
  Scrapy
  SemrushBot
  AhrefsBot
  Applebot
  AspiegelBot
  DotBot
  DataForSeoBot
  java
  MJ12bot
  python
  seo
  Censys

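If you're not on BT, note that a panel UA blacklist like this ultimately amounts to a user-agent match at the web-server level. A minimal nginx sketch of the same idea, using a few names from the list (BT's actual generated rules may differ; Method 4 below does this by hand in full):

  # Return 403 to any client whose User-Agent contains a blacklisted token.
  # ~* makes the regex match case-insensitively.
  if ($http_user_agent ~* (Amazonbot|ClaudeBot|PetalBot|GPTBot|Ahrefs|Semrush|Imagesift|Scrapy|MJ12bot|Censys)) {
      return 403;
  }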

Method 3: Copy the rules below, save them as robots.txt, and upload the file to your site's root directory (a quick way to verify the file parses as intended is sketched after the rules).

  User-agent: AhrefsBot
  Disallow: /
  User-agent: SemrushBot
  Disallow: /
  User-agent: Imagesift
  Disallow: /
  User-agent: Amazonbot
  Disallow: /
  User-agent: GPTBot
  Disallow: /
  User-agent: ClaudeBot
  Disallow: /
  User-agent: PetalBot
  Disallow: /
  User-agent: Baiduspider
  Disallow:
  User-agent: Sosospider
  Disallow:
  User-agent: sogou spider
  Disallow:
  User-agent: YodaoBot
  Disallow:
  User-agent: Googlebot
  Disallow:
  User-agent: Bingbot
  Disallow:
  User-agent: Slurp
  Disallow:
  User-agent: Teoma
  Disallow: /
  User-agent: ia_archiver
  Disallow: /
  User-agent: twiceler
  Disallow: /
  User-agent: MSNBot
  Disallow: /
  User-agent: Scrubby
  Disallow: /
  User-agent: Robozilla
  Disallow: /
  User-agent: Gigabot
  Disallow: /
  User-agent: googlebot-image
  Disallow:
  User-agent: googlebot-mobile
  Disallow:
  User-agent: yahoo-mmcrawler
  Disallow: /
  User-agent: yahoo-blogs/v3.9
  Disallow: /
  User-agent: psbot
  Disallow:
  User-agent: dotbot
  Disallow: /

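Keep in mind robots.txt is purely advisory: well-behaved crawlers honor it, scrapers ignore it. Before uploading, you can check that the file parses the way you expect with Python's standard urllib.robotparser; a minimal sketch, assuming the rules above are saved locally as robots.txt:

  from urllib.robotparser import RobotFileParser

  # Load and parse the local robots.txt file.
  rp = RobotFileParser()
  with open("robots.txt", encoding="utf-8") as f:
      rp.parse(f.read().splitlines())

  # Crawlers listed with "Disallow: /" should be refused everywhere...
  assert not rp.can_fetch("GPTBot", "/some-article.html")
  assert not rp.can_fetch("AhrefsBot", "/")
  # ...while search engines listed with an empty Disallow stay welcome.
  assert rp.can_fetch("Googlebot", "/some-article.html")
  assert rp.can_fetch("Baiduspider", "/")
  print("robots.txt behaves as expected")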

 

Method 4: Stop your site from being scraped (paste the following into the site's nginx configuration file in the BT panel; see the placement sketch after the block)

  # Block scraping by tools such as Scrapy
  if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
      return 403;
  }
  # Block the listed User-Agents, as well as requests with an empty UA (^$)
  if ($http_user_agent ~* "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|LinkpadBot|Ezooms|^$") {
      return 403;
  }
  # Block any request method other than GET, HEAD or POST
  if ($request_method !~ ^(GET|HEAD|POST)$) {
      return 403;
  }

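For placement: nginx only allows these if directives inside the server and location contexts, so they belong in the site's server block (the per-site config file that BT manages), ahead of the location blocks so they apply site-wide. A minimal sketch, with example.com and the web root as placeholders:

  server {
      listen 80;
      server_name example.com;              # placeholder: your domain

      # The UA and request-method rules from above go here, so every
      # request to the site passes through them.
      if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
          return 403;
      }

      location / {
          root /www/wwwroot/example.com;    # placeholder: your web root
          index index.html;
      }
  }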

Save after adding the rules and restart nginx; from then on, these spiders and scanning tools will get a 403 Forbidden when they probe the site.
Note: if you publish posts with the 火车头 (LocoySpider) collector, the code above will return 403 and publishing will fail. To keep 火车头 publishing working, use the code below instead; it is identical except that the empty-User-Agent match (^$) is dropped, so requests that send no UA are let through:

  # Block scraping by tools such as Scrapy
  if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
      return 403;
  }
  # Block the listed User-Agents (empty-UA requests are allowed through here)
  if ($http_user_agent ~* "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|LinkpadBot|Ezooms") {
      return 403;
  }
  # Block any request method other than GET, HEAD or POST
  if ($request_method !~ ^(GET|HEAD|POST)$) {
      return 403;
  }


Once everything is set, simulate a few crawls to check that no good spiders got caught in the crossfire (a probe script follows the reference table below). For clarity: the names blocked above do not include the following six common search-engine spiders:

  Baidu: Baiduspider
  Google: Googlebot
  Bing: bingbot
  Sogou: Sogou web spider
  360: 360Spider
  Shenma: YisouSpider

Common scraper and attack User-Agents, for reference:

  FeedDemon              content scraping
  BOT/0.1 (BOT for JCE)  SQL injection
  CrawlDaddy             SQL injection
  Java                   content scraping
  Jullo                  content scraping
  Feedly                 content scraping
  UniversalFeedParser    content scraping
  ApacheBench            CC attack tool
  Swiftbot               useless crawler
  YandexBot              useless crawler
  AhrefsBot              useless crawler
  jikeSpider             useless crawler
  MJ12bot                useless crawler
  ZmEu phpmyadmin        vulnerability scanner
  WinHttp                scraping / CC attacks
  EasouSpider            useless crawler
  HttpClient             TCP attacks
  Microsoft URL Control  scanner
  YYSpider               useless crawler
  jaunty                 WordPress brute-force scanner
  oBot                   useless crawler
  Python-urllib          content scraping
  Indy Library           scanner
  FlightDeckReports Bot  useless crawler
  Linguee Bot            useless crawler

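To "simulate a crawl", replay requests against your site with a few of the blocked User-Agents and confirm they draw a 403, while a normal browser UA and the search-engine spiders still get 200. A minimal sketch using only the Python standard library; https://example.com/ is a placeholder for your own site:

  import urllib.request
  import urllib.error

  SITE = "https://example.com/"  # placeholder: your own site

  # UAs the nginx rules above should reject, and UAs that must stay allowed.
  blocked = [
      "Scrapy/2.11.0",
      "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
      "Python-urllib/3.11",
  ]
  allowed = [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
      "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
  ]

  def status_for(ua):
      """Fetch SITE with the given User-Agent and return the HTTP status code."""
      req = urllib.request.Request(SITE, headers={"User-Agent": ua})
      try:
          with urllib.request.urlopen(req, timeout=10) as resp:
              return resp.status
      except urllib.error.HTTPError as e:
          return e.code

  for ua in blocked:
      print(f"{status_for(ua):>3}  (expect 403)  {ua}")
  for ua in allowed:
      print(f"{status_for(ua):>3}  (expect 200)  {ua}")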

Disclaimer: Unless otherwise noted, all articles on this site are original content published by this site. Without this site's consent, no individual or organization may copy, misappropriate, scrape, or republish this site's content on any website, in any book, or on any other media platform. If any content on this site infringes the legitimate rights of an original author, contact us and we will handle it.