In production environments, unauthorized scraping can become a serious threat, ranging from resource exhaustion to data exfiltration. Tools like Scrapy, curl, or python-requests are commonly used for automated crawling that often ignores robots.txt directives.
This article explains how to block such bots in OpenLiteSpeed, Apache, and Nginx using User-Agent and IP-based filtering, including stricter validation of legitimate Googlebot requests.
🔧 Why Block Bots?
Blocking automated traffic helps:
- Reduce CPU and bandwidth consumption.
- Prevent scraping-related data leaks.
- Minimize database load and system stress.
The key lies in applying server-level filters, rather than relying solely on external services.
🗂️ Blocking in Apache and OpenLiteSpeed via .htaccess
On Apache or OpenLiteSpeed servers with .htaccess enabled, use the following rewrite rules:
# Recommendations for blocking bots overloading your site
# Courtesy of https://wpdirecto.com and https://administraciondesistemas.com
RewriteEngine On
# Block empty User-Agent
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^.* - [F,L]
# Block common scraping tools
RewriteCond %{HTTP_USER_AGENT} (scrapy|httpclient|python-requests|curl|wget|libwww|httpunit|nutch) [NC]
RewriteRule ^.* - [F,L]
# Optionally block methods regular visitors don't use
# (note: blocking HEAD can affect monitoring tools and link checkers)
RewriteCond %{REQUEST_METHOD} ^(HEAD|OPTIONS)$
RewriteRule ^.* - [F,L]
# Block fake Googlebot (accept only known IP ranges)
RewriteCond %{HTTP_USER_AGENT} "Googlebot" [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteCond %{REMOTE_ADDR} !^192\.178\.
RewriteCond %{REMOTE_ADDR} !^34\.(100|101|118|126|147|151|152|154|155|165|175|176|22|64|65|80|88|89|96)\.
RewriteRule ^.* - [F,L]
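The IP prefixes in the last rule are only a snapshot and can change over time. A complementary check, and the one Google documents for verifying Googlebot, is a reverse/forward DNS lookup on the suspicious address. A quick manual sketch from the shell (66.249.66.1 is just an example address; substitute an IP from your own logs):
# Reverse lookup: a genuine Googlebot IP should resolve to a *.googlebot.com
# or *.google.com hostname, e.g. crawl-66-249-66-1.googlebot.com
host 66.249.66.1
# Forward lookup: the returned hostname must resolve back to the original IP
host crawl-66-249-66-1.googlebot.com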
📌 Notes for OpenLiteSpeed
- In WebAdmin → Virtual Hosts → Your Domain → Context → /, ensure Allow Override is set to Yes.
- Restart the server to apply the changes:
/usr/local/lsws/bin/lswsctrl restart
⚙️ Configuration for Nginx
Nginx doesn't use .htaccess, so you must edit nginx.conf or the relevant server blocks directly:
server {
    ...

    # Block empty User-Agent
    if ($http_user_agent = "") {
        return 403;
    }

    # Block common scraping tools
    if ($http_user_agent ~* "(scrapy|httpclient|python-requests|curl|wget|libwww|httpunit|nutch)") {
        return 403;
    }

    # Validate real Googlebot IPs.
    # Nginx "if" cannot be nested or combined with &&, so emulate the
    # AND condition with an intermediate variable.
    set $fake_googlebot 0;
    if ($http_user_agent ~* "Googlebot") {
        set $fake_googlebot 1;
    }
    if ($remote_addr ~ "^(66\.249\.|192\.178\.|34\.(100|101|118|126|147|151|152|154|155|165|175|176|22|64|65|80|88|89|96)\.)") {
        set $fake_googlebot 0;
    }
    if ($fake_googlebot) {
        return 403;
    }

    # Optional: block methods regular visitors don't use
    # (note: blocking HEAD can affect monitoring tools and link checkers)
    if ($request_method ~ "^(HEAD|OPTIONS)$") {
        return 403;
    }

    ...
}
After changes, validate and reload Nginx:
sudo nginx -t && sudo systemctl reload nginx
🔍 Test Your Configuration
Simulate a Scrapy request using curl:
curl -A "Scrapy/2.9.0 (+https://scrapy.org)" -I https://systemadministration.net
Expected result:
HTTP/1.1 403 Forbidden
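The other rules can be exercised the same way (same test domain as above; the Googlebot string below is just an example of a spoofed agent):
# Request with no User-Agent header at all (the empty-UA rule should return 403)
curl -H "User-Agent:" -I https://systemadministration.net
# Spoofed Googlebot sent from a non-Google IP (the IP-validation rule should return 403)
curl -A "Googlebot/2.1 (+http://www.google.com/bot.html)" -I https://systemadministration.net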
🛡️ Additional Tips
- Cloudflare WAF: Block suspicious User-Agents before they reach your infrastructure.
- Rate Limiting: Throttle IPs with excessive requests per second.
- iptables/firewalld: Block entire IP ranges when persistent abuse is detected (see the sketch after this list).
- ModSecurity (Apache/Nginx): Use OWASP Core Rule Set for broader filtering.
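As a minimal sketch of the firewall option, assuming persistent abuse from a single range (203.0.113.0/24 below is the documentation example network, not a real abuser):
# iptables: drop all traffic from the offending range
sudo iptables -I INPUT -s 203.0.113.0/24 -j DROP
# firewalld equivalent
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="203.0.113.0/24" drop'
sudo firewall-cmd --reload
For larger or frequently changing block lists, an ipset or nftables set keeps the rule count manageable.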
✅ Conclusion
Blocking abusive scraping bots is critical to keeping your infrastructure secure and efficient. Whether you use OpenLiteSpeed, Apache, or Nginx, these server-level rules provide powerful protection. Combine them with observability, threat intelligence, and CDN-level security for even stronger resilience.