In production environments, unauthorized scraping can become a serious threat—ranging from resource exhaustion to data exfiltration. Tools like Scrapy, curl, or python-requests are commonly used to perform automated crawling that often ignores robots.txt directives.

This article explains how to block such bots in OpenLiteSpeed, Apache, and Nginx, using User-Agent and IP-based filtering, including stricter validation of legitimate Googlebot requests.


🔧 Why Block Bots?

Blocking automated traffic helps:

  • Reduce CPU and bandwidth consumption.
  • Prevent scraping-related data leaks.
  • Minimize database load and system stress.

The key lies in applying server-level filters, rather than relying solely on external services.


🗂️ Blocking in Apache and OpenLiteSpeed via .htaccess

On Apache or OpenLiteSpeed servers with .htaccess enabled, use the following rewrite rules:

# Recommendations for blocking bots overloading your site
# Courtesy of https://wpdirecto.com and https://administraciondesistemas.com

RewriteEngine On

# Block empty User-Agent
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^.* - [F,L]

# Block common scraping tools
RewriteCond %{HTTP_USER_AGENT} (scrapy|httpclient|python-requests|curl|wget|libwww|httpunit|nutch) [NC]
RewriteRule ^.* - [F,L]

# Optionally block less common methods
RewriteCond %{REQUEST_METHOD} ^(HEAD|OPTIONS)$
RewriteRule ^.* - [F,L]

# Block fake Googlebot (accept only known IP ranges)
RewriteCond %{HTTP_USER_AGENT} "Googlebot" [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteCond %{REMOTE_ADDR} !^192\.178\.
RewriteCond %{REMOTE_ADDR} !^34\.(100|101|118|126|147|151|152|154|155|165|175|176|22|64|65|80|88|89|96)\.
RewriteRule ^.* - [F,L]
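
Google's crawler IP ranges change over time, so treat the ranges above as a snapshot. A quick manual check, independent of which web server you run, is a reverse DNS lookup followed by a forward lookup; a minimal sketch (the IP and hostname shown are only illustrative):

# Reverse-resolve the client IP; genuine Googlebot traffic resolves to a googlebot.com or google.com hostname
host 66.249.66.1

# Forward-confirm the returned hostname to rule out a spoofed PTR record
host crawl-66-249-66-1.googlebot.com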

📌 Notes for OpenLiteSpeed

  • In WebAdmin, go to Virtual Hosts → Your Domain → Context → /, and make sure Allow Override is set to Yes.
  • Restart the server to apply changes:
/usr/local/lsws/bin/lswsctrl restart

⚙️ Configuration for Nginx

Nginx doesn’t use .htaccess, so you must edit nginx.conf or relevant server blocks directly:

server {
    ...

    # Block empty User-Agent
    if ($http_user_agent = "") {
        return 403;
    }

    # Block scraping tools
    if ($http_user_agent ~* (scrapy|httpclient|python-requests|curl|wget|libwww|httpunit|nutch)) {
        return 403;
    }

    # Validate real Googlebot IPs.
    # Nginx does not support nested "if" blocks or combined conditions,
    # so flag suspicious requests with a variable instead.
    set $fake_googlebot 0;
    if ($http_user_agent ~* "Googlebot") { set $fake_googlebot 1; }
    if ($remote_addr ~ ^66\.249\.)  { set $fake_googlebot 0; }
    if ($remote_addr ~ ^192\.178\.) { set $fake_googlebot 0; }
    if ($remote_addr ~ ^34\.(100|101|118|126|147|151|152|154|155|165|175|176|22|64|65|80|88|89|96)\.) { set $fake_googlebot 0; }
    if ($fake_googlebot) { return 403; }

    # Optional: block uncommon request methods
    if ($request_method ~ ^(HEAD|OPTIONS)$) {
        return 403;
    }

    ...
}

After changes, validate and reload Nginx:

sudo nginx -t && sudo systemctl reload nginx

🔍 Test Your Configuration

Simulate a Scrapy request using curl:

curl -A "Scrapy/2.9.0 (+https://scrapy.org)" -I https://systemadministration.net

Expected result:

HTTP/1.1 403 Forbidden
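
To also exercise the fake-Googlebot rule, repeat the request with a spoofed Googlebot User-Agent from your own (non-Google) IP; the expected response is again 403 Forbidden:

curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -I https://systemadministration.net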

🛡️ Additional Tips

  • Cloudflare WAF: Block suspicious User-Agents before they reach your infrastructure.
  • Rate Limiting: Throttle IPs with excessive requests per second.
  • iptables/firewalld: Block entire IP ranges when persistent abuse is detected (see the sketch after this list).
  • ModSecurity (Apache/Nginx): Use OWASP Core Rule Set for broader filtering.
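
As a minimal sketch for the iptables/firewalld option, assuming an abusive range of 203.0.113.0/24 (a documentation range used here only as a placeholder):

# Drop all traffic from the abusive range with iptables
sudo iptables -I INPUT -s 203.0.113.0/24 -j DROP

# firewalld equivalent
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="203.0.113.0/24" drop'
sudo firewall-cmd --reload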

✅ Conclusion

Blocking abusive scraping bots is critical to keeping your infrastructure secure and efficient. Whether you use OpenLiteSpeed, Apache, or Nginx, these server-level rules provide powerful protection. Combine them with observability, threat intelligence, and CDN-level security for even stronger resilience.
