In production environments, unauthorized scraping can become a serious threat, with consequences ranging from resource exhaustion to data exfiltration. Tools such as Scrapy, curl, or python-requests are commonly used for automated crawling that often ignores robots.txt directives.

This article explains how to block such bots in OpenLiteSpeed, Apache, and Nginx, using User-Agent and IP-based filtering, including stricter validation of legitimate Googlebot requests.


🔧 Why Block Bots?

Blocking automated traffic helps:

  • Reduce CPU and bandwidth consumption.
  • Prevent scraping-related data leaks.
  • Minimize database load and system stress.

The key lies in applying server-level filters, rather than relying solely on external services.


🗂️ Blocking in Apache and OpenLiteSpeed via .htaccess

On Apache or OpenLiteSpeed servers with .htaccess enabled, use the following rewrite rules:

# Recommendations for blocking bots overloading your site
# Courtesy of https://wpdirecto.com and https://administraciondesistemas.com

RewriteEngine On

# Block common scraping tools
RewriteCond %{HTTP_USER_AGENT} (scrapy|python-requests|curl|wget|libwww|httpunit|nutch) [NC]
RewriteRule ^.* - [F,L]
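Many scrapers also send no User-Agent header at all. An optional companion rule (a sketch, same .htaccess context as above) rejects those requests too; note that some monitoring agents and health checks also omit the header, so test before enabling it in production:

# Optional: block requests that send no User-Agent at all
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^ - [F,L]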

📌 Notes for OpenLiteSpeed

  • In WebAdmin → Virtual Hosts → Your Domain → Context → /, ensure Allow Override is set to Yes.
  • Restart the server to apply changes:
/usr/local/lsws/bin/lswsctrl restart

⚙️ Configuration for Nginx

Nginx doesn’t use .htaccess, so you must edit nginx.conf or relevant server blocks directly:

server {
    ...

    # Block scraping tools
    if ($http_user_agent ~* (scrapy|python-requests|curl|wget|libwww|httpunit|nutch)) {
        return 403;
    }

    ...
}
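An alternative that scales better as the pattern list grows is a map in the http {} block, which keeps the matching logic in one place; a minimal sketch, where $block_bot is an arbitrary variable name:

map $http_user_agent $block_bot {
    default 0;
    ~*(scrapy|python-requests|curl|wget|libwww|httpunit|nutch) 1;
}

server {
    ...

    # $block_bot is "1" only when the User-Agent matched the map above
    if ($block_bot) {
        return 403;
    }
}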

After changes, validate and reload Nginx:

sudo nginx -t && sudo systemctl reload nginx

🔍 Test Your Configuration

Simulate a Scrapy request using curl:

curl -A "Scrapy/2.9.0 (+https://scrapy.org)" -I https://systemadministration.net

Expected result:

HTTP/1.1 403 Forbidden
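To confirm that normal visitors are unaffected, repeat the request with a typical browser User-Agent (any common string works; the one below is illustrative) and check that you no longer get a 403:

curl -A "Mozilla/5.0 (X11; Linux x86_64)" -I https://systemadministration.net

You should see your site's usual status code, such as HTTP/1.1 200 OK.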
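🤖 Verifying Legitimate Googlebot Requests

Some scrapers spoof the Googlebot User-Agent precisely to slip past filters like the ones above, so the header alone should never be trusted. Google's documented verification method is a reverse DNS lookup on the requesting IP, followed by a forward lookup on the returned hostname: the hostname must end in googlebot.com or google.com and must resolve back to the same IP. A quick manual check from the shell (66.249.66.1 is shown purely as an illustrative Googlebot address):

# Reverse lookup: the pointer record should end in googlebot.com or google.com
host 66.249.66.1

# Forward-confirm: the returned hostname must resolve back to the original IP
host crawl-66-249-66-1.googlebot.com

If either step fails, the client is not a genuine Googlebot and can be blocked safely.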

🛡️ Additional Tips

  • Cloudflare WAF: Block suspicious User-Agents before they reach your infrastructure.
  • Rate Limiting: Throttle IPs with excessive requests per second (a minimal Nginx sketch follows this list).
  • iptables/firewalld: Block entire IP ranges when persistent abuse is detected (example after this list).
  • ModSecurity (Apache/Nginx): Use OWASP Core Rule Set for broader filtering.
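For rate limiting in Nginx, the stock limit_req module covers the basic per-IP case; the zone name and limits below are illustrative, so tune them to your real traffic:

# In the http {} block: track clients by IP, allow a sustained 10 requests/second
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

# In a server {} or location {} block: permit short bursts, reject the excess
limit_req zone=perip burst=20 nodelay;

And when a range keeps abusing you, dropping it at the firewall is cheaper than any HTTP-level rule (203.0.113.0/24 is a documentation range; substitute the real offender):

sudo iptables -A INPUT -s 203.0.113.0/24 -j DROP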

✅ Conclusion

Blocking abusive scraping bots is critical to keeping your infrastructure secure and efficient. Whether you use OpenLiteSpeed, Apache, or Nginx, these server-level rules provide powerful protection. Combine them with observability, threat intelligence, and CDN-level security for even stronger resilience.
