Build a production-ready web scraping service that automates browsers, captures screenshots, extracts data, and handles rate limiting. This cookbook demonstrates how to create a scraping service using HopX’s desktop automation features.
Overview
Web scraping services automate browser interactions to extract data from websites. The service built in this cookbook runs inside a HopX desktop sandbox, where it navigates pages, captures screenshots, records interactions, and applies ethical scraping practices such as rate limiting.
Prerequisites
- HopX API key (Get one here)
- Python 3.8+ or Node.js 16+
- Understanding of web scraping
- Basic knowledge of browser automation
Architecture
```
┌──────────────┐
│   Scraping   │  Define targets
│   Request    │
└──────┬───────┘
       │
       ▼
┌─────────────────┐
│   Scraping      │  Navigate, extract
│   Service       │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│   Desktop       │  Browser automation
│   Sandbox       │
└─────────────────┘
```
Implementation
Step 1: Basic Web Scraping
Scrape websites using code execution:
```python
from hopx_ai import Sandbox
import os
from typing import Any, Dict


class WebScrapingService:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.sandbox = None

    def initialize(self):
        """Initialize the scraping sandbox."""
        self.sandbox = Sandbox.create(
            template="desktop",  # Desktop template for browser automation
            api_key=self.api_key,
            timeout_seconds=600
        )
        # Install scraping libraries inside the sandbox
        self.sandbox.commands.run("pip install requests beautifulsoup4", timeout=60)

    def scrape_url(self, url: str, selector: str = None) -> Dict[str, Any]:
        """Scrape data from a URL, optionally filtered by a CSS selector."""
        try:
            # {url!r} and {selector!r} embed the values as Python literals,
            # so a None selector stays None instead of becoming the truthy
            # string 'None'
            scraping_code = f"""
import requests
from bs4 import BeautifulSoup

response = requests.get({url!r}, timeout=20)
response.raise_for_status()  # fail fast on HTTP errors
soup = BeautifulSoup(response.content, 'html.parser')

selector = {selector!r}
if selector:
    elements = soup.select(selector)
    data = [elem.get_text(strip=True) for elem in elements]
else:
    data = soup.get_text()
print(data)
"""
            result = self.sandbox.run_code(scraping_code, timeout=30)
            return {
                "success": result.success,
                "data": result.stdout,
                "url": url
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

    def take_screenshot(self, url: str) -> Dict[str, Any]:
        """Take a screenshot of a webpage."""
        try:
            # Load the page; full navigation would use desktop automation,
            # but plain code execution is enough for this step
            code = f"""
import requests

response = requests.get({url!r}, timeout=20)
print(f"Page loaded: {{len(response.content)}} bytes")
"""
            result = self.sandbox.run_code(code, timeout=30)

            # Capture the sandbox display using the desktop features
            screenshot = self.sandbox.desktop.screenshot()
            return {
                "success": True,
                "screenshot_size": len(screenshot),
                "url": url
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

    def cleanup(self):
        """Clean up scraping resources."""
        if self.sandbox:
            self.sandbox.kill()
            self.sandbox = None


# Usage
service = WebScrapingService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize()

result = service.scrape_url("https://example.com", selector="h1")
print(result)

service.cleanup()
```
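Because `run_code` returns output as text on stdout, a convenient pattern for structured data is to print JSON inside the sandbox and decode it on the client. The `scrape_links` helper below is a hypothetical extension of the service above, not part of the HopX SDK:

```python
import json

def scrape_links(service: WebScrapingService, url: str) -> list:
    """Hypothetical helper: extract anchor text and hrefs as structured records."""
    code = f"""
import json
import requests
from bs4 import BeautifulSoup

response = requests.get({url!r}, timeout=20)
soup = BeautifulSoup(response.content, 'html.parser')

links = [
    {{"text": a.get_text(strip=True), "href": a.get("href")}}
    for a in soup.find_all("a", href=True)
]
print(json.dumps(links))  # JSON round-trips cleanly through stdout
"""
    result = service.sandbox.run_code(code, timeout=30)
    return json.loads(result.stdout) if result.success else []

# Example: collect every link on the page as a list of dicts
# (run before service.cleanup())
links = scrape_links(service, "https://example.com")
print(links[:5])
```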
Best Practices
Always respect robots.txt, implement rate limiting, and follow website terms of service when scraping.
- Rate Limiting: insert delays between requests so you do not overload target servers (see the sketch after this list)
- Respect robots.txt: check each site's robots.txt and honor its rules before fetching
- Error Handling: handle network failures and non-200 responses gracefully instead of crashing the workflow
- Data Extraction: use precise CSS selectors and validate the parsed output before storing it
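Here is a minimal sketch of the first two practices, assuming the `WebScrapingService` from Step 1. It uses only the Python standard library; the `PoliteScraper` name, two-second default delay, and `*` user agent are illustrative choices, not part of the HopX API:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteScraper:
    """Checks robots.txt and enforces a minimum delay between requests."""

    def __init__(self, service: WebScrapingService, delay_seconds: float = 2.0):
        self.service = service
        self.delay_seconds = delay_seconds  # illustrative default
        self.last_request_at = 0.0
        self.robot_parsers = {}  # cache one parser per host

    def _allowed(self, url: str) -> bool:
        host = urlparse(url).netloc
        if host not in self.robot_parsers:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{host}/robots.txt")
            parser.read()  # fetch and parse robots.txt once per host
            self.robot_parsers[host] = parser
        return self.robot_parsers[host].can_fetch("*", url)

    def scrape(self, url: str, selector: str = None):
        if not self._allowed(url):
            return {"success": False, "error": f"Blocked by robots.txt: {url}"}
        # Keep at least delay_seconds between consecutive requests
        wait = self.delay_seconds - (time.monotonic() - self.last_request_at)
        if wait > 0:
            time.sleep(wait)
        self.last_request_at = time.monotonic()
        return self.service.scrape_url(url, selector)

# Usage: wrap the service instead of calling scrape_url directly
polite = PoliteScraper(service)
print(polite.scrape("https://example.com", selector="h1"))
```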
Next Steps
- Implement browser automation with desktop features
- Add screenshot and recording capabilities
- Create data extraction workflows
- Implement rate limiting and ethics
- Add proxy support (see the sketch after this list)
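`requests` accepts per-request proxies through its standard `proxies` argument, so proxy support mostly amounts to threading that argument into the generated scraping code. A rough sketch; `proxy.example.com:8080` is a placeholder, not a working endpoint:

```python
import requests

# Placeholder endpoint; substitute your own proxy or rotation pool
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=20)
print(response.status_code)
```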