Build a production-ready web scraping service that automates browsers, captures screenshots, extracts data, and handles rate limiting. This cookbook demonstrates how to create a scraping service using HopX’s desktop automation features.

Overview

Web scraping services automate browser interactions to extract data from websites. The service built in this cookbook runs inside a HopX desktop sandbox, captures screenshots, records interactions, and applies ethical scraping practices such as rate limiting.

Prerequisites

  • HopX API key (Get one here)
  • Python 3.8+ or Node.js 16+
  • Understanding of web scraping
  • Basic knowledge of browser automation

Architecture

┌──────────────┐
│  Scraping    │ Define targets
│   Request    │
└──────┬───────┘
       │
       ▼
┌─────────────────┐
│  Scraping       │ Navigate, extract
│   Service       │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│  Desktop        │ Browser automation
│  Sandbox        │
└─────────────────┘

Implementation

Step 1: Basic Web Scraping

Scrape websites using code execution:
from hopx_ai import Sandbox
import os
from typing import Dict, Any, Optional

class WebScrapingService:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.sandbox = None
    
    def initialize(self):
        """Initialize scraping sandbox"""
        self.sandbox = Sandbox.create(
            template="desktop",  # Desktop template for browser automation
            api_key=self.api_key,
            timeout_seconds=600
        )
        
        # Install scraping libraries
        self.sandbox.commands.run("pip install requests beautifulsoup4", timeout=60)
    
    def scrape_url(self, url: str, selector: Optional[str] = None) -> Dict[str, Any]:
        """Scrape data from URL"""
        try:
            scraping_code = f"""
import requests
from bs4 import BeautifulSoup

response = requests.get({url!r}, timeout=20)
soup = BeautifulSoup(response.content, 'html.parser')

selector = {selector!r}
if selector:
    elements = soup.select(selector)
    data = [elem.get_text() for elem in elements]
else:
    data = soup.get_text()

print(data)
"""
            
            result = self.sandbox.run_code(scraping_code, timeout=30)
            
            return {
                "success": result.success,
                "data": result.stdout,
                "url": url
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }
    
    def take_screenshot(self, url: str) -> Dict[str, Any]:
        """Take screenshot of webpage"""
        try:
            # Placeholder: fetch the page with requests to confirm the URL is
            # reachable. Full navigation would open the URL in the sandbox's
            # browser via desktop automation before capturing the rendered page.
            code = f"""
import requests
response = requests.get({url!r}, timeout=20)
print(f"Page loaded: {{len(response.content)}} bytes")
"""
            
            result = self.sandbox.run_code(code, timeout=30)
            
            # Capture the current sandbox desktop using the desktop features
            screenshot = self.sandbox.desktop.screenshot()
            
            return {
                "success": True,
                "screenshot_size": len(screenshot),
                "url": url
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }
    
    def cleanup(self):
        """Clean up scraping resources"""
        if self.sandbox:
            self.sandbox.kill()
            self.sandbox = None

# Usage
service = WebScrapingService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize()

result = service.scrape_url("https://example.com", selector="h1")
print(result)

service.cleanup()
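
To scrape multiple pages with the same service, reuse the sandbox and pause between requests. A minimal sketch (the URL list and two-second delay are illustrative, not fixed values):

import os
import time

urls = ["https://example.com", "https://example.org"]

service = WebScrapingService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize()

try:
    for url in urls:
        result = service.scrape_url(url, selector="h1")
        print(result)
        time.sleep(2)  # polite delay between requests
finally:
    service.cleanup()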

Best Practices

Always respect robots.txt, implement rate limiting, and follow website terms of service when scraping.
  1. Rate Limiting: Add delays between requests to the same host (see the sketch after this list)
  2. Respect robots.txt: Check and follow each site's robots.txt rules before fetching
  3. Error Handling: Handle network errors and timeouts gracefully
  4. Data Extraction: Use precise selectors and robust parsing
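
Items 1 and 2 can be enforced before each request. The sketch below (the PoliteFetcher class, its user agent, and its two-second default are illustrative, not part of the HopX SDK) checks robots.txt with Python's urllib.robotparser and spaces out requests to the same host:

import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Checks robots.txt and enforces a minimum delay per host."""

    def __init__(self, user_agent: str = "MyScraper/1.0", min_delay: float = 2.0):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._robots = {}        # host -> RobotFileParser
        self._last_request = {}  # host -> timestamp of last request

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt permits fetching this URL."""
        host = urlparse(url).netloc
        if host not in self._robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{host}/robots.txt")
            try:
                parser.read()
            except OSError:
                return False  # robots.txt unreachable: err on the side of caution
            self._robots[host] = parser
        return self._robots[host].can_fetch(self.user_agent, url)

    def wait(self, url: str) -> None:
        """Sleep just long enough to respect the per-host minimum delay."""
        host = urlparse(url).netloc
        elapsed = time.time() - self._last_request.get(host, 0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request[host] = time.time()

A scraper would call fetcher.allowed(url) before fetching and fetcher.wait(url) to pace requests; the two-second default is a conservative starting point that you can tune per site.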

Next Steps

  1. Implement browser automation with desktop features
  2. Add screenshot and recording capabilities
  3. Create data extraction workflows
  4. Implement rate limiting and ethics
  5. Add proxy support (see the sketch below)
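
For step 5, one common approach (an illustrative sketch, not a built-in HopX feature; the proxy URL is a placeholder) is to route the requests calls executed inside the sandbox through a proxy using the standard proxies argument:

# Snippet to execute inside the sandbox via run_code; replace the placeholder
# proxy URL with a real proxy before use.
proxy_code = """
import requests

proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=20)
print(response.status_code)
"""

# With an initialized service:
# result = service.sandbox.run_code(proxy_code, timeout=30)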