Build a production-ready web scraping service that automates browsers, captures screenshots, extracts data, and handles rate limiting. This cookbook demonstrates how to create a scraping service using HopX’s desktop automation features.
Overview
Web scraping services automate browser interactions to extract data from websites. The service built in this cookbook runs inside a HopX desktop sandbox, where it navigates pages, captures screenshots, records interactions, and applies ethical scraping practices such as rate limiting.
Prerequisites
- HopX API key (Get one here)
- Python 3.8+ or Node.js 16+
- Understanding of web scraping
- Basic knowledge of browser automation
Architecture
```
┌──────────────┐
│   Scraping   │  Define targets
│   Request    │
└──────┬───────┘
       │
       ▼
┌─────────────────┐
│   Scraping      │  Navigate, extract
│   Service       │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│   Desktop       │  Browser automation
│   Sandbox       │
└─────────────────┘
```
Implementation
Step 1: Basic Web Scraping
Scrape websites using code execution:
```python
from hopx_ai import Sandbox
import os
from typing import Any, Dict


class WebScrapingService:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.sandbox = None

    def initialize(self):
        """Initialize the scraping sandbox."""
        self.sandbox = Sandbox.create(
            template="desktop",  # Desktop template for browser automation
            api_key=self.api_key,
            timeout_seconds=600
        )
        # Install scraping libraries inside the sandbox
        self.sandbox.commands.run("pip install requests beautifulsoup4", timeout=60)

    def scrape_url(self, url: str, selector: str = None) -> Dict[str, Any]:
        """Scrape data from a URL, optionally filtered by a CSS selector."""
        try:
            # {url!r} and {selector!r} embed the values as Python literals,
            # so a None selector stays None instead of becoming the truthy
            # string 'None'
            scraping_code = f"""
import requests
from bs4 import BeautifulSoup

response = requests.get({url!r}, timeout=20)
response.raise_for_status()  # fail fast on HTTP errors
soup = BeautifulSoup(response.content, 'html.parser')

selector = {selector!r}
if selector:
    elements = soup.select(selector)
    data = [elem.get_text(strip=True) for elem in elements]
else:
    data = soup.get_text()
print(data)
"""
            result = self.sandbox.run_code(scraping_code, timeout=30)
            return {
                "success": result.success,
                "data": result.stdout,
                "url": url
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

    def take_screenshot(self, url: str) -> Dict[str, Any]:
        """Take a screenshot of a webpage."""
        try:
            # Load the page; full navigation would use desktop automation,
            # but plain code execution is enough for this step
            code = f"""
import requests

response = requests.get({url!r}, timeout=20)
print(f"Page loaded: {{len(response.content)}} bytes")
"""
            result = self.sandbox.run_code(code, timeout=30)

            # Capture the sandbox display using the desktop features
            screenshot = self.sandbox.desktop.screenshot()
            return {
                "success": True,
                "screenshot_size": len(screenshot),
                "url": url
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

    def cleanup(self):
        """Clean up scraping resources."""
        if self.sandbox:
            self.sandbox.kill()
            self.sandbox = None


# Usage
service = WebScrapingService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize()

result = service.scrape_url("https://example.com", selector="h1")
print(result)

service.cleanup()
```
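Because `run_code` returns output as text on stdout, a convenient pattern for structured data is to print JSON inside the sandbox and decode it on the client. The `scrape_links` helper below is a hypothetical extension of the service above, not part of the HopX SDK:

```python
import json

def scrape_links(service: WebScrapingService, url: str) -> list:
    """Hypothetical helper: extract anchor text and hrefs as structured records."""
    code = f"""
import json
import requests
from bs4 import BeautifulSoup

response = requests.get({url!r}, timeout=20)
soup = BeautifulSoup(response.content, 'html.parser')

links = [
    {{"text": a.get_text(strip=True), "href": a.get("href")}}
    for a in soup.find_all("a", href=True)
]
print(json.dumps(links))  # JSON round-trips cleanly through stdout
"""
    result = service.sandbox.run_code(code, timeout=30)
    return json.loads(result.stdout) if result.success else []

# Example: collect every link on the page as a list of dicts
# (run before service.cleanup())
links = scrape_links(service, "https://example.com")
print(links[:5])
```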
Best Practices
Always respect robots.txt, implement rate limiting, and follow website terms of service when scraping.
- Rate Limiting: insert delays between requests so you do not overload target servers (see the sketch after this list)
- Respect robots.txt: check each site's robots.txt and honor its rules before fetching
- Error Handling: handle network failures and non-200 responses gracefully instead of crashing the workflow
- Data Extraction: use precise CSS selectors and validate the parsed output before storing it
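Here is a minimal sketch of the first two practices, assuming the `WebScrapingService` from Step 1. It uses only the Python standard library; the `PoliteScraper` name, two-second default delay, and `*` user agent are illustrative choices, not part of the HopX API:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteScraper:
    """Checks robots.txt and enforces a minimum delay between requests."""

    def __init__(self, service: WebScrapingService, delay_seconds: float = 2.0):
        self.service = service
        self.delay_seconds = delay_seconds  # illustrative default
        self.last_request_at = 0.0
        self.robot_parsers = {}  # cache one parser per host

    def _allowed(self, url: str) -> bool:
        host = urlparse(url).netloc
        if host not in self.robot_parsers:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{host}/robots.txt")
            parser.read()  # fetch and parse robots.txt once per host
            self.robot_parsers[host] = parser
        return self.robot_parsers[host].can_fetch("*", url)

    def scrape(self, url: str, selector: str = None):
        if not self._allowed(url):
            return {"success": False, "error": f"Blocked by robots.txt: {url}"}
        # Keep at least delay_seconds between consecutive requests
        wait = self.delay_seconds - (time.monotonic() - self.last_request_at)
        if wait > 0:
            time.sleep(wait)
        self.last_request_at = time.monotonic()
        return self.service.scrape_url(url, selector)

# Usage: wrap the service instead of calling scrape_url directly
polite = PoliteScraper(service)
print(polite.scrape("https://example.com", selector="h1"))
```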
Next Steps
- Implement browser automation with desktop features
- Add screenshot and recording capabilities
- Create data extraction workflows
- Implement rate limiting and ethics
- Add proxy support (see the sketch after this list)
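`requests` accepts per-request proxies through its standard `proxies` argument, so proxy support mostly amounts to threading that argument into the generated scraping code. A rough sketch; `proxy.example.com:8080` is a placeholder, not a working endpoint:

```python
import requests

# Placeholder endpoint; substitute your own proxy or rotation pool
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=20)
print(response.status_code)
```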