
# Web Scraping Service

> Build a browser automation and scraping service with desktop features, screenshot capture, data extraction workflows, and rate limiting

This cookbook shows how to build a web scraping service on HopX's desktop sandbox. The service fetches pages, extracts data with CSS selectors, and captures screenshots. Rate limiting and full browser automation are covered as best practices and next steps rather than implemented in the code below.

## Overview

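A web scraping service automates page fetching and browser interactions to extract data from websites. The service in this cookbook runs inside a HopX desktop sandbox, extracts data with `requests` and BeautifulSoup, and captures screenshots. A complete service would also record interactions and apply rate limiting as part of ethical scraping.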

## Prerequisites

* HopX API key ([Get one here](https://console.hopx.dev/api-keys))
* Python 3.8+ or Node.js 16+
* Basic understanding of web scraping (HTTP requests, HTML parsing, CSS selectors)
* Basic knowledge of browser automation

## Architecture

```
┌──────────────┐
│  Scraping    │ Define targets
│   Request    │
└──────┬───────┘
       │
       ▼
┌─────────────────┐
│  Scraping       │ Navigate, extract
│   Service       │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│  Desktop        │ Browser automation
│  Sandbox        │
└─────────────────┘
```

## Implementation

### Step 1: Basic Web Scraping

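The service below creates a desktop sandbox, installs `requests` and `beautifulsoup4` in it, and runs the scraping code inside the sandbox through code execution: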

<CodeGroup>
  ```python Python theme={null}
  from hopx_ai import Sandbox
  import os
  from typing import Any, Dict, Optional

  class WebScrapingService:
      def __init__(self, api_key: str):
          self.api_key = api_key
          self.sandbox = None
      
      def initialize(self):
          """Initialize scraping sandbox"""
          self.sandbox = Sandbox.create(
              template="desktop",  # Desktop template for browser automation
              api_key=self.api_key,
              timeout_seconds=600
          )
          
          # Install scraping libraries
          self.sandbox.commands.run("pip install requests beautifulsoup4", timeout=60)
      
      def scrape_url(self, url: str, selector: Optional[str] = None) -> Dict[str, Any]:
          """Scrape data from URL"""
          try:
              # {value!r} embeds each value as a Python literal, so a missing
              # selector becomes None (not the truthy string 'None') and quotes
              # in the URL or selector cannot break the generated code
              scraping_code = f"""
  import requests
  from bs4 import BeautifulSoup

  url = {url!r}
  selector = {selector!r}

  response = requests.get(url, timeout=20)
  response.raise_for_status()
  soup = BeautifulSoup(response.content, 'html.parser')

  if selector:
      # Collect the text of every element matching the CSS selector
      elements = soup.select(selector)
      data = [elem.get_text() for elem in elements]
  else:
      # No selector given: return the text of the whole page
      data = soup.get_text()

  print(data)
  """
              
              result = self.sandbox.run_code(scraping_code, timeout=30)
              
              return {
                  "success": result.success,
                  "data": result.stdout,
                  "url": url
              }
          except Exception as e:
              return {
                  "success": False,
                  "error": str(e)
              }
      
      def take_screenshot(self, url: str) -> Dict[str, Any]:
          """Take screenshot of webpage"""
          try:
              # This only checks that the URL is reachable from the sandbox.
              # It does not open the page in a browser, so the screenshot
              # below captures whatever is currently on the desktop.
              code = f"""
  import requests
  response = requests.get({url!r}, timeout=20)
  print(f"Page loaded: {{len(response.content)}} bytes")
  """
              
              result = self.sandbox.run_code(code, timeout=30)
              
              # Take screenshot using desktop features
              screenshot = self.sandbox.desktop.screenshot()
              
              return {
                  # Report failure if the reachability check failed
                  "success": result.success,
                  "screenshot_size": len(screenshot),
                  "url": url
              }
          except Exception as e:
              return {
                  "success": False,
                  "error": str(e)
              }
      
      def cleanup(self):
          """Clean up scraping resources"""
          if self.sandbox:
              self.sandbox.kill()
              self.sandbox = None

  # Usage
  service = WebScrapingService(api_key=os.getenv("HOPX_API_KEY"))
  service.initialize()

  result = service.scrape_url("https://example.com", selector="h1")
  print(result)

  service.cleanup()
  ```

  ```javascript JavaScript theme={null}
  import { Sandbox } from '@hopx-ai/sdk';

  class WebScrapingService {
      constructor(apiKey) {
          this.apiKey = apiKey;
          this.sandbox = null;
      }
      
      async initialize() {
          this.sandbox = await Sandbox.create({
              template: 'desktop',  // Desktop template for browser automation
              apiKey: this.apiKey,
              timeoutSeconds: 600
          });
          
          // Install scraping libraries
          await this.sandbox.commands.run('pip install requests beautifulsoup4', { timeout: 60 });
      }
      
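      async scrapeUrl(url, selector = null) {
          try {
              // JSON.stringify emits string literals that Python also accepts,
              // so quotes in the URL or selector cannot break the generated code.
              // A missing selector becomes None, not the truthy string 'null'.
              const urlLiteral = JSON.stringify(url);
              const selectorLiteral = selector == null ? 'None' : JSON.stringify(selector);
              const scrapingCode = `
  import requests
  from bs4 import BeautifulSoup

  url = ${urlLiteral}
  selector = ${selectorLiteral}

  response = requests.get(url, timeout=20)
  response.raise_for_status()
  soup = BeautifulSoup(response.content, 'html.parser')

  if selector:
      # Collect the text of every element matching the CSS selector
      elements = soup.select(selector)
      data = [elem.get_text() for elem in elements]
  else:
      # No selector given: return the text of the whole page
      data = soup.get_text()

  print(data)
  `;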
              
              const result = await this.sandbox.runCode(scrapingCode, { timeout: 30 });
              
              return {
                  success: result.success,
                  data: result.stdout,
                  url
              };
          } catch (error) {
              return {
                  success: false,
                  error: error.message
              };
          }
      }
      
      async takeScreenshot(url) {
          try {
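              // This only checks that the URL is reachable from the sandbox.
              // It does not open the page in a browser, so the screenshot
              // below captures whatever is currently on the desktop.
              const code = `
  import requests
  response = requests.get(${JSON.stringify(url)}, timeout=20)
  print(f"Page loaded: {len(response.content)} bytes")
  `;
              
              const result = await this.sandbox.runCode(code, { timeout: 30 });
              
              // Take screenshot using desktop features
              const screenshot = await this.sandbox.desktop.screenshot();
              
              return {
                  // Report failure if the reachability check failed
                  success: result.success,
                  screenshotSize: screenshot.length,
                  url
              };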
          } catch (error) {
              return {
                  success: false,
                  error: error.message
              };
          }
      }
      
      async cleanup() {
          if (this.sandbox) {
              await this.sandbox.kill();
              this.sandbox = null;
          }
      }
  }

  // Usage
  const service = new WebScrapingService(process.env.HOPX_API_KEY);
  await service.initialize();

  const result = await service.scrapeUrl('https://example.com', 'h1');
  console.log(result);

  await service.cleanup();
  ```
</CodeGroup>
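
`scrape_url` returns whatever the sandbox printed, so a selector match arrives as the text form of a Python list rather than as a list. The helper below is a minimal sketch of turning that output back into data on the caller's side. `parse_scrape_output` is a hypothetical name, not part of the HopX SDK, and the sketch assumes `result.stdout` is a plain string containing only the output of the final `print(data)`.

```python
import ast
from typing import Any, Dict, List, Optional, Union


def parse_scrape_output(result: Dict[str, Any]) -> Optional[Union[List[str], str]]:
    """Convert the dict returned by scrape_url into Python data.

    Hypothetical helper: returns a list of strings when a selector was used,
    the raw page text when it was not, and None when the scrape failed.
    """
    if not result.get("success"):
        return None

    stdout = (result.get("data") or "").strip()
    try:
        # print(list_of_str) emits a Python list literal; literal_eval
        # parses literals only and never executes code
        value = ast.literal_eval(stdout)
    except (ValueError, SyntaxError):
        # Not a Python literal: this is plain page text (no selector given)
        return stdout

    if isinstance(value, list):
        return [str(item).strip() for item in value]
    # A literal that is not a list (e.g. a page whose text is "42")
    return stdout
```

With the usage example above, `parse_scrape_output(result)` would be called on the dict returned by `service.scrape_url(...)`. Printing JSON from the sandbox and decoding it with `json.loads` is an alternative that avoids depending on Python's list formatting.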

## Best Practices

<Warning>
  Always respect robots.txt, implement rate limiting, and follow website terms of service when scraping.
</Warning>

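1. **Rate Limiting**: Add a delay between requests to the same host so the scraper does not overload the site
2. **Respect robots.txt**: Fetch each site's robots.txt and skip any URL it disallows for your user agent
3. **Error Handling**: Set request timeouts, check HTTP status codes, and retry transient network errors with backoff
4. **Data Extraction**: Use specific CSS selectors and handle the case where a selector matches nothing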
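
The first three practices can be sketched with the Python standard library alone. The block below is an illustration under stated assumptions, not part of the HopX SDK: `PoliteScheduler`, `fetch_with_retry`, and the `USER_AGENT` value are hypothetical names, and the `fetch` argument stands in for whatever function performs the request (for example `service.scrape_url`). The clock and sleep functions are injectable so the logic can be exercised without real delays.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "example-scraper"  # hypothetical user agent name


class PoliteScheduler:
    """Per-host minimum delay plus a robots.txt check (illustrative sketch)."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the last request
        self._robots = {}        # host -> parsed robots.txt rules

    def load_robots(self, host, robots_txt):
        """Parse robots.txt content that the caller has already fetched."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def allowed(self, url):
        """Return True if the loaded rules permit fetching this URL."""
        parser = self._robots.get(urlparse(url).netloc)
        # Assumption: hosts with no rules loaded are treated as allowed
        return parser is None or parser.can_fetch(USER_AGENT, url)

    def wait(self, url):
        """Sleep until min_delay has passed since the last request to this host."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A caller would check `scheduler.allowed(url)`, call `scheduler.wait(url)`, and only then pass the URL to `fetch_with_retry`. Note that `scrape_url` as written returns an error dict instead of raising, so a wrapper that raises on `success: False` would be needed for the retry to trigger.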

## Related Cookbooks

* [API Testing Platform](/cookbooks/automation/api-testing-platform) - Testing automation

## Next Steps

1. Implement browser automation with desktop features
2. Add screenshot and recording capabilities
3. Create data extraction workflows
4. Implement rate limiting and robots.txt checks
5. Add proxy support
