Build a production-ready cloud Jupyter notebook service that executes notebooks, renders rich outputs, and supports data science workflows. This cookbook demonstrates how to create a service similar to Kaggle or Google Colab using HopX.

Overview

Cloud Jupyter notebook services allow data scientists to run notebooks in the cloud without local setup. The service executes notebook cells, captures rich outputs (plots, DataFrames), handles large datasets, and supports model-training workflows. HopX provides the secure execution environment needed for this use case.

Prerequisites

  • HopX API key
  • Python 3.8+ or Node.js 16+
  • Understanding of Jupyter notebook format
  • Basic knowledge of data science workflows

Architecture

┌──────────────┐
│   Notebook   │ Cell execution requests
│     UI       │
└──────┬───────┘
       │
       ▼
┌─────────────────┐
│  Notebook       │ Parse, execute, capture
│    Service      │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│  HopX Sandbox   │ Secure execution
└──────┬──────────┘
       │
       ▼
┌─────────────────┐
│  Rich Outputs   │ Plots, DataFrames, HTML
└─────────────────┘

Implementation

Step 1: Notebook Cell Execution

Execute individual notebook cells and capture outputs:

from hopx_ai import Sandbox
import json
import os
from typing import Any, Dict, List, Optional

class JupyterNotebookService:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.sandbox = None
        self.cell_state = {}  # Per-cell results; variables persist in the kernel itself
    
    def initialize_notebook(self, notebook_id: str) -> Dict[str, Any]:
        """Initialize a new notebook session"""
        try:
            self.sandbox = Sandbox.create(
                template="code-interpreter",
                api_key=self.api_key,
                timeout_seconds=3600  # 1 hour session
            )
            
            # Set up data science environment
            self.sandbox.env.set_all({
                "JUPYTER_MODE": "true",
                "PYTHONPATH": "/workspace"
            })
            
            return {
                "success": True,
                "notebook_id": notebook_id,
                "sandbox_id": self.sandbox.sandbox_id
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }
    
    def execute_cell(self, cell_code: str, cell_id: Optional[str] = None) -> Dict[str, Any]:
        """Execute a notebook cell"""
        try:
            # Use IPython execution for notebook-like behavior
            result = self.sandbox.run_ipython(cell_code)
            
            # Capture outputs
            outputs = []
            if result.rich_outputs:
                for output in result.rich_outputs:
                    outputs.append({
                        "type": output.type,
                        "data": output.data
                    })
            
            # Store cell state
            if cell_id:
                self.cell_state[cell_id] = {
                    "code": cell_code,
                    "outputs": outputs,
                    "stdout": result.stdout,
                    "stderr": result.stderr
                }
            
            return {
                "success": result.success,
                "stdout": result.stdout,
                "stderr": result.stderr,
                "outputs": outputs,
                "output_count": len(outputs),
                "execution_time": result.execution_time
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "stderr": str(e)
            }
    
    def execute_notebook(self, notebook_json: Dict) -> Dict[str, Any]:
        """Execute entire notebook"""
        cells = notebook_json.get("cells", [])
        results = []
        
        for i, cell in enumerate(cells):
            if cell.get("cell_type") != "code":
                continue
            
            source = "".join(cell.get("source", []))
            cell_result = self.execute_cell(source, cell_id=f"cell_{i}")
            
            results.append({
                "cell_index": i,
                "result": cell_result
            })
        
        return {
            "success": all(r["result"].get("success", False) for r in results),
            "cells_executed": len(results),
            "results": results
        }
    
    def cleanup(self):
        """Clean up notebook session"""
        if self.sandbox:
            self.sandbox.kill()
            self.sandbox = None

# Usage
service = JupyterNotebookService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize_notebook("my-notebook")

# Execute a cell
result = service.execute_cell("""
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x': np.random.rand(10),
    'y': np.random.rand(10)
})

df  # Display dataframe
""")

print(f"Captured {result['output_count']} outputs")
print(result)

service.cleanup()
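
Because run_ipython executes cells in a shared kernel session, variables defined in one cell stay visible to later cells. A quick check of that behavior, as a minimal sketch reusing the service above:

service = JupyterNotebookService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize_notebook("state-demo")

# Cell 1 defines a variable in the kernel
service.execute_cell("total = 40 + 2")

# Cell 2 reads it back; stdout should contain "42" if kernel state persists
result = service.execute_cell("print(total)")
print(result.get("stdout"))

service.cleanup()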

Step 2: Rich Output Rendering

Handle plots, DataFrames, and other rich outputs:

class RichOutputRenderer:
    def __init__(self, sandbox: Sandbox):
        self.sandbox = sandbox
    
    def render_outputs(self, outputs: List[Dict]) -> List[Dict[str, Any]]:
        """Render rich outputs for display"""
        rendered = []
        
        for output in outputs:
            output_type = output.get("type", "")
            data = output.get("data", {})
            
            if "image/png" in output_type or "image/jpeg" in output_type:
                # Image output
                rendered.append({
                    "type": "image",
                    "format": "png" if "png" in output_type else "jpeg",
                    "data": data.get("image/png") or data.get("image/jpeg"),
                    "encoding": "base64"
                })
            
            elif "text/html" in output_type:
                # HTML output (DataFrames, etc.)
                rendered.append({
                    "type": "html",
                    "data": data.get("text/html", ""),
                    "mime_type": "text/html"
                })
            
            elif "application/json" in output_type:
                # JSON output
                rendered.append({
                    "type": "json",
                    "data": data.get("application/json", {}),
                    "mime_type": "application/json"
                })
            
            else:
                # Plain text fallback
                text = data.get("text/plain", str(data)) if isinstance(data, dict) else str(data)
                rendered.append({
                    "type": "text",
                    "data": text,
                    "mime_type": "text/plain"
                })
        
        return rendered

# Usage
service = JupyterNotebookService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize_notebook("notebook-1")

# Execute cell with plot
result = service.execute_cell("""
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.title('Sine Wave')
plt.show()
""")

# Render outputs
renderer = RichOutputRenderer(service.sandbox)
rendered = renderer.render_outputs(result["outputs"])

for output in rendered:
    print(f"Output type: {output['type']}")
    if output['type'] == 'image':
        print(f"  Image data length: {len(output['data'])} bytes")

service.cleanup()
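
Image outputs arrive base64-encoded, so persisting one to disk is just a decode away. A minimal sketch, assuming the rendered dict shape produced by RichOutputRenderer above:

import base64

for output in rendered:
    if output["type"] == "image":
        # Decode the base64 payload back to raw image bytes
        with open(f"plot.{output['format']}", "wb") as f:
            f.write(base64.b64decode(output["data"]))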

Step 3: Large Dataset Handling

Handle large datasets efficiently:

import re

class LargeDatasetHandler:
    def __init__(self, sandbox: Sandbox):
        self.sandbox = sandbox
    
    def upload_dataset(self, file_path: str, data: bytes) -> Dict[str, Any]:
        """Upload large dataset to sandbox"""
        try:
            # Write the dataset into the sandbox filesystem
            self.sandbox.files.write(f"/workspace/data/{file_path}", data)
            
            return {
                "success": True,
                "path": f"/workspace/data/{file_path}",
                "size": len(data)
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }
    
    def process_large_dataset(self, dataset_path: str, chunk_size: int = 10000) -> Dict[str, Any]:
        """Process large dataset in chunks"""
        try:
            # Process in chunks to avoid memory issues
            code = f"""
import pandas as pd
import os

# Read dataset in chunks
chunk_size = {chunk_size}
dataset_path = '{dataset_path}'

chunks = []
for chunk in pd.read_csv(dataset_path, chunksize=chunk_size):
    # Process chunk
    processed = chunk.describe()
    chunks.append(processed)

# Combine results
result = pd.concat(chunks)
print(f"Processed {{len(chunks)}} chunks")
print(result)
"""
            
            result = self.sandbox.run_code(code, timeout=300)  # 5 minute timeout
            
            # Parse the chunk count from the script's "Processed N chunks" line
            match = re.search(r"Processed (\d+) chunks", result.stdout or "")
            
            return {
                "success": result.success,
                "stdout": result.stdout,
                "chunks_processed": int(match.group(1)) if match else 0
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

# Usage
sandbox = Sandbox.create(template="code-interpreter", api_key=os.getenv("HOPX_API_KEY"))
handler = LargeDatasetHandler(sandbox)

# Upload dataset
with open("large_dataset.csv", "rb") as f:
    data = f.read()
    result = handler.upload_dataset("large_dataset.csv", data)
    print(result)

# Process
result = handler.process_large_dataset("/workspace/data/large_dataset.csv")
print(result)

sandbox.kill()
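
File transfer also works in the other direction: code running in the sandbox can persist results to a file, and files.read pulls them back to the host before teardown. A minimal sketch using only the calls shown above (the path and contents are illustrative):

sandbox = Sandbox.create(template="code-interpreter", api_key=os.getenv("HOPX_API_KEY"))

# Produce a small result file inside the sandbox
sandbox.run_code("""
with open('/workspace/summary.txt', 'w') as f:
    f.write('rows=10000')
""")

# Read it back to the host
summary = sandbox.files.read("/workspace/summary.txt")
print(summary)

sandbox.kill()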

Step 4: Model Training Workflows

Support ML model training:

class ModelTrainingService:
    def __init__(self, sandbox: Sandbox):
        self.sandbox = sandbox
    
    def train_model(self, training_code: str, timeout: int = 1800) -> Dict[str, Any]:
        """Train an ML model in the background, polling until it completes or times out"""
        try:
            # Use background execution for long training
            execution_id = self.sandbox.run_code_background(training_code)
            
            # Monitor progress
            import time
            max_wait = timeout
            waited = 0
            
            while waited < max_wait:
                time.sleep(5)  # Check every 5 seconds
                waited += 5
                
                # Heuristic: assume training has finished once no python/train
                # process is still listed in the sandbox
                processes = self.sandbox.list_processes()
                training_running = any(
                    'python' in p.get('name', '').lower() or
                    'train' in p.get('name', '').lower()
                    for p in processes
                )
                
                if not training_running:
                    # Training completed
                    break
            
            # Check for model file
            if self.sandbox.files.exists("/workspace/model.pkl"):
                model_data = self.sandbox.files.read("/workspace/model.pkl")
                return {
                    "success": True,
                    "model_saved": True,
                    "model_size": len(model_data),
                    "execution_id": execution_id
                }
            else:
                return {
                    "success": True,
                    "model_saved": False,
                    "execution_id": execution_id
                }
                
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

# Usage
sandbox = Sandbox.create(template="code-interpreter", api_key=os.getenv("HOPX_API_KEY"))
trainer = ModelTrainingService(sandbox)

training_code = """
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pickle

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

# Save model
with open('/workspace/model.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model trained and saved!")
"""

result = trainer.train_model(training_code)
print(result)
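
# Optional: pull the pickled model back and load it on the host
# (a sketch; assumes files.read returns raw bytes and a compatible
# scikit-learn version is installed locally)
import pickle
model_bytes = sandbox.files.read("/workspace/model.pkl")
model = pickle.loads(model_bytes)
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))  # sanity check with one iris-style sample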

sandbox.kill()

Best Practices

Performance

Use IPython execution mode for notebook cells to automatically capture rich outputs like DataFrames and plots.

  1. Cell State Management: Maintain kernel state between cells for interactive workflows
  2. Output Caching: Cache rendered outputs to avoid re-rendering unchanged cells (see the sketch after this list)
  3. Chunked Processing: Process large datasets in chunks to bound memory use
  4. Background Execution: Use background execution for long-running training jobs
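
For output caching, one option is to key rendered outputs on a hash of the cell source. A minimal sketch (the class and its shape are illustrative, not part of the HopX SDK):

import hashlib

class OutputCache:
    def __init__(self):
        self._cache = {}

    def _key(self, cell_code: str) -> str:
        return hashlib.sha256(cell_code.encode()).hexdigest()

    def get(self, cell_code: str):
        # Note: in a stateful kernel, identical source can still produce
        # different output, so a real cache should also key on kernel state
        return self._cache.get(self._key(cell_code))

    def put(self, cell_code: str, rendered_outputs):
        self._cache[self._key(cell_code)] = rendered_outputs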

Resource Management

  1. Session Timeouts: Set session timeouts appropriate to the workload
  2. Memory Monitoring: Monitor memory usage when working with large datasets
  3. Cleanup: Always clean up sandboxes, temporary files, and models (see the sketch after this list)
  4. Resource Limits: Set CPU and memory limits based on user tier
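
Cleanup is easiest to guarantee with a context manager around the service from Step 1, so the sandbox is killed even when a cell raises. A sketch:

import os
from contextlib import contextmanager

@contextmanager
def notebook_session(api_key: str):
    service = JupyterNotebookService(api_key=api_key)
    service.initialize_notebook("session")
    try:
        yield service
    finally:
        # Always release the sandbox, even on errors
        service.cleanup()

# Usage
with notebook_session(os.getenv("HOPX_API_KEY")) as service:
    service.execute_cell("print('hello')")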

User Experience

  1. Progress Indicators: Show execution progress for long operations (see the sketch after this list)
  2. Error Messages: Provide clear, actionable error messages
  3. Output Formatting: Format outputs for easy reading
  4. Auto-Save: Auto-save notebook state so work survives disconnects
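
For progress indicators, a simple approach is to run code cells one at a time and invoke a caller-supplied callback after each. A sketch (on_progress is hypothetical, e.g. a websocket push in a real UI):

def execute_notebook_with_progress(service, notebook_json, on_progress):
    cells = [c for c in notebook_json.get("cells", [])
             if c.get("cell_type") == "code"]
    for i, cell in enumerate(cells):
        source = cell.get("source", [])
        if isinstance(source, list):
            source = "".join(source)
        result = service.execute_cell(source, cell_id=f"cell_{i}")
        # Report progress after each cell
        on_progress(i + 1, len(cells), result["success"])

# Usage
execute_notebook_with_progress(service, notebook_json,
    lambda done, total, ok: print(f"[{done}/{total}] {'ok' if ok else 'error'}"))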

Real-World Examples

This pattern is used by:
  • Kaggle Notebooks: Data science competition platform
  • Google Colab: Free Jupyter notebook environment
  • Azure Notebooks: Cloud-based Jupyter service
  • Binder: Turn GitHub repos into interactive notebooks

Next Steps

  1. Implement notebook format parsing (Jupyter .ipynb format; see the sketch below)
  2. Add support for markdown and code cells
  3. Create a web UI for notebook editing
  4. Implement cell execution queue
  5. Add collaboration features
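
As a starting point for the first item, the .ipynb format is plain JSON, so a saved notebook can be loaded with the standard library and fed straight to execute_notebook from Step 1 (the file name is illustrative):

import json
import os

with open("analysis.ipynb") as f:
    notebook_json = json.load(f)

service = JupyterNotebookService(api_key=os.getenv("HOPX_API_KEY"))
service.initialize_notebook("ipynb-demo")
summary = service.execute_notebook(notebook_json)
print(f"Executed {summary['cells_executed']} code cells")
service.cleanup()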