A generic web page scraping microservice with browser automation, featuring browser pooling and configurable site support. Designed to bypass Cloudflare protection and handle dynamic content. Includes comprehensive test coverage with enhanced Cloudflare detection and containerized testing infrastructure.
- Generic Scraping: Configurable selectors for any website
- Browser Pool: Pre-launched browsers for instant responses (3-5 second scraping vs 15+ seconds)
- Site Configurations: Pre-built configs for common sites (MFC, extensible to others)
- Cloudflare Bypass: Real Chromium browsers with fresh sessions per request
- MFC NSFW Authentication: Support for authenticated scraping with user's own session cookies
- Stealth Mode: Anti-detection for authenticated requests (bypasses Cloudflare bot protection)
- Robust Error Handling: Handles timeouts, challenges, and extraction failures
- RESTful API: Simple HTTP interface with both generic and site-specific endpoints
- Docker Ready: Optimized container with all browser dependencies
- Comprehensive Testing: Multi-suite test coverage with Jest, Puppeteer mocking, and containerized test execution
This scraper is designed for personal data management and legitimate collection organization:
✅ Authorized Use Cases:
- Scraping your own user data from websites where you have an account
- Managing personal figure collections with enhanced organization
- Aggregating content you own or have permission to access
- Educational research and personal archival
- Building better UIs for your own data
❌ Prohibited Use Cases:
- Scraping copyrighted content for redistribution
- Bypassing paywalls or authentication for unauthorized access
- Bulk data harvesting for competitive purposes
- Automated scraping that violates a site's Terms of Service
- Any use that could harm the target website or its users
The NSFW authentication feature uses stealth browser technology to bypass Cloudflare's bot detection. This functionality is provided exclusively for users to access their own authenticated content:
- User's Own Data: Only scrape figures visible to the authenticated user
- Personal Use: For organizing and managing the user's own collection
- Session Cookies: User provides their own valid session cookies
- No Credential Storage: Cookies are time-limited bearer tokens, not permanent credentials
- Respects Permissions: User can only access content allowed by their MFC account settings
Privacy Model: Similar to how Plex manages your movie library or Calibre organizes your ebooks - this tool helps you better organize content you legitimately own or have access to.
By using this service, you agree to:
- Only scrape content you have permission to access
- Comply with all applicable Terms of Service
- Respect robots.txt and rate limiting
- Use scraped data only for personal, non-commercial purposes
- Not redistribute scraped copyrighted content
This software is provided for legitimate personal use only. Users are solely responsible for ensuring their use complies with applicable laws and website terms of service.
Generic scraping with custom configuration.
Request Body:
{
"url": "https://example.com/item/123",
"config": {
"imageSelector": ".product-image img",
"manufacturerSelector": ".brand-name",
"nameSelector": ".product-title",
"scaleSelector": ".scale-info",
"waitTime": 2000
}
}
Convenience endpoint for MyFigureCollection (uses pre-built config).
Request Body (Public Content):
{
"url": "https://myfigurecollection.net/item/597971"
}
Request Body (NSFW Content with Authentication):
{
"url": "https://myfigurecollection.net/item/422432",
"config": {
"mfcAuth": {
"sessionCookies": {
"PHPSESSID": "your_session_id",
"sesUID": "your_user_id",
"TBv4_Iden": "your_user_id",
"TBv4_Hash": "your_hash_value"
}
}
}
}
How to Get MFC Session Cookies:
- Log into MyFigureCollection in your browser
- Open DevTools (F12) → Application/Storage → Cookies
- Find the myfigurecollection.net domain
- Copy the four required cookie values
⚠️ Security: Cookies expire (typically monthly); treat them like passwords
Note: NSFW scraping uses stealth browser mode to bypass Cloudflare protection and requires valid authentication cookies from your own MFC account.
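For reference, the four cookie values shown above can be turned into the `{ name, value, domain }` descriptors that Puppeteer's `page.setCookie` accepts. This is a minimal sketch, not the service's actual code — the helper name and the `.myfigurecollection.net` domain default are assumptions:

```typescript
// Shape mirrors the mfcAuth.sessionCookies object in the request body above.
interface MfcSessionCookies {
  PHPSESSID: string;
  sesUID: string;
  TBv4_Iden: string;
  TBv4_Hash: string;
}

// Hypothetical helper: map each cookie value to a Puppeteer-style
// cookie descriptor so it can be installed before navigation.
function toCookieDescriptors(
  cookies: MfcSessionCookies,
  domain = '.myfigurecollection.net'
): Array<{ name: string; value: string; domain: string }> {
  return Object.entries(cookies).map(([name, value]) => ({ name, value, domain }));
}

const descriptors = toCookieDescriptors({
  PHPSESSID: 'abc',
  sesUID: '42',
  TBv4_Iden: '42',
  TBv4_Hash: 'deadbeef'
});
// descriptors[0] is { name: 'PHPSESSID', value: 'abc', domain: '.myfigurecollection.net' }
```

In the service itself these descriptors would then be applied with something like `await page.setCookie(...descriptors)` before the scrape begins.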
Response (both endpoints):
{
"success": true,
"data": {
"imageUrl": "https://images.goodsmile.info/...",
"manufacturer": "Good Smile Company",
"name": "Nendoroid Hatsune Miku",
"scale": "1/1"
}
}
Get available pre-built site configurations.
Response:
{
"success": true,
"data": {
"mfc": {
"imageSelector": ".item-picture .main img",
"manufacturerSelector": "span[switch]",
"nameSelector": "span[switch]:nth-of-type(2)",
"scaleSelector": ".item-scale a[title=\"Scale\"]"
}
}
}
Health check endpoint for monitoring.
Get service version information for version management.
Response:
{
"name": "scraper",
"version": "1.0.0",
"status": "healthy"
}
Manually reset the browser pool for testing or emergency situations.
Security:
- Environment Protection: Only registered in non-production environments
- Authentication Required: Must provide a valid x-admin-token header
- Async Operation: Properly closes all browsers before resetting
Request Headers:
x-admin-token: <admin-token-value>
Response (Success):
{
"success": true,
"message": "Browser pool reset successfully"
}
Response (Unauthorized):
{
"success": false,
"message": "Forbidden"
}
Features:
- Clears all existing browser instances safely
- Recreates the browser pool
- Useful for manual browser pool management during testing
- Can be used to mitigate Cloudflare detection issues
Use Cases:
- Force browser pool refresh during testing
- Reset pool after detecting browser fingerprinting changes
- Emergency recovery from browser cache/session issues in test environments
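The token check described above can be sketched as a small pure function that produces the success and Forbidden responses shown earlier. This is illustrative only — the function name and header handling are assumptions, not the service's actual route code:

```typescript
// Hypothetical guard for the /reset-pool endpoint: compare the
// x-admin-token header against the configured admin token and
// return a 403 Forbidden envelope on mismatch.
function authorizeReset(
  headers: Record<string, string | undefined>,
  adminToken: string | undefined
): { status: number; body: { success: boolean; message: string } } {
  if (!adminToken || headers['x-admin-token'] !== adminToken) {
    return { status: 403, body: { success: false, message: 'Forbidden' } };
  }
  return {
    status: 200,
    body: { success: true, message: 'Browser pool reset successfully' }
  };
}

const denied = authorizeReset({}, 'secret');
const allowed = authorizeReset({ 'x-admin-token': 'secret' }, 'secret');
```

Note the guard also rejects when no admin token is configured at all, so an unset ADMIN_TOKEN never silently opens the endpoint.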
Get all active sessions with their status.
Response:
{
"success": true,
"data": {
"sessions": [
{
"sessionId": "abc12345...",
"isPaused": true,
"consecutiveFailures": 3,
"failedMfcIds": ["123456", "789012"],
"inCooldown": false,
"cooldownRemainingMs": 0
}
],
"count": 1,
"pausedCount": 1,
"inCooldownCount": 0
}
}
Resume a paused session to continue processing.
Response:
{
"success": true,
"message": "Session resumed, processing will continue"
}
Cancel all failed items for a session (removes them from queue).
Response:
{
"success": true,
"message": "Cancelled 3 failed items",
"data": { "cancelledCount": 3 }
}
Get detailed queue statistics for monitoring.
Response:
{
"success": true,
"data": {
"queues": { "hot": 10, "warm": 5, "cold": 100 },
"total": 115,
"processing": 1,
"completed": 50,
"failed": 2,
"rateLimit": {
"active": false,
"currentDelayMs": 3000
}
}
}
The scraper includes comprehensive test coverage with enhanced testing infrastructure and containerized test execution.
- Total Test Suites: 10
- Total Tests: 215 passing tests
- Code Coverage: 80%+ (Codecov quality gate)
- Testing Framework: Jest + TypeScript + Supertest
- Mocking Strategy: Complete Puppeteer API mocking
- Containerized Testing: Docker-based test execution with coverage extraction
- Enhanced Cloudflare Detection: Dedicated test suite for Cloudflare bypass validation
src/__tests__/
├── unit/
│ ├── genericScraper.test.ts # Core scraping functionality
│ ├── browserPool.test.ts # Browser pool management
│ ├── puppeteerAutomation.test.ts # Browser automation
│ ├── errorHandling.test.ts # Error scenarios
│ ├── mfcScraping.test.ts # MFC-specific tests
│ ├── performance.test.ts # Performance benchmarks
│ └── cloudflareDetection.test.ts # Enhanced Cloudflare detection
└── integration/
├── scraperRoutes.test.ts # API endpoint tests
└── inter-service/
└── backendCommunication.test.ts # Cross-service communication
Unit Tests (7 suites):
- Generic Scraper: SITE_CONFIGS validation, scraping logic, error handling
- Browser Pool: Pool management, concurrency, memory management
- Puppeteer Automation: Browser configuration, navigation, data extraction
- Error Handling: Network failures, timeouts, resource issues
- MFC Scraping: MFC-specific functionality and edge cases
- Performance: Response time benchmarks and efficiency tests
- Cloudflare Detection: Enhanced Cloudflare bypass validation and fuzzy matching
Integration Tests (1 suite):
- API Routes: All HTTP endpoints with various scenarios
Complete Puppeteer Mocking:
// Mock browser and page instances
const mockBrowser = {
newPage: jest.fn(),
close: jest.fn()
};
const mockPage = {
goto: jest.fn(),
evaluate: jest.fn(),
close: jest.fn(),
setViewport: jest.fn(),
setUserAgent: jest.fn()
};
Performance Testing:
// Example: Testing response time targets
it('should complete scraping within 5 seconds', async () => {
const startTime = Date.now();
await genericScraper.scrape(testUrl, config);
const duration = Date.now() - startTime;
expect(duration).toBeLessThan(5000);
});
Error Scenario Testing:
// Example: Testing browser failure handling
it('should handle browser launch failure', async () => {
mockPuppeteer.launch.mockRejectedValue(new Error('Browser launch failed'));
await expect(browserPool.getBrowser())
.rejects
.toThrow('Browser launch failed');
});
# WSL Setup Required: Install Node.js via NVM (see ../WSL_TEST_FIX_SOLUTION.md)
# Install dependencies
npm install
# Run all tests
npm test
# Run with coverage report
npm run test:coverage
# Run in watch mode (development)
npm run test:watch
# Run CI tests (no watch)
npm run test:ci
# Run containerized tests with coverage extraction
./test-container-coverage.sh
# Run specific test suite
npx jest src/__tests__/unit/genericScraper.test.ts
# Run tests matching pattern
npx jest --testNamePattern="MFC scraping"
TypeScript Test Configuration (tsconfig.test.json):
{
"extends": "./tsconfig.json",
"compilerOptions": {
"strict": false, // Relaxed type checking for tests
"noImplicitAny": false, // Allow implicit 'any' types
"strictNullChecks": false, // More flexible null handling
"skipLibCheck": true, // Skip type checking of declaration files
"types": ["jest", "node"] // Include Jest and Node types
},
"include": [
"src/**/__tests__/**/*", // Include all test files
"src/**/__mocks__/**/*" // Include mock implementations
]
}
Jest Configuration (jest.config.js):
module.exports = {
preset: 'ts-jest',
testEnvironment: 'node',
roots: ['<rootDir>/src'],
testMatch: [
'**/__tests__/**/*.test.ts',
'**/?(*.)+(spec|test).ts'
],
testPathIgnorePatterns: [
'/node_modules/',
'/__tests__/__mocks__/',
'/__tests__/fixtures/',
'/__tests__/setup.ts'
],
transform: {
'^.+\\.ts$': ['ts-jest', {
tsconfig: '<rootDir>/tsconfig.test.json',
diagnostics: { warnOnly: true }
}]
},
collectCoverageFrom: [
'src/**/*.ts',
'!src/**/*.d.ts',
'!src/index.ts'
],
coverageDirectory: 'coverage',
coverageReporters: ['text', 'lcov', 'html'],
setupFilesAfterEnv: ['<rootDir>/src/__tests__/setup.ts'],
testTimeout: 30000,
maxWorkers: 4,
// Enhanced Puppeteer Mocking
moduleNameMapper: {
'^puppeteer$': '<rootDir>/src/__tests__/__mocks__/puppeteer.ts'
},
// Comprehensive Mock Management
clearMocks: true,
resetMocks: true,
restoreMocks: true,
// Performance and Stability Enhancements
bail: false,
verbose: true
};
Key Testing Improvements:
- Introduced tsconfig.test.json for more flexible test compilation
- Relaxed TypeScript strict mode for easier test writing
- Added comprehensive type configuration for Jest and Node.js
- Improved mock type handling to reduce compilation friction
- Enhanced test file discovery and coverage reporting
- Added containerized testing with the test-container-coverage.sh script
- Enhanced Cloudflare detection testing with fuzzy matching validation
- Cross-service communication validation tests
Target Metrics:
- Response Time: 3-5 seconds per scraping operation
- Concurrent Capacity: 10+ simultaneous requests
- Browser Pool Efficiency: <1 second pool operations
- Memory Management: Proper cleanup after each operation
MFC Bulk Import Enhancements (Latest):
- Cookie Passthrough for All Items: Cookies are now passed for ALL items during bulk sync, not just NSFW content
  - Required for accessing user-specific data (collection status, prices, ownership info)
  - Ensures consistent authentication across the entire sync operation
- Development Server (tsx): Switched from ts-node-dev to tsx for faster startup
  - Uses esbuild for near-instant TypeScript compilation
  - Automatic .env file loading via the --env-file flag
  - Hot reload with tsx watch for seamless development
Security Enhancements:
- Protected the /reset-pool endpoint with authentication (x-admin-token)
- Conditional endpoint registration (not available in production)
- Async browser cleanup in BrowserPool.reset()
- Removed sensitive error details from API responses
- Enhanced Docker security (explicit file copying, no recursive COPY)
Docker Production Improvements:
- Fixed Chromium executable path for Alpine Linux (/usr/bin/chromium-browser)
- Dynamic healthcheck respects PORT environment variable
- Removed build fallback for fail-fast behavior
- Fixed .dockerignore to not exclude Dockerfiles from build context
- Added writable home directory for non-root user (Chromium requirement)
- Improved healthcheck security (no shell substitution)
Test Coverage Improvements:
- Achieved 80%+ code coverage (Codecov quality gate)
- Added comprehensive test suites for all routes
- Enhanced security testing for protected endpoints
- Improved mock implementations for async operations
- Better test isolation using jest.isolateModules() instead of jest.resetModules()
- Added try-finally blocks for guaranteed environment cleanup
BrowserPool Enhancements:
- Improved concurrency management
- Enhanced Cloudflare detection mechanism
- Optimized static state reset for better test isolation
- Proper async cleanup of browser resources
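The maxConcurrent behaviour described above can be sketched as a small promise-based semaphore: acquire() resolves immediately while slots remain and otherwise queues until a release hands a slot over. This is an illustrative pattern, not the service's actual BrowserPool implementation:

```typescript
// Minimal async semaphore: limits how many callers hold a slot at once.
class Semaphore {
  private queue: Array<() => void> = [];
  private inUse = 0;

  constructor(private readonly maxConcurrent: number) {}

  async acquire(): Promise<void> {
    if (this.inUse < this.maxConcurrent) {
      this.inUse++;
      return;
    }
    // No free slot: park until a release resolves us.
    await new Promise<void>(resolve => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the slot directly to the next waiter
    else this.inUse--;
  }
}

// cf. maxConcurrent: 10 in the pool configuration below
const limiter = new Semaphore(10);
void limiter;
```

The handoff in release() keeps inUse constant while waiters exist, so the concurrency ceiling is never exceeded even under a burst of requests.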
Concurrency Management Strategy:
// New BrowserPool concurrency control
const browserPool = new ConcurrentBrowserPool({
maxConcurrent: 10, // Configurable concurrent browser limit
maxQueueSize: 50, // Prevent overwhelming browser resources
timeoutMs: 30000 // Configurable request timeout
});
HTML Fixtures:
const MFC_FIGURE_HTML = `
<div class="item-picture">
<img src="https://images.goodsmile.info/test.jpg" alt="Test Figure">
</div>
<div class="item-details">
<span switch="Company">Test Company</span>
<span switch="Character">Test Character</span>
</div>
`;
# CI test command
NODE_ENV=test npm run test:ci
# Coverage reporting for CI
NODE_ENV=test npm run test:coverage
# Containerized testing (isolates dependencies)
./test-container-coverage.sh
The service includes a containerized testing script that runs all tests in a Docker environment:
# Run tests in isolated Docker container
./test-container-coverage.sh
Features:
- Isolated test environment with all dependencies
- Automated coverage report extraction
- Cross-platform compatibility
- Automatic browser opening of coverage reports (when available)
- Test results exported to the ./test-results/ directory
Output:
- Coverage reports: ./test-results/coverage/lcov-report/index.html
- Test results: ./test-results/reports/
See TESTING.md for comprehensive testing documentation including:
- Complete test strategy and methodology
- Detailed coverage breakdown
- Performance benchmarking
- Mock data and fixtures
- Maintenance guidelines
Configuration Files:
- .env.example - Template showing all environment variables
- .env - Your local configuration (gitignored, never commit this!)
Quick Start:
# Copy example (optional - defaults work for most cases)
cp .env.example .env
# Scraper typically works with defaults - no secrets required!
See .env.example for all configuration options including:
- Server port configuration
- Puppeteer Chrome path (for CI/CD)
- Admin token (for /reset-pool endpoint)
- Debug logging settings
# Install dependencies
npm install
# Start development server (uses tsx for fast startup)
npm run dev
# Build for production
npm run build
# Start production server
npm start
# Run tests in development
npm run test:watch
The build process generates JavaScript files and source maps:
- routes/ - Compiled route handlers
- services/ - Compiled service modules
- index.js - Main application entry point
- Source maps (.js.map) for debugging compiled code
# Watch mode for continuous testing
npm run test:watch
# Test specific functionality
npx jest browserPool --watch
# Performance testing
npx jest performance.test.ts
The service uses a multi-stage Dockerfile with the following build targets:
# Development (with hot reload, port 3080)
docker build --target development -t scraper:dev .
docker run -p 3080:3080 -e PORT=3080 --shm-size=2gb scraper:dev
# Test environment (port 3070)
docker build --target test -t scraper:test .
docker run -p 3070:3070 -e PORT=3070 --shm-size=2gb scraper:test
# Production (default, port 3050)
docker build -t scraper:prod .
docker run -p 3050:3050 -e PORT=3050 --shm-size=2gb scraper:prod
Available stages:
- base: Alpine Linux with Chromium and Puppeteer dependencies
- development: Includes devDependencies and nodemon for hot reload
- test: Test environment for CI/CD
- builder: Compiles TypeScript to JavaScript
- production: Optimized image with production dependencies only (default)
Note: --shm-size=2gb is required for Puppeteer to avoid memory issues with Chromium.
See .env.example for complete configuration template.
Required:
- PORT: Server port (prod: 3050, local dev: 3080, test: 3070, Coolify dev: 3090)
- NODE_ENV: Environment mode (development, test, production)
Required in Docker/Production:
- BACKEND_URL: Backend service URL for webhook callbacks during MFC sync
  - Docker prod: http://backend:5050
  - Docker Coolify dev: http://backend:5090
  - Local dev: http://localhost:5080
  - Must be reachable from the scraper container; used by the webhook client to send sync progress to the backend
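To illustrate how BACKEND_URL might be consumed by the webhook client, the sketch below assembles a callback URL and payload. Everything except BACKEND_URL itself is an assumption — the /api/sync/progress path, the payload shape, and the function name are illustrative, not the service's real contract:

```typescript
// Hypothetical helper: build the webhook target and JSON body for a
// sync-progress callback from the configured backend base URL.
function buildProgressWebhook(
  backendUrl: string,
  sessionId: string,
  progress: { completed: number; total: number }
): { url: string; body: string } {
  const base = backendUrl.replace(/\/+$/, ''); // tolerate a trailing slash
  return {
    url: `${base}/api/sync/progress/${sessionId}`,
    body: JSON.stringify(progress)
  };
}

const hook = buildProgressWebhook(
  'http://backend:5050/', // Docker prod value from the list above
  'abc12345',
  { completed: 3, total: 10 }
);
```

Normalizing the trailing slash up front avoids the classic `//api/...` double-slash bug when operators set BACKEND_URL with or without one.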
Optional:
- PUPPETEER_EXECUTABLE_PATH: Custom Chrome/Chromium executable path
  - Useful for CI/CD environments or custom browser installations
  - Example: /usr/bin/chromium-browser
- ADMIN_TOKEN: Authentication token for admin endpoints
  - Required for the /reset-pool endpoint in non-production environments
  - Simple string token for basic protection
MFC Cookie Security:
- MFC_ALLOWED_COOKIES: Whitelist of cookie names allowed during authenticated MFC scraping
  - Default: PHPSESSID,sesUID,sesDID,cf_clearance
  - Purpose: Security filter that only allows known MFC session cookies
  - Format: Comma-separated list of cookie names (case-sensitive)
  - Why needed: Prevents users from accidentally or maliciously injecting arbitrary cookies
  - Users provide session cookies via the API; this env var controls which ones are actually used
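The whitelist filter described above can be sketched in a few lines: keep only cookies whose names appear in the (case-sensitive) allow list and silently drop everything else. The function name is illustrative, and the default list is the documented MFC_ALLOWED_COOKIES default:

```typescript
// Hypothetical cookie filter mirroring the MFC_ALLOWED_COOKIES behaviour.
function filterAllowedCookies(
  provided: Record<string, string>,
  allowListCsv = 'PHPSESSID,sesUID,sesDID,cf_clearance' // documented default
): Record<string, string> {
  const allowed = new Set(allowListCsv.split(',').map(name => name.trim()));
  return Object.fromEntries(
    Object.entries(provided).filter(([name]) => allowed.has(name))
  );
}

const safe = filterAllowedCookies({
  PHPSESSID: 'abc',
  evil_tracking_cookie: 'x' // dropped: not in the whitelist
});
```

In the service the CSV would come from process.env.MFC_ALLOWED_COOKIES; passing it as a parameter here keeps the sketch self-contained.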
Debug Logging:
- DEBUG: Enable debug namespaces (e.g., scraper:*, scraper:mfc, scraper:browser)
- SERVICE_AUTH_TOKEN_DEBUG: Show partial tokens in logs for debugging (default: false)
Update your main application to call this service instead of direct scraping:
// MFC scraping (use environment-specific URL)
const scraperUrl = process.env.SCRAPER_SERVICE_URL || 'http://scraper:3000';
const response = await fetch(`${scraperUrl}/scrape/mfc`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ url: mfcLink })
});
// Generic scraping
const genericResponse = await fetch(`${scraperUrl}/scrape`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
url: 'https://example.com/item/123',
config: { imageSelector: '.product img' }
})
});
This service runs separately from your main application to:
- Isolate browser automation resource usage
- Prevent main app crashes from scraping failures
- Allow independent scaling and updates
- Provide better browser fingerprinting
- Browser Pool: 3 pre-launched browsers eliminate 2-3 second startup delay
- Fresh Sessions: Each request gets clean browser to bypass anti-bot detection
- Auto-Replenishment: Pool automatically replaces used browsers in background
- Optimized Chrome: Container-optimized flags for minimal resource usage
- Graceful Shutdown: Proper browser cleanup on service termination
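The take-then-replenish pattern in the list above can be sketched generically: hand out a pre-created resource immediately and kick off a background create to refill the slot. Here `create` stands in for a browser launch; the class and its API are illustrative, not the actual BrowserPool:

```typescript
// Generic pre-launch pool: init() warms N resources, take() returns one
// instantly and replenishes in the background.
class PrelaunchPool<T> {
  private ready: T[] = [];

  constructor(
    private readonly create: () => Promise<T>, // e.g. a browser launch
    private readonly size: number
  ) {}

  async init(): Promise<void> {
    this.ready = await Promise.all(
      Array.from({ length: this.size }, () => this.create())
    );
  }

  async take(): Promise<T> {
    const item = this.ready.shift();
    if (item !== undefined) {
      // Refill in the background so the caller is never blocked on a launch.
      void this.create().then(fresh => this.ready.push(fresh));
      return item;
    }
    return this.create(); // pool drained: fall back to a cold start
  }

  get available(): number {
    return this.ready.length;
  }
}
```

This is why a warmed pool turns a 2-3 second browser startup into an effectively instant take(), at the cost of keeping idle browsers resident.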
To add support for a new site, update SITE_CONFIGS in src/services/genericScraper.ts:
export const SITE_CONFIGS = {
mfc: { /* existing config */ },
hobbylink: {
imageSelector: '.product-main-image img',
manufacturerSelector: '.maker-name',
nameSelector: '.product-name h1',
scaleSelector: '.scale-info .value',
waitTime: 1500
}
};
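To see how such a config drives extraction, here is a browser-free sketch: a selector-to-text lookup stands in for the live page (in the service this role would be played by document.querySelector inside page.evaluate). The SiteConfig shape mirrors the configs above; the lookup type, function name, and fake page data are assumptions for illustration:

```typescript
interface SiteConfig {
  imageSelector: string;
  manufacturerSelector: string;
  nameSelector: string;
  scaleSelector: string;
  waitTime?: number;
}

// Stand-in for querying the rendered page: selector in, text/attribute out.
type PageLookup = (selector: string) => string | null;

// Apply a site config to a page lookup to produce the response fields.
function extractWithConfig(query: PageLookup, config: SiteConfig) {
  return {
    imageUrl: query(config.imageSelector),
    manufacturer: query(config.manufacturerSelector),
    name: query(config.nameSelector),
    scale: query(config.scaleSelector)
  };
}

// Fake page content keyed by selector, echoing the hobbylink config above.
const fakePage = new Map<string, string>([
  ['.product-main-image img', 'https://example.com/a.jpg'],
  ['.maker-name', 'Example Works'],
  ['.product-name h1', 'Example Figure'],
  ['.scale-info .value', '1/7']
]);

const result = extractWithConfig(sel => fakePage.get(sel) ?? null, {
  imageSelector: '.product-main-image img',
  manufacturerSelector: '.maker-name',
  nameSelector: '.product-name h1',
  scaleSelector: '.scale-info .value',
  waitTime: 1500
});
```

Because extraction is entirely selector-driven, adding a site really is just adding an entry to SITE_CONFIGS; no scraping logic changes.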