A lightweight pipeline to collect and analyse 'EU Mission on Adaptation' related data.
Create and activate a virtual environment, then install dependencies:
```
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Generated data is kept separate from code:

- `data/links/` contains link lists produced by the home spider.
- `data/pages/` contains one JSON file per scraped story page.
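For example, after both spiders have run, the layout might look like this (the file names are illustrative):

```
data/
  links/
    links.json   # link list written by the home spider
  pages/
    ...          # one JSON file per scraped story page
```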
Basic crawl:
```
scrapy crawl adaptation_stories_home -O data\links\links.json
```

Limit pages (example):

```
scrapy crawl adaptation_stories_home -a max_pages=3 -O data\links\links_test.json
```

Scrape the story pages from a link list:

```
scrapy crawl adaptation_stories_pages -a input_file=data/links/links.json
```

Limit how many story pages are scraped:

```
scrapy crawl adaptation_stories_pages -a input_file=data/links/links.json -a max_links=3
```

Run the parser smoke test:

```
pytest -q
```

Docs:

- `ARCHITECTURE.md`
- `AI_GUIDE.md`
- The `scripts.run_analysis` CLI accepts `--use-case` and loads source and prompt paths from `config/analysis_use_cases.json`.
- Each use case must define `source_path`, `system_prompt_path`, and `user_prompt_path`; the CLI raises an error if any of these files is missing or empty.
- If `--provider` or `--output` is omitted, the CLI falls back to `PROVIDER` and `OUTPUT_DIR` from `.env` (see the sketch after this list).
- When `--use-case` is present, the CLI writes into a timestamped subfolder under `--output` by default, even if `--timestamped-output-dir` is not passed.
- When you explicitly pass a path argument on the CLI (`--input`, `--output`, `--file`, `--source-path`, `--system-prompt-file`, or `--user-prompt-file`), it must be an absolute path.
- `--user-prompt-file` overrides the use case's `user_prompt_path` when `--use-case` is also specified.
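For reference, a minimal `.env` sketch for that fallback (the values shown are illustrative, not the project's actual defaults):

```
PROVIDER=mock
OUTPUT_DIR=C:/absolute/path/to/data/analysis
```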
Run the analysis stub over saved pages:
```
python -m scripts.run_analysis --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5
```

Suppress progress output:

```
python -m scripts.run_analysis --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5 --quiet
```

Overwrite existing analysis outputs:

```
python -m scripts.run_analysis --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5 --overwrite
```

Create a timestamped output subfolder (for example `data/analysis/20260227_143015`):

```
python -m scripts.run_analysis --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5 --timestamped-output-dir
```

If you combine `--timestamped-output-dir` with `--overwrite`, the script prints a warning because each run writes to a new folder, so overwrite has no practical effect.
Dry run (no files written):
```
python -m scripts.run_analysis --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5 --dry-run
```

Select provider and model:

```
python -m scripts.run_analysis --provider openai --model gpt-4o --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis
```

Use the mock provider (no API calls, no token usage):

```
python -m scripts.run_analysis --provider mock --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5
```

Example for use inside the EEA virtual machine:

```
python -m scripts.run_analysis --provider eea --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5
```

Run a configured use case from `config/analysis_use_cases.json`:

```
python -m scripts.run_analysis --use-case adaptation_stories --output C:/absolute/path/to/data/analysis --max-items 5
```

For use-case runs, the CLI prints `run_id: <timestamp>` after a successful run. Use that value to export a specific run later.
Override a use-case source path from the CLI with --input:
```
python -m scripts.run_analysis --use-case adaptation_stories --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5
```

Override a use-case source path from the CLI with `--source-path`:

```
python -m scripts.run_analysis --use-case adaptation_stories --source-path C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis --max-items 5
```

Run the Excel-backed `question_2_1_1_column_7` use case through the main analysis CLI:

```
python -m scripts.run_analysis --use-case question_2_1_1_column_7 --output C:/absolute/path/to/data/analysis --max-items 5
```

Override Excel-specific use-case settings from the CLI:

```
python -m scripts.run_analysis --use-case question_2_1_1_column_7 --input C:/absolute/path/to/data/data_sources/2_1_1.xlsx --sheet-name "2.1.1" --column-name "col7_Please explain" --header-row 1 --output C:/absolute/path/to/data/analysis --max-items 5
```

Override the system prompt file for a run:

```
python -m scripts.run_analysis --use-case adaptation_stories --system-prompt-file C:/absolute/path/to/prompts/system_prompt_custom.txt --output C:/absolute/path/to/data/analysis --max-items 5
```

Override the user prompt file for a run:

```
python -m scripts.run_analysis --use-case adaptation_stories --user-prompt-file C:/absolute/path/to/prompts/user_prompt_custom.txt --output C:/absolute/path/to/data/analysis --max-items 5
```

Analyze a single saved page JSON file:

```
python -m scripts.run_analysis --file C:/absolute/path/to/data/pages/example_page.json --output C:/absolute/path/to/data/analysis --provider mock
```

Run the API server:

```
uvicorn api.app:app --host 127.0.0.1 --port 8000 --reload
```

With plain uvicorn, set environment variables before startup:

```
$env:OUTPUT_DIR="data/analysis"; uvicorn api.app:app --host 127.0.0.1 --port 8000 --reload
```

Run the API server with configurable default output/export directories:

```
python -m scripts.run_analysis_api --host 127.0.0.1 --port 8000 --reload --output-dir C:/absolute/path/to/data/analysis --export-dir C:/absolute/path/to/data/exports
```

You can also keep defaults in a config file (`.env.api` by default):

```
OUTPUT_DIR=C:/absolute/path/to/eea-ai-mission-aipossible/data/analysis
EXPORT_DIR=C:/absolute/path/to/eea-ai-mission-aipossible/data/exports
PROVIDER=mock
# API_MODEL=mock-model
# API_API_KEY=
```
When --output-dir or --export-dir is passed, it overrides config-file values for that server run.
When --provider, --model, or --api-key is passed, it overrides PROVIDER, API_MODEL, or API_API_KEY.
If you do not pass --config-file, the server looks for .env.api in the repo root and exits with an error if it is missing.
/v1/analysis/runs fails with 404 if the configured OUTPUT_DIR does not exist, or if the selected use case points to a missing source path.
/v1/analysis/runs returns 400 with a clear message if provider credentials are missing (for example, a missing .env.<provider>.keys file and no API_API_KEY override).
Health check:
```
Invoke-RestMethod -Method GET http://127.0.0.1:8000/health
```

Run analysis (sync):

```
Invoke-RestMethod -Method POST http://127.0.0.1:8000/v1/analysis/runs -ContentType "application/json" -Body '{"use_case":"adaptation_stories","max_items":3}'
```

Run analysis with inline prompt override (JSON):

```
$body = @{
    use_case = "adaptation_stories"
    max_items = 3
    user_prompt = @"
Analyse the following climate adaptation report text.
I would like you to analyse the following 3 questions.
1. Simplified Title
2. Locality
3. Geographic extent
"@
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Method POST http://127.0.0.1:8000/v1/analysis/runs -ContentType "application/json" -Body $body
```

Run analysis by uploading a prompt file (.txt):

```
Invoke-RestMethod -Method POST http://127.0.0.1:8000/v1/analysis/runs/upload-prompt `
  -Form @{ use_case = "adaptation_stories"; prompt_file = Get-Item ".\analysis\prompts\user_prompt.txt"; max_items = "3" }
```

The response includes `run_id`, which is the folder name created under `data/analysis` for that run.
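Only `run_id` is documented here; purely as a hypothetical sketch, the JSON response could look like this (any other fields the real response may carry are omitted):

```json
{
  "run_id": "20260227_143015"
}
```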
- Provider, model, and API key for runs are configured at API server level (`.env.api` or `scripts.run_analysis_api` arguments), not in the run request payload.
- The JSON run request payload requires `use_case` and accepts `max_items` (optional) and `user_prompt` (optional); see the sketch after this list.
- The upload endpoint accepts multipart form fields `use_case` (required), `prompt_file` (`.txt`, UTF-8), and `max_items` (optional).
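Combining those fields, a complete JSON body for the sync endpoint might look like this (the `user_prompt` text is only an example):

```json
{
  "use_case": "adaptation_stories",
  "max_items": 3,
  "user_prompt": "Analyse the following climate adaptation report text."
}
```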
Use-case presets are defined in config/analysis_use_cases.json. Each entry must define:
- `source_type` (`pages` or `excel`)
- `source_path`: folder of page JSON files (`pages`) or an Excel file (`excel`)
- `system_prompt_path`: absolute path to the system prompt `.txt` file
- `user_prompt_path`: absolute path to the user prompt `.txt` file
- For `excel` use cases: `sheet_name`, `column_name`, and optionally `header_row` (default: `1`)
Available use cases:
- `adaptation_stories`: `source_type=pages`
- `question_2_1_1_column_7`: `source_type=excel`
- `question_4_8_column_7`: `source_type=excel`
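As a sketch, and assuming the file maps use-case names to their settings, an `excel` entry could look like this (the paths are placeholders and the exact key layout may differ):

```json
{
  "question_2_1_1_column_7": {
    "source_type": "excel",
    "source_path": "C:/absolute/path/to/data/data_sources/2_1_1.xlsx",
    "system_prompt_path": "C:/absolute/path/to/prompts/system_prompt.txt",
    "user_prompt_path": "C:/absolute/path/to/prompts/user_prompt.txt",
    "sheet_name": "2.1.1",
    "column_name": "col7_Please explain",
    "header_row": 1
  }
}
```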
The Analysis API also reads input sources from these use-case presets; there is no separate API_INPUT_DIR setting anymore.
When you use python -m scripts.run_analysis --use-case ..., the CLI loads these preset values and lets explicit CLI flags override them.
Runs are written into timestamped output subfolders (output_dir/YYYYMMDD_HHMMSS) by default.
Download all analysis files for a run as ZIP:
Invoke-WebRequest -Method GET "http://127.0.0.1:8000/v1/analysis/runs/<run_id>/download" -OutFile "run_<run_id>.zip"Download Excel export for one run folder:
Invoke-WebRequest -Method GET "http://127.0.0.1:8000/v1/analysis/export/excel?run_id=<run_id>" -OutFile "analysis_<run_id>.xlsx"The API writes the workbook to EXPORT_DIR/<run_id>/analysis_<run_id>.xlsx and then streams that same file in the response.
Provider-specific defaults still come from:
- `.env.openai` and `.env.openai.keys`
- `.env.eea` and `.env.eea.keys`
Use the EEA provider:
```
python -m scripts.run_analysis --provider eea --model eea-model --input C:/absolute/path/to/data/pages --output C:/absolute/path/to/data/analysis
```

Use the EEA provider with a configured Excel use case:

```
python -m scripts.run_analysis --provider eea --model eea-model --use-case question_2_1_1_column_7 --output C:/absolute/path/to/data/analysis --max-items 5
```

The export CLIs also fall back to `.env`: `OUTPUT_DIR` is used as the default analysis input folder and `EXPORT_DIR` as the default export destination.
Export analysis JSON files to Excel:
```
python -m scripts.export_analysis_excel --input data/analysis --output data/exports/analysis.xlsx
```

Export one timestamped run folder to Excel (output is saved to `<EXPORT_DIR>/<run_id>/analysis.xlsx` automatically):

```
python -m scripts.export_analysis_excel --run-id 20260227_143015
```

Disable default formatting options:

```
python -m scripts.export_analysis_excel --input data/analysis --output data/exports/analysis_plain.xlsx --no-header-bold --no-auto-width --no-wrap-text --no-freeze-panes
```

Export `ai_result` to Markdown files:

```
python -m scripts.export_analysis --input data/analysis --output data/exports
```

Export one timestamped run folder to Markdown (output is saved to `<EXPORT_DIR>/<run_id>/` automatically):

```
python -m scripts.export_analysis --run-id 20260227_143015
```

Combine all outputs into one file:

```
python -m scripts.export_analysis --input data/analysis --output data/exports --combine
```

Skip the metadata header:

```
python -m scripts.export_analysis --input data/analysis --output data/exports --no-header
```

Overwrite existing export files:

```
python -m scripts.export_analysis --input data/analysis --output data/exports --overwrite
```

Suppress export progress output:

```
python -m scripts.export_analysis --input data/analysis --output data/exports --quiet
```

Dry run for export (no files written):

```
python -m scripts.export_analysis --input data/analysis --output data/exports --dry-run
```

Run AI pre-analysis on an Excel data source column:

```
python -m scripts.run_pre_analysis --input-file data/data_sources/excel_filename.xlsx --sheet-name "Sheet1" --column "column_name" --max-rows 5
```

Specify the header row if it is not the first row:

```
python -m scripts.run_pre_analysis --input-file data/data_sources/excel_filename.xlsx --sheet-name "Sheet1" --column "column_name" --header-row 2 --max-rows 5
```

Overwrite only the report:

```
python -m scripts.run_pre_analysis --input-file data/data_sources/excel_filename.xlsx --sheet-name "Sheet1" --column "column_name" --max-rows 5 --overwrite-report
```

Overwrite only the row outputs:

```
python -m scripts.run_pre_analysis --input-file data/data_sources/2_1_1.xlsx --sheet-name "Sheet1" --column "col7_Please explain" --max-rows 5 --overwrite-rows
```

This repository includes automated security checks in GitHub:
- Dependabot (`.github/dependabot.yml`) creates weekly update PRs for Python dependencies and GitHub Actions.
- Security Scan workflow (`.github/workflows/security-scan.yml`) runs on push/PR/schedule and fails the pipeline when:
  - dependency vulnerabilities with HIGH or CRITICAL severity are found
  - code scanning finds high-severity Python security issues (Bandit)
