This project processes URLs to fetch Lighthouse scores using the Google PageSpeed Insights API.
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/my_project.git
  cd my_project
  ```
- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Add your Google API key: update `API_KEY` in `config.py` with your Google PageSpeed Insights API key.
- Place your input data: ensure that your `cwv.csv` file is in the `data` directory.
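The script reads the key from `config.py`; a minimal sketch of that file might look like the following (the placeholder value is an assumption — substitute your own key):

```python
# config.py -- minimal sketch of the expected configuration module.
# Replace the placeholder with your own Google PageSpeed Insights API key.
API_KEY = "YOUR_PAGESPEED_INSIGHTS_API_KEY"
```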
- All URLs in the `cwv.csv` file must be in the format `https://www.example.com`.
- Ensure that each URL starts with `https://` and includes `www.` to avoid any issues with API requests.
- The `cwv.csv` file must include a `platform` column, which differentiates between "Carrot" and "Non-Carrot" sites.
  - This data comes directly from the script output at carrot-serp-compare. The `TRUE` and `FALSE` values from that script must be converted to "Carrot" and "Non-Carrot", respectively.
  - This differentiation allows the comparison file produced at the end of processing to report mean CWV (Core Web Vitals) scores for Carrot vs. Non-Carrot sites.
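The URL checks and the TRUE/FALSE-to-platform conversion above can be sketched as follows. This assumes a `url` column alongside the `platform` column; the actual loading code in `main.py` may differ:

```python
import csv
import io

REQUIRED_PREFIX = "https://www."

def normalize_platform(value: str) -> str:
    """Map the TRUE/FALSE flags from carrot-serp-compare to platform labels."""
    return "Carrot" if value.strip().upper() == "TRUE" else "Non-Carrot"

def validate_rows(rows):
    """Yield (url, platform) pairs, rejecting URLs that would break API requests."""
    for row in rows:
        url = row["url"].strip()
        if not url.startswith(REQUIRED_PREFIX):
            raise ValueError(f"URL must start with {REQUIRED_PREFIX!r}: {url!r}")
        yield url, normalize_platform(row["platform"])

# In practice the reader would open data/cwv.csv; an in-memory sample keeps
# this sketch self-contained.
sample = io.StringIO("url,platform\nhttps://www.example.com,TRUE\n")
print(list(validate_rows(csv.DictReader(sample))))
# [('https://www.example.com', 'Carrot')]
```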
- Run the main script:

  ```bash
  python main.py
  ```
- The processed Lighthouse scores will be saved in `data/lighthouse_scores.csv`.
- The comparison results will be saved in `data/comparison_results.csv`.
- Errors and logs will be saved in `logs/errors.log`.
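The comparison step boils down to grouping scores by platform and averaging. A minimal sketch, assuming a `performance` score column (the real column names in `data/lighthouse_scores.csv` may differ):

```python
from collections import defaultdict
from statistics import mean

def mean_scores_by_platform(rows):
    """Average the Lighthouse performance score for Carrot vs Non-Carrot sites."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["platform"]].append(float(row["performance"]))
    return {platform: mean(scores) for platform, scores in buckets.items()}

# Hypothetical rows as they might appear in lighthouse_scores.csv:
rows = [
    {"platform": "Carrot", "performance": "90"},
    {"platform": "Carrot", "performance": "80"},
    {"platform": "Non-Carrot", "performance": "60"},
]
print(mean_scores_by_platform(rows))
# {'Carrot': 85.0, 'Non-Carrot': 60.0}
```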
- The current setup uses parallel processing to speed up the fetching of Lighthouse scores.
- URLs are processed concurrently using Python's `concurrent.futures.ThreadPoolExecutor`, which significantly reduces the total processing time. That said, the script can still take a long time to run when there are thousands of URLs. When run in the virtual environment, the console prints progress updates such as "X/Y processed".
- By default, the script uses 10 threads to handle multiple requests in parallel. This can be adjusted by modifying the `max_workers` parameter in the `main.py` script.
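The concurrency pattern described above can be sketched as follows, with a hypothetical `fetch_score` standing in for the real PageSpeed Insights request made in `main.py`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_score(url: str) -> int:
    """Stand-in for the real PageSpeed Insights call; returns a dummy value."""
    return len(url)

urls = [f"https://www.example{i}.com" for i in range(5)]
results = {}

# max_workers mirrors the default of 10 threads described above.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch_score, url): url for url in urls}
    for done, future in enumerate(as_completed(futures), start=1):
        results[futures[future]] = future.result()
        print(f"{done}/{len(urls)} processed")  # the "X/Y processed" updates

print(len(results))  # 5
```

Because `as_completed` yields futures as they finish, the progress counter advances even when individual requests are slow, which is why the console updates stay responsive on large URL lists.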