Extract data from Common Crawl using Elastic MapReduce
Note: This project uses Python 2.7.11
CommonCrawlJob is a framework that wraps the MRJob Hadoop library for streaming
analytics over internet-scale data.
For more information on using the MRJob framework, see the mrjob documentation.
To develop locally, you will need to install the mrjob Hadoop
streaming framework and the boto library for accessing Amazon cloud
public dataset resources.
Use pip to install these libraries.

    $ pip install CommonCrawlJob

To get started, we are going to create a Google Analytics extractor. We will go from start to finish in creating a Common Crawl extractor that uses regular expression capture groups to extract Google Analytics tracker IDs.

First, let's create a file named GoogleAnalytics.py.

    $ touch GoogleAnalytics.py

Using a text editor, write the following to this file:


    import re

    from ccjob import CommonCrawl


    class GATagJob(CommonCrawl):

        def process_record(self, body):
            # Regular expression for a quoted Google Analytics tracker ID (UA-<account>-<property>)
            pat = re.compile(r"[\"\']UA-(\d+)-(\d+)[\'\"]")

            # finditer() yields only successful matches; emit the account number (first group)
            for match in pat.finditer(body):
                yield match.groups()[0]

            # Count each processed document once
            self.increment_counter('commoncrawl', 'processed_document', 1)


    if __name__ == '__main__':
        GATagJob.run()

Our GATagJob class has one method, process_record, which takes one argument containing
the body of an HTML file and yields the results matching our regular expression.
Common Crawl jobs will generally follow this pattern.
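
To see what the capture groups return without installing ccjob or mrjob, the same pattern can be tried on a plain string. This is only a minimal sketch: the helper name `extract_account_ids` and the sample page body are made up for illustration and are not part of CommonCrawlJob.

```python
import re

# Same pattern as in GATagJob: a quoted "UA-<account>-<property>" tracker ID.
GA_PATTERN = re.compile(r"[\"']UA-(\d+)-(\d+)[\"']")


def extract_account_ids(body):
    """Return the account number (first capture group) of every tracker ID in body."""
    return [match.group(1) for match in GA_PATTERN.finditer(body)]


# Hypothetical page body, invented for this example.
sample_body = """
<script>
  ga('create', 'UA-12345678-1', 'auto');
  _gaq.push(['_setAccount', "UA-87654-12"]);
</script>
"""

print(extract_account_ids(sample_body))
```

Only the first group is returned here, mirroring `match.groups()[0]` in the job; the second group (the property index) is captured but unused.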
Run the Google Analytics extractor locally to test your script.

    $ python GoogleAnalytics.py -r local <(tail -n 1 data/latest.txt)

For best performance, you should launch the cluster in the same region
as your data. Currently, data from aws-publicdatasets is stored in
us-east-1, which is where you want to point your EMR cluster.

| Service | Region |
|---|---|
| S3 | US Standard |
| EMR | US East (N. Virginia) |
| API | us-east-1 |

Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone have access to the instances that you launch.
The PEM file associated with this key pair is required to ssh directly to the master node of the cluster.
- Go to the Amazon EC2 console
- In the Navigation pane, click Key Pairs
- On the Key Pairs page, click Create Key Pair
- In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair
- Click Create
- Save the resulting PEM file in a safe location
Make sure to download an EC2 key pair PEM file for your MapReduce
job and add it to the ec2_key_pair (key pair name) and ec2_key_pair_file
(path to the PEM file) variables.
Make sure that the PEM file has its permissions set properly by running

    $ chmod 600 $MY_PEM_FILE

Download the Python 2.7.11 source to send to your EMR instances.

    $ wget https://www.python.org/ftp/python/2.7.11/Python-2.7.11.tgz

Create a mrjob.conf file to set up your configuration parameters to match
your AWS setup.
There is a default configuration template located at mrjob.conf.template that you can use.

    runners:
      emr:
        aws_region: 'us-east-1'
        aws_access_key_id: <Your AWS_ACCESS_KEY_ID>
        aws_secret_access_key: <Your AWS_SECRET_ACCESS_KEY>
        cmdenv:
          AWS_ACCESS_KEY_ID: <Your AWS_ACCESS_KEY_ID>
          AWS_SECRET_ACCESS_KEY: <Your AWS_SECRET_ACCESS_KEY>
        ec2_key_pair: <Name of the Key>
        ec2_key_pair_file: <Path to your PEM file>
        ssh_tunnel_to_job_tracker: true
        ec2_instance_type: 'm1.xlarge'
        ec2_master_instance_type: 'm1.xlarge'
        emr_tags:
          name: '<Your Project Name>'
        num_ec2_instances: 12
        ami_version: '2.4.10'
        python_bin: python2.7
        interpreter: python2.7
        bootstrap_action:
          - s3://elasticmapreduce/bootstrap-actions/install-ganglia
        upload_files:
          - CommonCrawl.py
        bootstrap:
          - tar xfz Python-2.7.11.tgz#
          - cd Python-2.7.11
          - ./configure && make && sudo make install
          - sudo python2.7 get-pip.py#
          - sudo pip2 install --upgrade pip setuptools wheel
          - sudo pip2 install -r requirements.txt#

First, copy mrjob.conf.template to mrjob.conf.

Note: Make sure to fill in the necessary AWS credentials with your information.

Then run the job on EMR:

    $ python GoogleAnalytics.py -r emr \
             --conf-path="mrjob.conf" \
             --output-dir="s3n://$S3_OUTPUT_BUCKET" \
             data/arcindex.txt
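
Before launching the job on EMR, it can be worth confirming that no template placeholders such as `<Your AWS_ACCESS_KEY_ID>` are left in mrjob.conf. The following is a minimal sketch of such a check using only the standard library; the function name `find_placeholders` and the example excerpt are hypothetical and not part of mrjob or CommonCrawlJob.

```python
import re

# Matches angle-bracket template placeholders such as <Your AWS_ACCESS_KEY_ID>.
PLACEHOLDER = re.compile(r"<[^<>\n]+>")


def find_placeholders(conf_text):
    """Return (line_number, placeholder) pairs for every unfilled template value."""
    found = []
    for number, line in enumerate(conf_text.splitlines(), 1):
        for match in PLACEHOLDER.finditer(line):
            found.append((number, match.group(0)))
    return found


# Hypothetical excerpt of a partially edited mrjob.conf.
example_conf = """\
runners:
  emr:
    aws_region: 'us-east-1'
    aws_access_key_id: <Your AWS_ACCESS_KEY_ID>
    ec2_key_pair: mykeypair
"""

for number, placeholder in find_placeholders(example_conf):
    print("line %d still contains %s" % (number, placeholder))
```

To check a real file, pass its contents instead of the excerpt, for example `find_placeholders(open("mrjob.conf").read())`. The check is purely textual, so it would also flag any legitimate value that happens to be wrapped in angle brackets.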