Common Crawl Data Extraction

Extract data from common crawl using elastic map reduce

Note: This project uses Python 2.7.11

CommonCrawlJob is a framework which wraps the MRJob hadoop library for streaming analytics over internet scale data.

For more information on using MRJob framework.

Setup

To develop locally, you will need to install the mrjob Hadoop streaming framework library, and the boto library for accessing amazon cloud public dataset resources.

Use pip to install these libraries.

$ pip install CommonCrawlJob

Getting Started

To first get started, we are going to create a Google Analytics extractor. We will go from start to finish in creating a Common Crawl extractor that uses regular expression capture groups to extract google analytics tracker id's.

First let's create a file GoogleAnalytics.py.

$ touch GoogleAnalytics.py

Using a text editor, write to this file

import re

from ccjob import CommonCrawl

class GATagJob(CommonCrawl):

    def process_record(self, body):
        # Regular Expression for Google Analytics Tracker
        pat = re.compile(r"[\"\']UA-(\d+)-(\d)+[\'\"]")

        for match in pat.finditer(body):
            if match:
                yield match.groups()[0]

        self.increment_counter('commoncrawl', 'processed_document', 1)


if __name__ == '__main__':
    GATagJob.run()

Our GATagJob class has one method process_record taking in one argument containing the body of a HTML file and yields the results matching our regular expression.

All common crawl jobs will generally obey this pattern.

Testing Locally

Run the Google Analytics extractor locally to test your script.

$ python GoogleAnalytics.py -r local <(tail -n 1 data/latest.txt)

Region Configuration

For best performance, you should launch the cluster in the same region as your data. Currently data from aws-publicdatasets are stored in us-east-1, which is where you want to point your EMR cluster.

Common Crawl Region

S3:	US Standard
EMR:	US East (N. Virginia)
API:	`us-east-1`

Create an Amazon EC2 Key Pair and PEM File

Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone have access to the instances that you launch.

The PEM file associated with this key pair is required to ssh directly to the master node of the cluster.

To create an Amazon EC2 key pair:

Go to the Amazon EC2 console
In the Navigation pane, click Key Pairs
On the Key Pairs page, click Create Key Pair
In the Create Key Pair dialog box, enter a name for your key pair, such as, mykeypair
Click Create
Save the resulting PEM file in a safe location

Configuring `mrjob.conf`

Make sure to download an EC2 Key Pair pem file for your map reduce job and add it to the ec2_key_pair and ec2_key_pair_file variables.

Make sure that the PEM file has permissions set properly by running

$ chown 600 $MY_PEM_FILE

Download the latest version of python to send to your EMR instances.

$ wget https://www.python.org/ftp/python/2.7.11/Python-2.7.11.tgz

Create a mrjob.conf file to set up your configuration parameters to match that of AWS.

There is a default configuration template located at mrjob.conf.template that you can use.

runners:
  emr:
    aws_region: 'us-east-1'
    aws_access_key_id: <Your AWS_ACCESS_KEY_ID>
    aws_secret_access_key: <Your AWS_SECRET_ACCESS_KEY>
    cmdenv:
        AWS_ACCESS_KEY_ID: <Your AWS_ACCESS_KEY_ID>
        AWS_SECRET_ACCESS_KEY: <Your AWS_SECRET_ACCESS_KEY>
    ec2_key_pair: <Path to your PEM file>
    ec2_key_pair_file: <Name of the Key>
    ssh_tunnel_to_job_tracker: true
    ec2_instance_type: 'm1.xlarge'
    ec2_master_instance_type: 'm1.xlarge'
    emr_tags:
        name: '<Your Project Name>'
    num_ec2_instances: 12
    ami_version: '2.4.10'
    python_bin: python2.7
    interpreter: python2.7
    bootstrap_action:
        - s3://elasticmapreduce/bootstrap-actions/install-ganglia
    upload_files:
        - CommonCrawl.py
    bootstrap:
        - tar xfz Python-2.7.11.tgz#
        - cd Python-2.7.11
        - ./configure && make && sudo make install
        - sudo python2.7 get-pip.py#
        - sudo pip2 install --upgrade pip setuptools wheel
        - sudo pip2 install -r requirements.txt#

Run on Amazon Elastic MapReduce

First copy the mrjob.conf.template into mrjob.conf

Note: > Make sure to fill out the necessary AWS credentials with your information

$ python GoogleAnalytics.py -r emr \
                            --conf-path="mrjob.conf" \
                            --output-dir="s3n://$S3_OUTPUT_BUCKET" \
                           data/arcindex.txt

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
ccjob		ccjob
data		data
docs		docs
.gitignore		.gitignore
LICENSE		LICENSE
README.rst		README.rst
mrjob.conf.template		mrjob.conf.template
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Common Crawl Data Extraction

Setup

Getting Started

Testing Locally

Region Configuration

Common Crawl Region

Create an Amazon EC2 Key Pair and PEM File

To create an Amazon EC2 key pair:

Configuring `mrjob.conf`

Run on Amazon Elastic MapReduce

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Common Crawl Data Extraction

Setup

Getting Started

Testing Locally

Region Configuration

Common Crawl Region

Create an Amazon EC2 Key Pair and PEM File

To create an Amazon EC2 key pair:

Configuring mrjob.conf

Run on Amazon Elastic MapReduce

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Configuring `mrjob.conf`

Packages