
Data Engineering Technologies, Tools, and Practices

Task, Workflow Managers

Articles

AWS Tools

data-engineering-tools

A list of tools and whatnot under the umbrella of Data Engineering

Workflow Managers - ETL

Systems Engineering Overview

Things a Data Engineer should know

Cloud Computing (e.g. AWS, Google Cloud, Azure)

  • AWS resources:
    • S3
    • EC2
    • RDS
    • Redshift
    • EMR
    • Kinesis
    • Athena
    • Lambda
    • VPC
    • Glue
    • SageMaker
  • AWS tools:
    • Boto3 (Python)
    • AWS CLI

Job Scheduling

  • Cron, Airflow, Oozie, Luigi, and/or AWS Step Functions

Frameworks

  • Scheduling/Workflows: Airflow, Oozie, Luigi, Cron, and/or AWS Step Functions
  • Spark
  • Data Transformation: Pandas, Dask
  • ML Pipelines: NumPy, scikit-learn

Languages

  • Python
  • Bash
  • SQL
  • Optional: Scala (for Spark), Java (for Spark, Kafka, or Storm)

Data Modeling

  • Fact-Dimensional Warehouses
  • Slowly Changing Dimensions
  • Star Schema, Snowflake Schema
  • Index Tuning
  • Query Tuning
  • Transaction Processing: Locking and Blocking
  • OLTP vs OLAP
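
As an illustration of the Slowly Changing Dimensions item above, here is a minimal sketch of a Type 2 update: the old row is closed out with an end date and a new current row is appended, so history is preserved. The `CustomerRow` fields and the `apply_scd2` helper are hypothetical names made up for this example; a real warehouse would usually do this in SQL or in an ETL framework.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    """One version of a customer in a (hypothetical) dimension table."""
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(rows: List[CustomerRow], customer_id: int,
               new_city: str, as_of: date) -> List[CustomerRow]:
    """Return a new table with the change applied as a Type 2 update."""
    result = []
    changed = False
    found_current = False
    for row in rows:
        is_current = row.customer_id == customer_id and row.valid_to is None
        if is_current:
            found_current = True
            if row.city != new_city:
                # Close out the old version instead of overwriting it.
                result.append(replace(row, valid_to=as_of))
                changed = True
                continue
        result.append(row)
    if changed or not found_current:
        # Append the new current version (or the first version of a new key).
        result.append(CustomerRow(customer_id, new_city, valid_from=as_of))
    return result
```

Calling `apply_scd2(rows, 1, "Denver", date(2021, 6, 1))` on a table where customer 1 currently lives in another city yields two rows for that customer: the closed historical row and the new current one. Applying the same value again leaves the table unchanged.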

Data Warehousing Architectures

  • Lambda Architecture
  • Kappa Architecture
  • Batch
  • Mini-Batch
  • Streaming
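
To make the Batch / Mini-Batch / Streaming distinction above a little more concrete, here is a small sketch that groups an incoming iterable of records into fixed-size chunks. The `mini_batches` helper is a hypothetical name for this example; with a very large `batch_size` it behaves like batch processing, and with `batch_size=1` it approximates record-at-a-time streaming.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def mini_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size records until the input is exhausted."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls at most batch_size items without reading the whole input.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example, `list(mini_batches(range(7), 3))` produces three chunks, the last one shorter than the others. Because the input is consumed lazily, the same function works on a file handle or a generator that never fits in memory at once.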

Business Intelligence Tools

  • Click and Drag (e.g. Looker)
  • SQL-based (e.g. Tableau, Looker, Mode, Periscope)
  • SQL/Python/R-based (e.g. Jupyter, Mode)

Data Warehouse Serving Layers

  • Redshift
  • BigQuery
  • Snowflake
  • RDBMS (e.g. AWS RDS, Google Cloud SQL)

Other Important Development Tools

  • Docker
  • CI/CD (CircleCI, TravisCI, or Jenkins)
  • Pytest (or Unittest)
  • Tox

CLI Tools

  • AWS CLI
  • Bash (Awk, Grep, Sed)
  • Click
  • Argparse
  • Python-Fire
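
As a rough idea of what a small pipeline CLI looks like with Argparse from the list above, here is a minimal sketch using only the standard library. The program name `load_table`, its arguments, and the `main` function are hypothetical and exist only for illustration; Click and Python-Fire cover the same ground with different APIs.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Define the arguments for a hypothetical table-loading job."""
    parser = argparse.ArgumentParser(
        prog="load_table",
        description="Load one table for a given partition date.",
    )
    parser.add_argument("table", help="name of the table to load")
    parser.add_argument("--date", default=None,
                        help="partition date, e.g. 2024-01-01 (default: latest)")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be loaded without doing it")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    """Parse argv (or sys.argv[1:] when argv is None) and report the plan."""
    args = build_parser().parse_args(argv)
    mode = "dry run" if args.dry_run else "load"
    message = f"{mode}: table={args.table} date={args.date or 'latest'}"
    print(message)
    return message
```

Passing an explicit list, as in `main(["orders", "--dry-run"])`, makes the CLI easy to exercise from Pytest without spawning a subprocess; in a script you would call `main()` with no arguments so it reads the real command line.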

A quick preview of some favorite tools and frameworks:

ETL Articles

Links