- Dagster
- sbt
- Papermill
- Airflow
- Luigi
- Oozie
- Prefect
- Joblib
- Dask - Parallel computing with task scheduling https://dask.org
- aws-data-wrangler
- Glue
- Data Pipeline
- Lambda
- Step Functions
A list of tools and concepts under the umbrella of Data Engineering.
- AWS resources:
  - S3
  - EC2
  - RDS
  - Redshift
  - EMR
  - Kinesis
  - Athena
  - Lambda
  - VPC
  - Glue
  - SageMaker
- AWS tools:
  - Boto3 (Python)
  - AWS CLI
- Cron, Airflow, Oozie, Luigi, and/or AWS Step Functions
- Scheduling/Workflows: Airflow, Oozie, Luigi, Cron, and/or AWS Step Functions
- Spark
- Data Transformation: Pandas, Dask
- ML Pipelines: Numpy, Scikit-Learn
- Python
- Bash
- SQL
- Optional: Scala (for Spark), Java (for Spark, Kafka, or Storm)
- Fact-Dimensional Warehouses
- Slowly Changing Dimensions
- Star Schema, Snowflake Schema
- Index Tuning
- Query Tuning
- Transactional Processing: Lock and Block
- OLTP vs OLAP
- Lambda Architecture
- Kappa Architecture
- Batch
- Mini-Batch
- Streaming
- Click and Drag (e.g. Looker)
- SQL Based (e.g. Tableau, Looker, Mode, Periscope)
- SQL/Python/R based (e.g. Jupyter, Mode)
- Redshift
- BigQuery
- Snowflake
- RDBMS (e.g. AWS RDS, Google Cloud SQL)
- Docker
- CI/CD (CircleCI, TravisCI, or Jenkins)
- Pytest (or Unittest)
- Tox
- AWS CLI
- Bash (Awk, Grep, Sed)
- Click
- Argparse
- Python-Fire
A quick preview of some favorite tools and frameworks:
- Python
  - Airflow
  - Spark
  - Dask
  - Pandas
  - Boto3 (AWS SDK)
  - Flask
  - Pyramid
  - Scikit-Learn
  - TensorFlow
  - Apache Arrow - PyArrow
- Jupyter Notebooks
  - JupyterHub
- DockerHub
- Databases
  - Postgres
  - Redshift
  - Presto
  - MapD (the database)
  - Elasticsearch (index/search engine)
- Business Intelligence Tools
  - MapD Database, Charting, Rendering Engine by OmniSci
  - Superset
  - ELK Stack (Kibana)
  - Bokeh
  - Plotly
  - Dash
- Data Visualization Projects
Links
- https://dwbi.org/etl
- https://dwbi.org/etl/etl/54-incremental-loading-for-dimension-table
- https://gtoonstra.github.io/etl-with-airflow/datavault.html
- https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
- https://gtoonstra.github.io/etl-with-airflow/datavault2.html
- https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/
- https://gtoonstra.github.io/etl-with-airflow/platform.html
- https://panoply.io/data-warehouse-guide/etl-tutorial/
- https://stefdata.github.io/
- https://medium.com/@natekupp/getting-started-the-3-stages-of-data-infrastructure-556dac82e825
- https://tombreur.wordpress.com/2017/04/30/the-past-and-future-of-dimensional-modeling/
- https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71
- https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7