We will set up Airflow on Docker in a dedicated compute instance. dbt is set up inside the Airflow containers.
Establish an SSH connection to the VM:

    ssh streamify-airflow
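The streamify-airflow host alias is expected to come from your local SSH config (typically set up in an earlier step). As a minimal sketch, with the IP, username and key path as placeholders you need to fill in:

    # ~/.ssh/config (on your local machine)
    Host streamify-airflow
        HostName <vm-external-ip>     # external IP of the compute instance
        User <your-username>
        IdentityFile ~/.ssh/<your-key>  # private key used to create the VM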
Clone the git repo:

    git clone https://github.com/kprakhar27/streamify.git && \
    cd streamify
Install Anaconda, Docker and docker-compose:

    bash ~/streamify/scripts/vm_setup.sh && \
    exec newgrp docker
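The exec newgrp docker at the end picks up the new docker group membership without logging out again. As a rough illustration only (not the actual contents of vm_setup.sh), a setup script like this typically does something along these lines:

    # Illustrative only -- roughly what a VM setup script like this does
    sudo apt-get update
    sudo apt-get install -y docker.io docker-compose   # install Docker and docker-compose
    sudo usermod -aG docker "$USER"                     # allow running docker without sudo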
Copy the service account JSON file from your local machine to the ~/.google/credentials/ directory on the VM. Make sure it is named google_credentials.json, otherwise the dags will fail!
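One way to do this is with scp from your local machine; the local path below is a placeholder for wherever your key file lives:

    # run from your local machine; adjust the local path to your key file
    ssh streamify-airflow "mkdir -p ~/.google/credentials"
    scp /path/to/service-account-key.json streamify-airflow:~/.google/credentials/google_credentials.json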
Set the environment variables (same as the Terraform values) for the GCP Project ID and the Cloud Storage Bucket Name:

    export GCP_PROJECT_ID=project-id
    export GCP_GCS_BUCKET=bucket-name
Note: You will have to set these env vars every time you create a new shell session.
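If you'd rather not re-export them every session, one option (an assumption about your shell setup, not part of the project's scripts) is to append them to ~/.bashrc:

    # persist the variables for future shell sessions
    echo 'export GCP_PROJECT_ID=project-id' >> ~/.bashrc
    echo 'export GCP_GCS_BUCKET=bucket-name' >> ~/.bashrc
    source ~/.bashrc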
Start Airflow. (This will take a good few minutes, grab a coffee!)

    bash ~/streamify/scripts/airflow_startup.sh && cd ~/streamify/airflow
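Once the script finishes, you can check that the containers came up healthy from the ~/streamify/airflow directory (assuming that is where the docker-compose file lives):

    # list the Airflow containers and their health status
    docker-compose ps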
Airflow should be available on port 8080 a couple of minutes after the above setup is complete. Login with airflow as both the default username and password.
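If you haven't exposed port 8080 on the VM, one way to reach the UI from your local browser is SSH port forwarding (assuming the same streamify-airflow host alias as above):

    # forward the VM's port 8080 to localhost:8080 on your machine
    ssh -L 8080:localhost:8080 streamify-airflow

Airflow will then be reachable at http://localhost:8080.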
Airflow will be running in detached mode. To follow the container logs, run:

    docker-compose logs --follow
To stop Airflow:

    docker-compose down
The setup has two dags:

load_songs_dag - Run this dag first so that the songs table is available for the transformations.

streamify_dag - This dag will run hourly at the 5th minute and perform the transformations to create the dimension and fact tables.
The transformations happen using dbt, which is triggered by Airflow. The dbt lineage should look something like this:
Dimensions:
dim_artists
dim_songs
dim_datetime
dim_location
dim_users
Facts:
fact_streams
Finally, we create a wide_stream view to aid dashboarding.
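Since dbt runs inside the Airflow containers, you can also kick off the transformations manually outside the scheduled dag. Something along these lines may work from ~/streamify/airflow; the service name and directories are assumptions about the compose setup, so adjust them to match your docker-compose.yaml and dbt project layout:

    # run the dbt models inside the Airflow scheduler container
    # (service name and paths are assumptions -- check your docker-compose.yaml)
    docker-compose exec airflow-scheduler \
        dbt run --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt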