[GCP Dataflow Pipeline with Python] #GCP #Google #Python #Dataflow #CloudShell #DataAnalysis
# get account info
gcloud auth list
# get projectID
gcloud config list project
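# convenience sketch (not part of the lab steps): capture the project ID in an
# environment variable for later commands; BUCKET is a placeholder name of my choosing,
# assuming you name the bucket after the project ID
export PROJECT=$(gcloud config get-value project)
export BUCKET=$PROJECT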
# download data
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
# click File > Refresh in Cloud Shell so that the cloned files appear in the file browser
# Create a Cloud Storage bucket if there is none (Storage > Browser > Create Bucket); bucket names must be globally unique
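# alternatively, a sketch of creating the bucket from Cloud Shell with gsutil
# (assumes the $BUCKET variable set above)
gsutil mb gs://$BUCKET/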
# Verify that the Dataflow API is enabled: APIs & Services > Dashboard > Dataflow API > Enable
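# the API can also be enabled from Cloud Shell (dataflow.googleapis.com is the service name)
gcloud services enable dataflow.googleapis.com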
# Open the Dataflow lab directory
# Dataflow pipelines are written with Apache Beam (an open-source platform for executing data processing workflows), so we need to install it
cd ~/training-data-analyst/courses/data_analysis/lab2/python
sudo ./install_packages.sh
pip -V # should be 8.0 or higher
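# quick sanity check that the Beam SDK was installed (assumes the install script pulls in apache-beam)
python -c "import apache_beam; print(apache_beam.__version__)"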
# File > Refresh (in Cloud Shell)
# Pipeline Filtering
# Run the filter pipeline locally
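# sketch, assuming the lab's grep.py script in this directory (it filters Java source
# lines containing 'import' and writes the result shards under /tmp)
python grep.py
cat /tmp/output-*   # inspect the local output shards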
# Run the filter pipeline on the cloud (Dataflow service)
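# sketch, assuming the lab's grepc.py script and directory layout: stage the input files
# in the bucket, set the PROJECT/BUCKET constants at the top of grepc.py, then submit the job
gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://$BUCKET/javahelp
python grepc.py
# monitor the job in the console under Dataflow, then check the results in the bucket
gsutil cat gs://$BUCKET/javahelp/output-*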