Mikkel84
11/10/2018 - 11:46 AM

[GCP Dataflow Pipeline with Python] #GCP #Google #Python #Dataflow #CloudShell #DataAnalysis

# get account info
gcloud auth list

# get projectID
gcloud config list project

# download data
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
  
# click File > Refresh so that you can see the data appearing in the Cloud Shell
# Create a Bucket if there is none (STORAGE > BROWSER > CREATE BUCKET)
# Verify that GC Dataflow API is enabled: APIs & Services > Dashboard > Google Dataflow API > Enable

# Open the Dataflow project
# This requires installing Apache Beam (an open-source platform for executing data processing workflows)
cd ~/training-data-analyst/courses/data_analysis/lab2/python
sudo ./install_packages.sh
pip -V # should be > 8.0
# File > Refresh (in Cloud Shell)

# Pipeline Filtering

# Run the filter pipeline locally
# Run the filter pipeline on the cloud (Dataflow)
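The filtering step above can be sketched in plain Python. This is a minimal sketch of the grep-style filter the lab's pipeline applies (read lines, keep only those containing a search term); in the actual lab this logic runs inside an Apache Beam pipeline via transforms like `beam.FlatMap`, locally with the DirectRunner or on the cloud with the DataflowRunner. The function name, sample data, and search term here are illustrative assumptions, not the lab's exact code.

```python
# Plain-Python sketch of the grep-style filter the Dataflow lab builds.
# In the real pipeline this predicate would sit inside a Beam transform;
# names and sample data below are illustrative.

def filter_lines(lines, search_term):
    """Keep only the lines that contain search_term."""
    return [line for line in lines if search_term in line]

if __name__ == "__main__":
    # Hypothetical Java source lines, standing in for the lab's input files
    java_source = [
        "import java.util.List;",
        "public class Demo {",
        "import java.io.File;",
        "}",
    ]
    for line in filter_lines(java_source, "import"):
        print(line)
```

Running locally versus on the cloud changes only where this logic executes: the transform itself stays the same, and the pipeline's runner option selects the execution environment.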