We will start the Spark Streaming job on the Dataproc cluster we created; it communicates with the Kafka VM instance over port 9092. Remember, we opened port 9092 so that the instance can accept these connections.
Establish SSH connection to the master node

```bash
ssh streamify-spark
```
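This assumes the `streamify-spark` SSH alias was configured earlier in the setup. If it is not in your SSH config, a `gcloud` fallback would look roughly like this (a sketch: the cluster name and zone below are hypothetical, and it relies on Dataproc's default master naming of `<cluster-name>-m`):

```bash
# Hypothetical cluster name and zone -- substitute your own values.
# Dataproc names the master node <cluster-name>-m by default.
gcloud compute ssh streamify-cluster-m --zone=us-central1-a
```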
Clone git repo

```bash
git clone https://github.com/kprakhar27/streamify.git && \
cd streamify/spark_streaming
```
Set the environment variables -

- External IP of the Kafka VM, so that Spark can connect to it
- Name of your GCS bucket (the one you gave during the Terraform setup)

```bash
export KAFKA_ADDRESS=IP.ADD.RE.SS
export GCP_GCS_BUCKET=bucket-name
```
Note: You will have to set these environment variables every time you create a new shell session, or whenever you stop and restart your cluster.
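To avoid retyping the exports in every session, you can persist them in your shell profile, and optionally confirm that the broker is reachable on the port we opened. This is a convenience sketch; it assumes `nc` is installed on the master node:

```bash
# Persist the variables so future shell sessions pick them up.
echo "export KAFKA_ADDRESS=${KAFKA_ADDRESS}" >> ~/.bashrc
echo "export GCP_GCS_BUCKET=${GCP_GCS_BUCKET}" >> ~/.bashrc

# Quick reachability check against the Kafka broker on port 9092.
nc -zv "$KAFKA_ADDRESS" 9092
```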
Start reading messages

```bash
spark-submit \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
    stream_all_events.py
```
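The connector coordinate in `--packages` must match the cluster's Spark and Scala builds (here Spark 3.1.2 on Scala 2.12). If the job fails while resolving the package, checking the installed Spark version is a reasonable first step:

```bash
# Prints the Spark and Scala versions on the master node; the
# spark-sql-kafka package version should match the Spark version.
spark-submit --version
```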
If all went right, you should see new parquet files in your bucket! That is Spark writing a file every two minutes for each topic.
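To confirm, you can list the bucket from the master node (the exact directory layout depends on how the script names its output paths):

```bash
# Recursively list the bucket; expect parquet files grouped per topic,
# with fresh files landing roughly every two minutes.
gsutil ls -r "gs://${GCP_GCS_BUCKET}" | head -n 20
```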
Topics we are reading from