Real-Time Data Ingestion from Kafka to Snowflake
Glenn Gillen
VP of Product, GTM
If your business is using Kafka, there are already a lot of messages travelling through your brokers. If you're also a Snowflake customer, it would be great if those messages could make their way into your Snowflake data cloud. Kafka is in your private network, Snowflake is in their cloud, and getting data between them isn't entirely straightforward. It'd be much easier if those two systems could look and feel like they were next to each other.
Today, we're excited to introduce a solution that makes this vision a reality: the Pull from Kafka Connector!
Snowflake 💙 Apache Kafka
Apache Kafka (and Kafka-compatible alternatives) is the system of choice for building flexible, scalable, and reliable streaming platforms to connect data producers and consumers. It's often used as the central nervous system for real-time data pipelines and streaming applications.
Snowflake is The Data Cloud and the place to support workloads such as data warehouses, data lakes, data science / ML / AI, and even cybersecurity. This centralization brings a huge amount of convenience through breaking down data silos and allowing teams to make smart data-informed decisions.
Connecting your Kafka broker to Snowflake can be problematic depending on your network topology. It would be convenient to give the broker a public address, but that's a significant increase in risk for a system that handles a lot of important data. Managing IP allow lists and updating firewall ingress rules improves security but can be cumbersome to manage. Alternatives like PrivateLink are better, but they too can be cumbersome to set up and require your systems to be on the same public cloud and in the same region.
In this post I'm going to show you how to securely connect Snowflake to your private Apache Kafka broker, in just a few minutes. We will:
- Set up a managed Kafka cluster in AWS
- Prepare Snowflake to receive data from Kafka
- Connect Snowflake to Kafka with a private encrypted connection - without needing to expose either system to the public internet!
The final architecture diagram will look like this:
Amazon Managed Streaming for Kafka (MSK)
We're going to provision an MSK cluster so we can see an end-to-end experience of data moving from Kafka to Snowflake. If you have an existing Kafka broker you're able to use, you can skip this step.
Create an MSK cluster
Within your AWS Console, search for MSK in the search field at the top and select the matching result. Visit the Clusters screen, and then click Create Cluster.
The Quick Create option provides a good set of defaults for creating a Kafka cluster, so unless you have previous knowledge or experience that suggests you want something different, I'd suggest confirming the details and then clicking Create cluster at the bottom of the screen.
Once you've started the cluster creation it may take about 15 minutes for provisioning to complete and for your broker to be available.
Connect Snowflake to Kafka
Create a table to capture data
Run the following code in a Snowflake worksheet. It will create a new table:

```sql
CREATE OR REPLACE TABLE CUSTOMERS (
    key STRING,
    id INTEGER,
    name STRING,
    email STRING
);
```
We've created a table, Kafka is running. It's now time to connect the two! The next stage is going to complete the picture below, by creating a point-to-point connection between the two systems — without the need to expose any systems to the public internet!
Get the app
The Snowflake Pull from Kafka Connector by Ockam is available in the Snowflake Marketplace.
Select a warehouse
The first screen you're presented with will ask you to select the warehouse to use to activate the app.
Grant account privileges
Click the Grant button to the right of this screen. The app will then be automatically granted permissions to create a warehouse and create a compute pool.
Activate app
Once the permission grants complete, an Activate button will appear. Click it and the activation process will begin.
Launch app
After the app activates you'll see a page that summarizes the privileges that the application now has. There's nothing we need to review or update on these screens yet, so proceed by clicking the Launch app button.
Launch Ockam node for Amazon MSK
The Ockam Node for Amazon MSK is a streamlined way to provision a managed Ockam Node within your private AWS VPC.
To deploy the node that will allow Snowflake to reach your Kafka broker, visit the Ockam Node for Amazon MSK listing in the AWS Marketplace, click the Continue to Subscribe button, and then Continue to Configuration.
On the configuration page, choose the region that your Amazon MSK cluster is running in, and then click Continue to Launch followed by Launch.
Enter stack details
The initial Create Stack screen pre-fills the fields with the correct information for your node, so you can press Next to proceed.
Enter node configuration
This screen has important details you need to fill in:
- Stack name: Give this stack a recognisable name; you'll see this in various locations in the AWS Console. It'll also make it easier to clean these resources up later if you wish to remove them.
- VPC ID: The ID of the Virtual Private Cloud network to deploy the node in. Make sure it's the same VPC where you've deployed your Amazon MSK cluster.
- Subnet ID: Choose one of the subnets within your VPC, ensure MSK has a broker address available in that subnet.
- Enrollment ticket: Copy the contents of the kafka.ticket file we created earlier and paste it in here.
- Amazon MSK Bootstrap Server with Port: In the client configuration details for your Amazon MSK cluster you will have a list of bootstrap servers. Copy the hostname:port value for the Private endpoint that's in the same subnet you chose above.
- JSON Node Configuration: Copy the contents of the kafka.json file we created earlier and paste it in here.
We've now completed the highlighted part of the diagram below, and our Kafka broker node is waiting for another node to connect.
Generate enrollment ticket for Snowflake
One end of our connection is now set up; it's time to connect the Snowflake end. We need to generate an enrollment ticket to allow another Ockam node to join our project. This node will run in our Snowflake warehouse:
```shell
ockam project ticket \
  --usage-count 1 --expires-in 2h \
  --attribute snowflake > snowflake.ticket
```
Configure connection details
Take the contents of the snowflake.ticket file that we just created and paste it into the "Provide the above Enrollment Ticket" form field on the "Configure app" setup screen in Snowflake.
Grant privileges
To be able to authenticate with Ockam Orchestrator and then discover the route to our outlet, the Snowflake app needs to allow outbound connections to your Ockam project. Toggle the Grant access to egress and reach your Project option and approve the connection by pressing Connect.
Toggle the Grant access to the tables where Kafka messages must be inserted option and select the CUSTOMERS table.
Map Kafka topics to Snowflake tables
Enter the name of the Kafka topic you want to map to the Snowflake table. In this example, the topic is customers.
Check "Messages are encoded with a schema" if you have a schema registry and the messages are encoded with a schema. The configuration used to launch the Ockam node for Amazon MSK will need to be updated to include the schema registry details. Update $SCHEMA_REGISTRY_ADDRESS with the address of the schema registry, and make sure the Ockam node has access to the schema registry.
```json
{
  "http-server-port": 23345,
  "relay": "kafka",
  "kafka-outlet": {
    "bootstrap-server": "$BOOTSTRAP_SERVER_WITH_PORT",
    "allow": "snowflake"
  },
  "tcp-outlet": {
    "to": "$SCHEMA_REGISTRY_ADDRESS:9081",
    "from": "schema_registry",
    "allow": "snowflake"
  }
}
```
Update other options from default values if needed.
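Before pasting the configuration, it can help to substitute the placeholders and confirm the result is still valid JSON. Here's a minimal sketch of that check in Python; the endpoint addresses below are hypothetical stand-ins for your own MSK bootstrap server and schema registry:

```python
import json
from string import Template

# The node configuration with its two placeholders, as shown above.
template = Template("""
{
  "http-server-port": 23345,
  "relay": "kafka",
  "kafka-outlet": {
    "bootstrap-server": "$BOOTSTRAP_SERVER_WITH_PORT",
    "allow": "snowflake"
  },
  "tcp-outlet": {
    "to": "$SCHEMA_REGISTRY_ADDRESS:9081",
    "from": "schema_registry",
    "allow": "snowflake"
  }
}
""")

# Hypothetical private endpoints; replace with your own values.
rendered = template.substitute(
    BOOTSTRAP_SERVER_WITH_PORT="b-1.example.internal:9092",
    SCHEMA_REGISTRY_ADDRESS="registry.example.internal",
)

# json.loads raises an error if the substituted configuration is malformed.
config = json.loads(rendered)
print(config["tcp-outlet"]["to"])  # registry.example.internal:9081
```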
With that, we've completed the last step in the setup. We've now got a complete point-to-point connection that allows our Snowflake warehouse to securely pull data through to our private Kafka broker.
Seeing it in action
Any new message on your Kafka topic will now create a new row in your Snowflake table.
Post the below message to the Kafka topic to verify the setup.
Replace $BROKER_ADDRESS with your actual Kafka broker address, and ensure the topic name (customers in this example) matches the one you've configured in your Snowflake Pull from Kafka Connector setup.
```shell
echo '{"key": "customer123", "id": 1001, "name": "John Doe", "email": "john.doe@example.com"}' | \
  kafka-console-producer --broker-list $BROKER_ADDRESS:9092 --topic customers
```
The Snowflake connector will then pull these messages from Kafka and insert them into your CUSTOMERS table, mapping the JSON fields to the corresponding columns.
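Conceptually, that field-to-column mapping looks like the sketch below. This is only an illustration of the correspondence between the JSON message and the CUSTOMERS columns, not the connector's actual implementation:

```python
import json

# A message as produced to the customers topic above.
message = '{"key": "customer123", "id": 1001, "name": "John Doe", "email": "john.doe@example.com"}'

# Columns of the CUSTOMERS table created earlier.
columns = ("key", "id", "name", "email")

# Each JSON field becomes the value for the column of the same name.
record = json.loads(message)
row = tuple(record[c] for c in columns)
print(row)  # ('customer123', 1001, 'John Doe', 'john.doe@example.com')
```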
Wrap up
It's all done! In the course of a few minutes we've been able to:
- Set up an Ockam node next to our Kafka broker.
- Start an Ockam node within Snowflake.
- Establish an Ockam Portal between our nodes — a secure point-to-point connection that is mutually authenticated with end-to-end encryption, and whose encryption keys are regularly and automatically rotated.
- Then use the secure portal to consume messages from our private Kafka broker and load them directly into a Snowflake table.
We've been able to achieve all of this without the need to expose our Kafka broker to the public internet, update firewall ingress rules, setup a VPN, or manage IP allow lists.
If you'd like to explore some other capabilities of Ockam I'd recommend:
- Encrypting data through Kafka
- Pushing data from Snowflake to Kafka
- Zero-trust data streaming with Redpanda Connect
- Adding security as a feature in your SaaS product