<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[DataEngineerExpert.com Newsletter]]></title><description><![CDATA[A newsletter dedicated to data engineering and technology trends. Interesting for Engineers and Leaders.]]></description><link>https://blog.dataengineerexpert.com</link><image><url>https://substackcdn.com/image/fetch/$s_!1_QY!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba7b3f1b-4996-42e0-bb5a-0c846d6300c4_421x421.png</url><title>DataEngineerExpert.com Newsletter</title><link>https://blog.dataengineerexpert.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:41:52 GMT</lastBuildDate><atom:link href="https://blog.dataengineerexpert.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Vedran Markulj]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dataengineerexpert@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dataengineerexpert@substack.com]]></itunes:email><itunes:name><![CDATA[Vedran Markulj]]></itunes:name></itunes:owner><itunes:author><![CDATA[Vedran Markulj]]></itunes:author><googleplay:owner><![CDATA[dataengineerexpert@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dataengineerexpert@substack.com]]></googleplay:email><googleplay:author><![CDATA[Vedran Markulj]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How to deploy and configure a routing table, vnet, nsg (including nsg rules), and subnets using IaC for a production-ready data platform]]></title><description><![CDATA[#02 - This article is a deep dive on how to configure
a network layer for your data platform in Azure and how to deploy it using Azure Bicep (IaC).]]></description><link>https://blog.dataengineerexpert.com/p/how-to-deploy-and-configure-network-layer</link><guid isPermaLink="false">https://blog.dataengineerexpert.com/p/how-to-deploy-and-configure-network-layer</guid><dc:creator><![CDATA[Vedran Markulj]]></dc:creator><pubDate>Mon, 05 Aug 2024 05:45:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/98a9b947-8ab0-449f-b959-9853b0f25b72_853x607.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before we dive into the details, here is why creating a network layer is important: don&#8217;t let anyone tell you that it is sufficient to use the managed networking built into many Azure services. By creating a network layer that we fully control, we are preparing for a larger enterprise setup such as a Hub and Spoke network architecture. In that context, the network layer we are going to create, which houses the platform we want to build, is a single spoke in a larger enterprise network landscape.</p><blockquote><p>DISCLAIMER: I am committed to writing quality articles for my subscribers. This means there will be a lot of detail, and I cannot promise that any of it will be easy to set up. It also means these articles are very different from what you will find elsewhere: many guides merely introduce topics and give examples that cannot be used in any sort of production environment. The articles you will find here are different.</p></blockquote><p></p><h3><strong>Infrastructure-as-code (IaC)</strong></h3><p>The IaC consists of Bicep files applied to the Azure infrastructure by submitting them to Azure Resource Manager (ARM) using DevOps pipelines. While this is the ideal solution, and one we will cover, we will skip it for now and run the Bicep scripts from our local machine.
I want to skip the DevOps automation pipelines for now because they would take focus away from the main objective of this article: understanding how to create a network layer infrastructure that can be used in development, testing, and production environments, and how to do that using Azure Bicep.</p><p></p><h3><strong>Create the vnet, subnet, and nsg</strong></h3><blockquote><p>DISCLAIMER: The following commands are for local, manual deployment and should only be used during development. We will go into the DevOps pipeline automation of the IaC in a later article. But if you are up for it: these manually runnable commands are practically the recipe for the commands a DevOps pipeline should run to automate the deployment process.</p></blockquote><p></p><h3><strong>Prerequisites</strong></h3><ul><li><p>You need to have access to an Azure Subscription.</p></li><li><p>You need to have the Azure CLI installed on your local machine.</p></li><li><p>You need to have created a Service Principal with a Secret.</p></li><li><p>The Service Principal needs Contributor rights on the Subscription.</p></li></ul><p></p><h3><strong>Access the subscription</strong></h3><p>Here is how you can access your subscription from the Azure CLI. Because we want to be able to automate the deployment in the future, we will not authenticate with the Azure CLI using a personal user; instead, let&#8217;s authenticate using a Service Principal.</p>
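<p>To make the target concrete before the detailed walkthrough, here is a rough, hypothetical Bicep sketch of the kind of resources this article builds up: a vnet with one subnet guarded by an nsg. The resource names, API versions, and address ranges are illustrative placeholders, not the article&#8217;s actual template:</p>

```bicep
// Illustrative sketch only: a vnet with one subnet guarded by an NSG.
// Names, API versions, and address ranges are placeholders.
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-04-01' = {
  name: 'nsg-dataplatform'
  location: resourceGroup().location
  properties: {
    securityRules: [] // real rules are covered in detail in the article
  }
}

resource vnet 'Microsoft.Network/virtualNetworks@2023-04-01' = {
  name: 'vnet-dataplatform'
  location: resourceGroup().location
  properties: {
    addressSpace: { addressPrefixes: ['10.0.0.0/16'] }
    subnets: [
      {
        name: 'snet-databricks'
        properties: {
          addressPrefix: '10.0.1.0/24'
          networkSecurityGroup: { id: nsg.id } // attach the NSG to the subnet
        }
      }
    ]
  }
}
```

<p>A template like this can be deployed from a local machine with <code>az deployment group create --resource-group &lt;rg&gt; --template-file network.bicep</code>, which is also the command a DevOps pipeline would eventually run for you.</p>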
      <p>
          <a href="https://blog.dataengineerexpert.com/p/how-to-deploy-and-configure-network-layer">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[How to deploy a production-ready data platform centered around Databricks and Azure services.]]></title><description><![CDATA[This post is an introduction and overview of what we want to set up: a production-ready data platform, consisting of infrastructure and a framework, using Azure services and Databricks]]></description><link>https://blog.dataengineerexpert.com/p/deploy-a-production-ready-data-platform</link><guid isPermaLink="false">https://blog.dataengineerexpert.com/p/deploy-a-production-ready-data-platform</guid><dc:creator><![CDATA[Vedran Markulj]]></dc:creator><pubDate>Sun, 21 Jul 2024 14:47:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4b4f9e0a-cc38-4228-9371-1d0e99faf2d5_853x607.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The goal is to deploy and configure a production-ready data platform with an Azure Databricks Workspace at its heart, an Azure Storage account as the data lake for the medallion architecture (bronze, silver, and gold data layers), and Azure Key Vault for secrets, securing everything with a Vnet, Subnets, and a Network Security Group in Azure.
Beyond the infrastructure, I will also cover the deployment of a framework, so that Databricks is fully configured and ready for a Data Engineer or Data Scientist to work with.</p>
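<p>As a taste of what the series will cover, the medallion storage layout can be declared in a few lines of Bicep. This is a minimal, hypothetical sketch under assumed names and API versions, not the final template from the series:</p>

```bicep
// Illustrative sketch: an ADLS Gen2 storage account with one container
// per medallion layer (bronze, silver, gold). All names are placeholders.
resource lake 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stdataplatformlake'
  location: resourceGroup().location
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    isHnsEnabled: true // hierarchical namespace, required for a data lake
  }
}

resource blobService 'Microsoft.Storage/storageAccounts/blobServices@2023-01-01' = {
  parent: lake
  name: 'default'
}

// One container per medallion layer, created with a Bicep loop.
resource layers 'Microsoft.Storage/storageAccounts/blobServices/containers@2023-01-01' = [for layer in ['bronze', 'silver', 'gold']: {
  parent: blobService
  name: layer
}]
```

<p>The series itself also wires this account into the vnet and subnets, which is what turns a plain storage account into a production-ready data lake.</p>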
      <p>
          <a href="https://blog.dataengineerexpert.com/p/deploy-a-production-ready-data-platform">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[The first newsletter publication and what to expect from Data Engineer Expert.]]></title><description><![CDATA[What is going to happen for the remainder of 2024]]></description><link>https://blog.dataengineerexpert.com/p/dataengineer-expert-first-newsletter-publication</link><guid isPermaLink="false">https://blog.dataengineerexpert.com/p/dataengineer-expert-first-newsletter-publication</guid><dc:creator><![CDATA[Vedran Markulj]]></dc:creator><pubDate>Thu, 11 Jul 2024 19:20:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba7b3f1b-4996-42e0-bb5a-0c846d6300c4_421x421.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome everybody! I have finally decided to use Substack to publish my newsletter! Some of you might know it as &#8220;Stay Ahead Of The Machine Learning Curve&#8221;. Thank you for your patience, as some of you &#8220;subscribed&#8221; to my newsletter over a year ago!</p><p>I&#8217;ll be committing to writing these biweekly or monthly going forward, because I believe the short-form content on LinkedIn can&#8217;t cover the same level of depth that I can cover here.</p><p>The articles I will publish here are aimed at experienced Data Engineers and Architects who work with data analytics platforms. The Data Engineer Expert newsletter is also aimed at AI leaders looking to get a better understanding of trends in the rapidly changing landscape.</p><p></p><h3>My plan for the rest of 2024</h3><p>My plan is to share some of my experience with Azure Databricks. I aim to release articles about setting up a near production-ready Databricks environment in Azure.
I have previously implemented such platforms and frameworks with great success and value to the companies involved, and I would like to share the recipe for establishing production-ready data analytics platforms in the cloud.</p><p></p><p>Here is a brief content draft (it is not complete, many of the topics will extend into 2025, and the topics might change over time).</p><p>In the end, a setup depends deeply on the needs of a company and the team that will work on it. Therefore there is almost no right or wrong; however, there are some common aspects to each setup. I will try to cover these. I will also cover what you can do as a leader (enabler) to help such an initiative succeed.</p><p>The following list is not complete, and you can expect other articles to be published in between the topics below.</p><p>Content (draft):</p><ol><li><p>Create and configure a vnet, subnet, and nsg using Bicep.</p></li><li><p>Create and configure a storage account and connect it to the vnet and subnet using Bicep.</p></li><li><p>Create a Databricks workspace in Azure and connect it to the vnet and subnet using Bicep.</p></li><li><p>Configure a Databricks Pool using Python and the Databricks REST API.</p></li><li><p>Configure a Databricks Cluster using Python and the Databricks REST API.</p></li><li><p>Add environment variables to a Databricks Cluster using Python and the Databricks REST API.</p></li><li><p>Add Python dependencies to a Databricks cluster using Python and the Databricks REST API.</p></li><li><p>Add an Azure Service Principal to Databricks using Python and the Databricks REST API.</p></li><li><p>Create an Azure Key Vault backed secret scope in Databricks using Python and the Databricks REST API.</p></li><li><p>Set secrets in Azure Key Vault using the REST API and a service principal.</p></li><li><p>Retrieve secrets from a Databricks secret scope connected to Azure Key Vault using a Databricks Notebook.</p></li><li><p>Mount an Azure data lake storage account on a Databricks cluster from a Databricks Notebook.</p></li><li><p>Create a database (schema) in the Metastore in Databricks from a Databricks Notebook.</p></li><li><p>Register an External Table in the Metastore in Databricks from a Databricks Notebook.</p></li><li><p>Load data from an External Table into a PySpark dataframe using a Databricks Notebook.</p></li><li><p>Create a Delta Table from a PySpark dataframe using a Databricks Notebook.</p></li><li><p>Append a PySpark dataframe to an existing Delta Table using a Databricks Notebook.</p></li><li><p>Overwrite an existing Delta Table from a PySpark dataframe using a Databricks Notebook.</p></li><li><p>Optimize a Delta Table using a Databricks Notebook.</p></li><li><p>Enable PySpark dataframe caching using a Databricks Notebook.</p></li><li><p>Create a Databricks Workflow from the Databricks Workspace.</p></li><li><p>Create a Databricks Workflow using Python and the Databricks REST API.</p></li><li><p>Change the Databricks Workflow owner and run-as using Python and the Databricks REST API.</p></li><li><p>Trigger a Databricks Workflow from Azure Data Factory using the Databricks REST API.</p></li><li><p>Send parameter values from Azure Data Factory to a Databricks Workflow using the Databricks REST API.</p></li><li><p>Mix SQL and PySpark when working with Spark dataframes in a Databricks Notebook.</p></li><li><p>Trigger a Databricks Workflow from a Databricks Notebook using the Databricks REST API.</p></li><li><p>Send parameter values to a Databricks Workflow when triggering it from a Databricks Notebook using the Databricks REST API.</p></li><li><p>Deduplicate data and insert the deduplicated data from a PySpark dataframe into an existing Delta Table.</p></li><li><p>Update an existing Delta Table and create type-two history using a PySpark merge from a Databricks Notebook.</p></li><li><p>Create controlled Databricks Notebook and workflow deployments to a Databricks Workspace using Python and the Databricks REST API.</p></li><li><p>Configure a Databricks Cluster to support distributed geospatial computation using SQL in a Databricks notebook.</p></li><li><p>Work with geospatial data using Spark SQL in a Databricks notebook.</p></li><li><p>Work with geospatial data using User Defined Functions and PySpark in a Databricks notebook.</p></li></ol><p>&#8230; more to come &#8230;</p><p></p><p>Thanks so much for reading my very first newsletter! I&#8217;m excited to be sharing more of my learnings with you as I continue this &#8220;entrepreneurship&#8221; journey! Follow me on LinkedIn (<a href="https://www.linkedin.com/in/vedranmarkulj/">https://www.linkedin.com/in/vedranmarkulj/</a>).</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.dataengineerexpert.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.dataengineerexpert.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>