The first newsletter publication and what to expect from Data Engineer Expert.
What is going to happen for the remainder of 2024
Welcome, everybody! I have finally decided to use Substack to publish my newsletter! Some of you might know it as “Stay Ahead Of The Machine Learning Curve”. Thank you for your patience; some of you “subscribed” to my newsletter over a year ago!
I’ll be committing to writing these biweekly or monthly going forward, because I believe the short-form content on LinkedIn can’t reach the level of depth that I can cover here.
The articles I publish here are aimed at experienced Data Engineers and Architects who work with data analytics platforms. The Data Engineer Expert newsletter is also aimed at AI leaders looking to get a better understanding of trends in the rapidly changing landscape.
My plan for the rest of 2024
My plan is to share some of my experience with Azure Databricks. I aim to release articles on setting up a near production-ready Databricks environment in Azure. I have previously implemented such platforms and frameworks with great success and value to the companies involved, and I would like to share the recipe for establishing production-ready data analytics platforms in the cloud.
Here is a brief content draft (it is not complete, many of the topics will extend into 2025, and the topics might change over time).
In the end, a setup depends deeply on the needs of the company and the team that will work on it. There is therefore almost no right or wrong; however, there are some common aspects to every setup, and I will try to cover these. I will also cover what you can do as a leader (enabler) to help such an initiative succeed.
The following list is not complete, and you can expect other articles to be published in between the topics below.
Content (draft)
1. Create and configure a VNet, subnet, and NSG using Bicep.
2. Create and configure a storage account and connect it to the VNet and subnet using Bicep.
3. Create a Databricks workspace in Azure and connect it to the VNet and subnet using Bicep.
4. Configure a Databricks Pool using Python and the Databricks REST API.
5. Configure a Databricks Cluster using Python and the Databricks REST API (see the first sketch after this list).
6. Add environment variables to a Databricks Cluster using Python and the Databricks REST API.
7. Add Python dependencies to a Databricks Cluster using Python and the Databricks REST API.
8. Add an Azure Service Principal to Databricks using Python and the Databricks REST API.
9. Create an Azure Key Vault-backed secret scope in Databricks using Python and the Databricks REST API.
10. Set secrets in Azure Key Vault using the REST API and a service principal.
11. Retrieve secrets from a Databricks secret scope connected to Azure Key Vault using a Databricks Notebook.
12. Mount an Azure Data Lake storage account on a Databricks cluster from a Databricks Notebook.
13. Create a database (schema) in the Metastore in Databricks from a Databricks Notebook.
14. Register an External Table in the Metastore in Databricks from a Databricks Notebook.
15. Load data from an External Table into a PySpark DataFrame using a Databricks Notebook.
16. Create a Delta Table from a PySpark DataFrame using a Databricks Notebook (see the second sketch after this list).
17. Append a PySpark DataFrame to an existing Delta Table using a Databricks Notebook.
18. Overwrite an existing Delta Table from a PySpark DataFrame using a Databricks Notebook.
19. Optimize a Delta Table using a Databricks Notebook.
20. Enable PySpark DataFrame caching using a Databricks Notebook.
21. Create a Databricks Workflow from the Databricks Workspace.
22. Create a Databricks Workflow using Python and the Databricks REST API.
23. Change a Databricks Workflow owner and run-as using Python and the Databricks REST API.
24. Trigger a Databricks Workflow from Azure Data Factory using the Databricks REST API.
25. Send parameter values from Azure Data Factory to a Databricks Workflow using the Databricks REST API.
26. Mix SQL and PySpark when working with Spark DataFrames in a Databricks Notebook.
27. Trigger a Databricks Workflow from a Databricks Notebook using the Databricks REST API.
28. Send parameter values to a Databricks Workflow when triggering it from a Databricks Notebook using the Databricks REST API.
29. Deduplicate data and insert the deduplicated data from a PySpark DataFrame into an existing Delta Table.
30. Update an existing Delta Table and create type 2 history using a PySpark merge from a Databricks Notebook.
31. Create controlled Databricks Notebook and Workflow deployments to a Databricks Workspace using Python and the Databricks REST API.
32. Configure a Databricks Cluster to support distributed geospatial computation using SQL in a Databricks Notebook.
33. Work with geospatial data using Spark SQL in a Databricks Notebook.
34. Work with geospatial data using User Defined Functions and PySpark in a Databricks Notebook.
… more to come …
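To give you a small taste of what these articles will look like, here is a minimal sketch for topic 5: creating a Databricks cluster with Python and the Databricks REST API. The workspace URL and token come from environment variables, and the runtime version and VM size are placeholder examples rather than recommendations; the full article will go into the details.

```python
import os

import requests

# Hypothetical values supplied via environment variables, e.g.
# DATABRICKS_HOST  = "https://adb-1234567890123456.7.azuredatabricks.net"
# DATABRICKS_TOKEN = a personal access token or Azure AD token
HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]


def create_cluster(name: str) -> str:
    """Create a small all-purpose cluster and return its cluster_id."""
    payload = {
        "cluster_name": name,
        "spark_version": "14.3.x-scala2.12",  # example Databricks runtime
        "node_type_id": "Standard_DS3_v2",     # example Azure VM size
        "num_workers": 2,
        "autotermination_minutes": 30,         # avoid paying for an idle cluster
    }
    response = requests.post(
        f"{HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["cluster_id"]


if __name__ == "__main__":
    print(create_cluster("newsletter-demo-cluster"))
```

The same pattern (build a JSON payload, POST it to the workspace, check the response) carries over to pools, workflows, and secret scopes, which is why so many of the topics above lean on Python and the REST API.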
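And a second minimal sketch, for topic 16: creating a Delta Table from a PySpark DataFrame. It assumes a Databricks notebook, where a SparkSession called `spark` is already available, and the schema and table names are made up for illustration.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; this line is harmless there.
spark = SparkSession.builder.getOrCreate()

# A tiny example DataFrame; in practice this would come from an External Table
# or another source (topic 15).
df = spark.createDataFrame(
    [(1, "alpha"), (2, "beta")],
    ["id", "name"],
)

# Write the DataFrame as a Delta Table registered in the Metastore.
# Assumes the schema demo_schema already exists (topic 13).
(
    df.write
      .format("delta")
      .mode("overwrite")  # replace the table if it already exists
      .saveAsTable("demo_schema.demo_delta_table")
)
```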
Thanks so much for reading my very first newsletter! I’m excited to be sharing more of my learnings with you as I continue this “entrepreneurship” journey! Follow me on LinkedIn (https://www.linkedin.com/in/vedranmarkulj/).