Create near-real time Data Dashboard using Azure Databricks

Azure Databricks is a great tool to set up a streaming application where a user can get insight to some data either in real-time or near rear-time. In this blog series we build a streaming application to get real-time road traffic information from Finnish Transport Agency (FTA) open data API. The data is ingested by a custom container application made with .NET Core, processed by Azure Databricks via Event Hub, and then visualized with Microsoft Power BI tool.

Creating a streaming application in Azure

There are several ways to describe and implement the architecture for near-real-time pipelines. For example code for Data Ingestion (the code interfacing with the external API)  can be implemented as Azure Functions App, Web App, or as A Logic App, if a low code solution is preferred.

In this demo we ended up implementing it as a .NET Core application running as a docker container hosted in an Azure Container Instance.

 Why this choice? There are several benefits, the most important one is that using own code gives us the most flexible way to enhance architecture in future. For example, if we want to create a resilient distributed architecture with Kubernetes, the migration path is easy. At minimum we only need to move the container instance to Kubernetes, and if we want to split the code to several containers, that can be easily done with good designed .Net Core application. Using .NET Core gives us also a flexible environment with basically unlimited opportunities to enhance functionality.

After data has been ingested, we send data to Azure Event Hub, which is an excellent Azure tool to passing events between the components. Azure Event Hub can also persist events up to 7 days, which increases the resiliency of the overall architecture.

For Data Refinery we choose Azure Databricks Service. Databricks is currently the hottest data refinery tool in Azure world and Beyond. Based on Apache Spark, it provides a flexible and distributed computing platform both for data scientist and data engineers. It can act as a technical platform for complex data science problems, act as an advanced ETL tool and handle the streaming, real-time data as well. Further, Databricks supports Python, R and Scala. In this article all examples are implemented with Python For more complex scenarios you may want to use Scala as well being more performant choice.

For Data storage we utilized relatively new Databricks Delta tables. Delta tables are currently under preview in Azure platform but is already a stable feature enough to be used in this demo. When storing data to delta tables instead of external storage we reduce the overall complexity of the system, but for production use cases you may consider to use for example Azure Cosmos DB , Azure Data Lake Store gen2, or Azure SQL Data Warehouse, depending on your use case.

For publishing we use Power BI, which gets the data from Databricks Delta Tables directly, but in real production cases it may get data e.g. from Azure SQL Data Warehouse or other sources.


 So giving these options, our architecture of this demo application is as follows:

Next what we need to do is the following:

  • Setting up Azure Databricks to listen events in the Event Hub and passing the data to Databricks Delta Tables
  • Implementing the ingesting container application to interface with Digitraffic open API
  • Defining necessary tables and views in Databricks Delta Tables for easy reporting
  • Designing the Power BI visualization


Setting Up Azure Databricks

If we want to read data from Event hub, we need to get an Event Hub connector for Databricks. Event Hub connector is a open source project hosted in GitHub. It is possible of course to compile the code from the very start, but especially if you are using Python in Databricks, the best and easiest way is to install the connector from  Maven Central.

In this demo we load the libraries under the Shared folder so that the same library can be shared for different applications using Azure Event Hub.


Point your mouse on top of Shared folder as shown in Figure 2, right click and select Create -> Library

Figure 2:Creating Library to Shared Folder

Then select Library Source to be Maven, and enter Maven Coordinates for Azure Event Hub:


After you insert Maven Central Coordinates, press Create.  The system will automatically resolve the needed dependencies and download those packages as well.  Automatic dependency management is one of the main benefits of using Maven to install libraries instead of installing them manually.

Figure 3: Ready to install the connector from Maven Central

Now after we have Azure Event Hub connector installed in the Databricks, we are ready to write the code reading data from the Event Hub, transforming it, and writing to the destination sink.

Databricks supports Scala, Python and R languages. Databricks is based on Apache Spark which is implemented by using Scala, which makes typically Scala the best choice when performance is a top priority. However, In order to read from EventHub, we need an Event Hub connection string, which can be get from Event Hub blade in Azure portal.

connectionString = <connectionString here>

ehConf = {}

ehConf[‘eventhubs.connectionString’] = connectionString

The connection string is all what we need for our streaming application in this blog, but there is actually a number of options which can be used, for example if you need to start reading in the middle of the stream, you can set offset to point to a specific messages and so on.


Reading streaming data from Event Hub

After configuration has been done, it is time to write a notebook. Go to WorkSpace under Home and by rightclicking in your home directory, select Create Notebook.

Figure 4: Create Notebook

In order to connect to Event Hub, we need to provide the connection string to Event Hub. Event Hub connector accepts a variety of configuration options, for example we can define the exact sequence number or DateTime when message is enqueued. In this demo we only provide the EventHub connection string:

Figure 5: Event Hub connection string configuration

The configuration array ehConf is given as a parameter to options -method, so that streaming machine can point to the correct event hub.

We start the process by reading data flowing from Event Hub to a construct called DataFrame.  DataFrame is a Spark internal construct, which loosely is corresponding a table in a traditional database; however a spark dataframe is able to store also nested and array data in its cells.

 Dataframe is an abstract construction, and acts as a main method in the latest Spark versions to distribute data between control and worker nodes. <check terms>. In this demo what we do is that we load the data coming from Event Hub into a dataframe, do necessary transformations, and then save the transformations to a Parquet file in the Data Lake for further processing in the demo application.

The result of this process is that we can basically do a very little code, and Spark engine on the lower levels will take care of serialization/deserialization, processing and bunch of other things to get job done.

The reading of data into a dataframe is done with the following code:

Figure 6: Reading data from eventhub into a Dataframe

What the code does, is that it will call readstream method in spark context which is a default context available in all Spark notebooks, tells we are expecting data in eventhub format, then we give the connection string as options (**ehConf) notation which is basically an array, and finally we want to load the data in with load().

Now in a code level we can think that data has been loaded, but in fact Spark processing is based on lazy evaluation and creating execution plan based on the whole code, so in fact no data loading is really happening. This is a great abstraction for a developer as we do not need to solve the problem how to load possibly millions of rows into a dataframe but we can focus on the solving the actual problem.

After data has been ‘loaded’ to readDF dataframe, we need to do some processing. Payload data from the event hub is in binary format, so what we need to do is to cast it into a string format with the expression illustrated in the Figure 6. In addition to payload data, the event hub is sending some metadata, like the time when event for queued, partition related data and so on. In this demo we are interested only on the payload data, so we do select(“body”) to get only the payload into bodyDF dataframe.

Figure 7: Processing Azure Event Hub message

Now when data is read to a data frame, we could to variety of transformations for the data. In this application what we want to do is to enrich the real-time road information data with more static reference data, so that the data can be then reported meaningfully in the Power BI.

Partitioning data

When storing a large amount the data, it is important to partition the data based on some criteria. Partitioning of the data, when the partition key has been correctly defined, makes querying of data faster when amount of data is huge. This is a typical case with real-time data as data is ingested in fast speed, and we want to store all the data we get for the future analytical use.

Normally using DateTime or some variation of it consists a good candidate for a partition key, as queries typically are done over a time frame. Using DateTime as a partition key also helps to solve data issues later on, as we can locate faulty parquet files easily by browsing to a correct folder.

So How to create a partitionKey? One good candidate is to use field ‘enqueuedTime’, which is generated by the Azure Event Hub. In order to make it more usable as a partition key, we want only to use ‘Date’ part of it. To get the data part, we use pyspark function to_date():

Figure 8. Adding partition key column to a data frame


Storing the data

After creating the partition key, we are ready to store data to the delta tables.

Figure 9:  Storing data to Delta

writeStream tells Databricks to that we want to write the stream data frame to somewhere. partitionBy  tells that we want to partition the data based on some key – in this case it is the month-day combination we created in the previous phase.

outputMode sets the way how we want to write the data. When set to append only new rows are written to output sink, which is the default output mode.

option checkpointLocation tells the writer where to save checkpoint information. This is an internal bookkeeping functionality by Databricks / Spark and enables the process to recover easily in case of failures.

Then at the end we start writing with start method giving the file location as parameter. The format is delta, which is basically a set of Parquet files with some additional delta internal meta data files.

 In the following articles we then process the data further, but this is now all we need to create a functional streaming process.

In the next article we start to feed the event hub with real time data from Digitraffic service and see on how the data flows in to the Delta Parquet files for further processing.


Written by:

Harri Artinaho


Have you heard of our new solution tribe in Siili? We call it Siili Azure Studio and It is all about Microsoft Azure. And hey, we are hiring, join us to the dark side as   !

Read more

What's next

See our latest projects

Elo - Human-oriented Approach on Pension

Elo - Human-oriented Approach on Pension

Taking a customer oriented approach, continuously collecting customer feedback and iterating the service based on that – those are the main principles of building eläkepalvelu (pension service) of Elo, an employee pensio...

Upcoming events/courses

AI & Data Intelligence 2019

AI & Data Intelligence 2019

At the epicenter of technological change, AI & Data Intelligence 2019 provides a chance for all the players in the industry to learn, network, and share insight on how to stay ahead.

Other inspiring stories

  • Dad Challenge wins Miehen työ 2019 award

    • Dad Challenge wins Miehen työ 2019 award
      Siili has been awarded with Miehen työ 2019 award for Isähaaste aka Dad Challenge! Tane, the Council for Gender Equality, gives the award yearly to a person or party that has promoted gender equality especially from the point of view of men and boys. This year the emphasis was on the promotion of balance between work and family, and Dad Challenge is all about that. Dad Challenge ( is a movement that Siili has been a part of, and our CEO Timo Luhtaniemi and Senior UI Developer Joona Haavisto have actively been promoting the chal...

      Read more

  • Antero Kivikoski returns to Siili as a Creative Director

    • Antero Kivikoski returns to Siili as a Creative Director
      Antero Kivikoski fills familiar shoes as he returns to Siili as a Creative Director on May 6th. Kivikoski, who has made a long career as a designer and a creative director, will make his comeback at Siili from the position of HiQ Finland’s Chief Design Officer & Business Area director, in which he has worked for the past couple of years. “At HiQ I commercialized design processes and developed new design services, coached management teams of different companies, and acted as an advisor in several projects revolving around digital strategy and cr...

      Read more

  • Programmers, this is how you break out of your comfort zone

    • Programmers, this is how you break out of your comfort zone
      Many of us programmers have a curious personality deep inside. New things are like a toy box to us that we have to open and explore. However, daily life is not always like that. Instead of looking for new things we can end up repeating routines even when there is a need to leap outside our comfort zone. When I notice that I am getting into a rut, I start to fight against it by actively developing myself. When I surprise myself to find I am out of my comfort zone, I notice that it had arisen over time as if on its own. It was easy to work with f...

      Read more

  • Hedge - Building a new cloud deployment tool with service abstraction

    • At the start of my Siili career as an apprentice, our group was introduced to a mysterious tool: Hedge. Our apprentice project was to build an internal tool using Hedge as a cloud platform abstraction library and a deployment tool. After our apprentice program had come to an end, I was offered an opportunity to continue the development of Hedge in an internal project. Because we had already used Hedge I had some experience as a framework user. The task I was offered was to implement basic Amazon Web Services(AWS) support into Hedge.  What is th...

      Read more

Sign up

Sign up for our newsletter