Having a quearyable Datalake with infinite storage is the secret weapon that every manufacturing company want to improve quality, safety, and performance. But… Is it a dream only?
ADX Data Explorer: how to develop a Continuous Data Lake feed, maintaining a Query Engine loopback for long term data
Using the correct approach to feed a datalake with cold data and having the Kusto query engine at your disposal at the same time, as in the short-term storage.
When you face a process control project related to the issue of production quality, you often ask yourself how the database should be structured. This database must allow the process manager or the quality manager to carry out checks related to a claim, with the possibility of going down to the telemetry level to understand, for example, what the process dimensions are (linked to a certain point of the line production) that have originated or registered an operating anomaly.
It is therefore clear that, starting from a classic view given by a DWH, it must be possible to go down to the telemetry level, always filtering the same search dimensions.
Often the process manager and the quality manager also have to face discussions with many years of depth. An example is to compare the same process data, linked to the same recipe but to different production runs.
So how is it possible to have a database that contains long-term telemetry? Often the issue is linked to the need to have a datalake, but it is less clear how to make it usable, in terms of data exploration.
ADX Data Explorer, on the other hand, is a big data analytics platform that allows you to have both advantages: on the one hand, it contains the internal constructs to feed a datalake store in a programmatic and configurable way. On the other hand, with other constructs, it allows you to use the same uery engine (based on KUSTO) to be able to search the data in the datalake.
This article talks about how to deal with the issue of generating a datalake and its consultation, through the use of the components inherent in ADX Data Explorer.
It is important to focus on this aspect because, as exciting as it is to have a tool like ADX that is capable of ingesting very large amounts of data, it is equally important (although less exciting) to immediately define the long-term strategy in order to maintain the window. necessary data on the database and move what is really not necessary (day by day) to a storage that can in any case be consulted.
The result to be achieved will be related to the speed of implementation of the db “rejuvenation” pipeline and transparent consultation, in which database and datalake mix together.
Regarding the problem to be addressed, the correct steps to approach this issue in a complete way are listed below:
- Create an External Storage to ADX (in our case ADLS);
- Use the External Table entity to map containers (directories) registered on datalake;
- Create a continuous Export process to periodically save the oldest data of the Fact Tables (Silver Table) in the mapped Datalake, exactly in the preconfigured locations (containers) for the External Table;
- Create a query baseline that materialize short and long term Storage in a single data source, and then perform storage-seamless queries;
EXTERNAL STORAGE CREATION
EXTERNAL TABLE DEFINITION
As for the setting of an extarnal table, it is important to immediately define the basic format with which datalake, exporter and retriever will be managed.
.create external table Archived_processcontrol_rawData (Timestamp:datetime, TagId:string, TagValue:string, assetCode:string)kind=adlpartition by (assetId:long = hash(assetCode, 10), Date:datetime = bin(Timestamp, 1d))pathformat = ("asset_Id/" assetId "/" datetime_pattern("yyyy/MM/dd", Date))dataformat=parquet(h@’abfss://firstname.lastname@example.org/;sharedkey=YNba4Dr0uxWjDBp2yicCofZF73EWk4qN3oemtr6Y+PjhqF7raQRvhr2EJxXM4m08C6hNEG2ziSW0jkdVra76=’)
Let’s focus on the type of storage and format:
The storage type is deliberately declared as a FOLDER of an ADLS Gen2. In this way, all the files exported by the exporter will end up there, partitioned by AssetId and then by date. Let us remember that the job of a processor is to perform single checks on the process data with respect to the set processing parameters, or to verify the performance of two Similar assets, with respect to productions of the same type. Having a Container for Customer (or for productive Plant) and Subdivide in directories divided by ASSET is the best condition. Let’s check how to use ADLS as source:
- Filesystem is the name of the ADLS filesystem object (roughly equivalent to Blob Container)
- StorageAccountName is the name of the storage account
- AccountKey indicates the storage account key
CREATE CONTINUOUS EXPORT
Now let’s set the continuous export from the Silvertable to the external table:
.create-or-alter continuous-export ExportToExternalTable over (silverstage_processcontrol_rawData)to table Archived_processcontrol_rawData with(intervalBetweenRuns=1d, forcedLatency=10m, sizeLimit=104857600)<| silverstage_processcontrol_rawData| where DataTimestamp < now() - 180d| project Timestamp=DataTimestamp, TagId=TagId, TagValue=Tag_value_string, assetCode=AssetId
Obviously it is necessary to have a mapping between the “live” table and the “archived” table.
CREATE DATA MERGE
You can query both external tables and ingested data tables within the same query. You can join or union the external table with additional data from Azure Data Explorer, SQL servers, or other sources. Use a let( ) statement to assign a shorthand name to an external table reference.
In the example below, “silverstage_processcontrol_rawData” is an ingested data table and “Archived_processcontrol_rawData” is an external table that we’ve defined previously:
let T1 = external_table("Archived_processcontrol_rawData") | where ts > ago(180d);
let T = silverstage_processcontrol_rawData; //T is an internal table
| join T on TagId
| take 10
Here are some references to Microsoft documentation:
- Use ExternalData to get all files from the external DIRECTORY (!!!)
- Create the External Table on ADSL Container and set it with ASSET + Y / M / D
- Configure Continuous Export
- How to query external data sotred in the DataLake involved in the continuous Export feed