Data architecture for IIoT
To ensure that we handle data properly and leverage the six characteristics of data, we need a well-defined data strategy for IIoT solutions. The data strategy should address each layer in the flow of data across the solution. The following diagram captures the overall data architecture for IIoT solutions, followed by a description of each of the layers:
As shown in the preceding figure, it's important to think of the overall data architecture as split into separate layers that interact through interfaces. This is what gives the architecture its scalability, as each layer can be scaled up and down according to the overall system requirements. It also adds to the reliability of the system, as failures and issues in one layer can be isolated and handled appropriately.
A brief description for each layer is provided here:
- Data Capture Layer: In modern IIoT systems, data is captured from sensors, which generate logs of operational technology (OT) data, as well as from ERP/HR/CRM tools and other enterprise data stores that are more IT-specific. Other data types, such as documents, emails, and web data/logs, can also be captured and combined with the IT/OT datasets to generate meaningful insights. The main challenge in capturing these datasets is the variety of protocols used by the various devices and process historians. Companies such as PTC and startups such as Atomiton provide solutions that abstract away these protocol-specific details so that developers can concentrate on the business logic of their applications rather than on the particulars of each IIoT device (a minimal MQTT capture sketch follows this list).
- Data Ingestion Layer: The ingestion layer needs to be built on technologies that move the captured data, in its current form, to the storage and analytics layers. The first-time transfer of data from the edge to the storage/analytics layer will require something like Talend for a batch transfer. For IT data residing in existing ERP systems and CRM tools, bulk transfer is possible using tools such as Sqoop. Solutions with more online analytics applications collecting log data will require something like Flume from the Apache set of technologies. Once the initial transfer of a batch or a large dataset is completed, a solution such as HVR can be used for change data capture (CDC). Further data parsing and lineage capture during the ingestion phase is possible using Informatica, and real-time insights on time-series data can be obtained using tools such as Druid.
- Data Messaging Layer: In order to optimize data transfer rates and implement high-performance use cases, it is essential to put in place a messaging infrastructure that helps parallelize and properly direct data loads. The most commonly used messaging technologies are Kafka and RabbitMQ (see the Kafka producer sketch after this list).
- Data Storage Layer: Given that we need to capture a variety of data types with different profiles and characteristics, we need to leverage more than one technology to provide effective storage for IIoT solutions. Typically, an IIoT solution captures minimal amounts of data at the edge, with most of the data transferred to the cloud for batch processing. Some form of data lake serves as the landing ground for most of this data, dumped in its raw form, with a schema-on-read rather than schema-on-write methodology being followed. Such a data lake should be built on several technologies to serve different needs. Some are listed here, with details provided in the following table: HDFS for batch processing, Greenplum for faster real-time processing, NoSQL storage platforms such as Cassandra for time-series data (see the Cassandra sketch after this list), a blob store for static web content, images, and multimedia, and a graph database such as Neo4j for use cases that require storing highly connected datasets and their transformations.
- Data Analytics Layer: This is the layer where analytics is performed on the data, and it requires the right tools for the specific analytics to be performed on the datasets. For example, if there is a need for real-time analytics, Storm would be a good choice, whereas for more interactive SQL-like analytics, Spark SQL would be a good choice (see the PySpark sketch after this list). If the need is to do data science analysis, then R Studio, Anaconda, and related data science libraries will be required to build models and train them on the sample datasets collected.
- Data Caching Layer: In order to improve reporting performance for periodic, repeated, or aggregated datasets, it is important to have a caching layer that provides in-memory availability of data for consumption. This layer also abstracts the consumption layer from the storage layer before the data is finally consumed in some form, such as reports (a cache-aside sketch follows this list).
- Data Consumption Layer: Depending on the use case, the data will be consumed in the form of dynamically generated reports, dashboards for quick viewing, and real-time actions, as well as queries used for creating data marts and useful extracts. Several tools available in the market, such as Tableau and Spotfire, can be used to build this layer.
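To make the capture layer concrete, here is a minimal sketch of subscribing to sensor telemetry over MQTT, one common IIoT protocol. The broker address and topic are illustrative assumptions, not taken from the text, and the sketch assumes the paho-mqtt 1.x client API:

```python
# A minimal sketch of capturing OT telemetry over MQTT with paho-mqtt (1.x API).
# The broker address and topic hierarchy are illustrative assumptions.
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Each message carries one raw sensor reading; hand it to ingestion as-is.
    print(f"{msg.topic}: {msg.payload.decode('utf-8')}")

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.factory.local", 1883)  # hypothetical broker host
client.subscribe("plant/line1/+/telemetry")   # '+' matches any sensor ID
client.loop_forever()
```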
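For the messaging layer, the following sketch publishes a sensor reading to a Kafka topic using the kafka-python client; the broker address, topic name, and message fields are assumptions for illustration:

```python
# A minimal sketch of publishing a sensor reading to Kafka with kafka-python.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",           # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {"sensor_id": "pump-7", "temperature_c": 81.4, "ts": 1700000000}
producer.send("iiot-telemetry", value=reading)    # topic name is illustrative
producer.flush()                                  # block until delivery
```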
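For the time-series portion of the storage layer, this sketch writes readings into Cassandra using the DataStax Python driver; the keyspace, table, and schema are illustrative assumptions:

```python
# A minimal sketch of storing time-series readings in Cassandra.
# Keyspace/table names and the schema are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS iiot
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS iiot.sensor_readings (
        sensor_id text,
        ts timestamp,
        temperature_c double,
        PRIMARY KEY (sensor_id, ts)      -- partition per sensor, ordered by time
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
session.execute(
    "INSERT INTO iiot.sensor_readings (sensor_id, ts, temperature_c) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("pump-7", 81.4),
)
```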
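For the analytics layer, here is a minimal sketch of interactive SQL-style analytics with PySpark over data already landed in the lake; the Parquet path and column names are assumptions:

```python
# A minimal sketch of interactive SQL analytics with PySpark.
# The HDFS path and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iiot-analytics").getOrCreate()

readings = spark.read.parquet("hdfs:///iiot/raw/sensor_readings")
readings.createOrReplaceTempView("sensor_readings")

# Hourly average temperature per sensor, computed with plain SQL.
hourly = spark.sql("""
    SELECT sensor_id,
           date_trunc('hour', ts) AS hour,
           avg(temperature_c)     AS avg_temp
    FROM sensor_readings
    GROUP BY sensor_id, date_trunc('hour', ts)
""")
hourly.show()
```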
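For the caching layer, the following sketch shows a cache-aside pattern. Redis is used here as one common in-memory store (the text does not prescribe a specific caching technology), and compute_from_storage is a hypothetical placeholder for the expensive storage-layer query:

```python
# A minimal sketch of a cache-aside pattern for aggregated results,
# using Redis as an assumed in-memory store (requires redis-py).
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_hourly_aggregate(sensor_id: str) -> dict:
    key = f"agg:hourly:{sensor_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)              # serve from memory on a hit
    result = compute_from_storage(sensor_id)   # hypothetical storage-layer query
    cache.setex(key, 3600, json.dumps(result)) # cache the result for an hour
    return result
```

The consumption layer reads through this function rather than hitting the storage layer directly, which is the abstraction the caching layer is meant to provide.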
This architecture is the base-level enterprise architecture for IIoT solutions; however, as your IIoT solutions scale and are required to handle petabytes of data from multiple clients, the architecture needs to be taken one step further and made multitenant and distributed. This can be achieved by virtualizing and containerizing the caching and analytics (transformation and aggregation) layers. This is depicted in the following figure:
