Data Analysis Principles of Data Discovery, Cataloging and Synchronization

This page provides some background on orchestrating the YOUnite process when defining data domains in a Federated data management system. The process of moving an organization towards federated data discovery, cataloging and synchronization is rigorous but, if done properly, it provides a single interface for synchronization, governance, and data event notifications.

Requirements for YOUnite can differ greatly between organizations. At one end of the spectrum the focus is:

Data synchronization: Synchronizing data between source systems

While at the other end it’s:

Data fabric: Feed graph databases and/or data warehouses to form data graphs or meshes so decision makers have access to permission-appropriate enterprise data for search, data analytics, AI and ML. The benefit of starting with this approach is that it avoids real-time synchronization in the initial phases and puts the emphasis on making data available to decision makers, i.e. data is key to a company’s success.

Start by Analyzing the Use Cases

Generally it’s best to lead with the use cases and limit your initial YOUnite deployment to just a few, gradually connecting more and more of the organization’s ecosystem to YOUnite. Use cases often equate to storyboarding, but keep in mind that this is not application storyboarding; it is data synchronization, governance, and notification storyboarding. We want the stakeholders of the applications and data in the organization to specify their needs for data. This includes the following:

Is there an immediate need for real-time source system synchronization, or can it be postponed in favor of starting with data fabrics?
What are the source systems tied to the use cases?
Who are the stakeholders for the use cases e.g. Data and Application Architects, Business Managers, etc.?
How do the source systems connect to their data?
What data elements in the source systems matter to the stakeholders?
- Start building data dictionaries of how the various source systems model the data.
- Stakeholder descriptions: For each stakeholder, describe the systems where the "truth" data elements live (see next step) and what notifications they need to receive.
Data synchronization and notification storyboarding: We want the application and data stakeholders in the organization to specify their real-time needs for "the truth." This includes descriptions of how the data will be used and which applications need to be notified when changes occur.
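
A data dictionary can start as a simple table of how each source system names and types the same logical element. The sketch below is illustrative only: the system names come from the college example used on this page, while the element names, field names, and types are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """How one source system models one logical data element."""
    element: str      # logical element name agreed on by stakeholders
    system: str       # source system that stores it
    field_name: str   # hypothetical field name in that system
    field_type: str   # hypothetical type in that system


# Hypothetical entries; real ones come out of the stakeholder analysis.
DATA_DICTIONARY = [
    DictionaryEntry("student.first_name", "SIS", "FNAME", "varchar(40)"),
    DictionaryEntry("student.first_name", "LMS", "given_name", "string"),
    DictionaryEntry("student.email", "SIS", "EMAIL_ADDR", "varchar(120)"),
    DictionaryEntry("student.email", "LMS", "email", "string"),
]


def systems_for(element: str) -> list[str]:
    """Return the source systems that model the given logical element."""
    return sorted(e.system for e in DATA_DICTIONARY if e.element == element)


print(systems_for("student.email"))
```

Grouping entries by logical element in this way makes it easier to see which systems would need an adaptor for a given data domain.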

From the data dictionaries and stakeholder descriptions a clear picture starts to take shape for:

Data domains
Adaptor development and capabilities
Governance requirements
Data event notification needs
Data virtualization
Data graphs

To Establish the Truth or Not?

Out of analysis you discover the truth, i.e. which systems hold the truth values for a given data domain. As you catalogue the data elements in a data dictionary it is important to note which systems hold the truth for the various stakeholders (zones). Knowing this reduces the amount of analysis required by creating a minimum-possible set of data elements for a given data domain.

If source-system synchronization is the primary goal, it’s also important to understand that different zones can have a different view of which systems hold the truth values for a given domain; this too must be documented as data elements for a given data domain are catalogued. Allowing different zones to define where their source of truth originates is one of the distinguishing features of YOUnite.

If a data mesh/fabric is the goal, the stakeholders will want all of the data they have permission to access so they can search it and run analytics and AI/ML. Decision makers often want to search the enterprise data like they search the web. In this case a gold truth record isn’t as important as having as much data as possible to make decisions.

Note: A zone refers to a collection of systems/applications owned by groups inside of an organization.

As the data governance staff works through the process of federated data synchronization, "truth" is often defined by the Data Governance Steward (DGS). But YOUnite provides the flexibility that allows the Zone Data Steward (ZDS) to define effective federated data. In other words, what may be truth for one zone, or for the organization as a whole (as defined by the DGS), may not be truth for another.

Example: In a college system, the truth for the “name” elements (first, last, etc.) for the student attribute is stored in both the College Application system and the College’s SIS. An LMS at a college should receive student name and email address updates when they are made in the College Application system or the SIS, but the converse is not true, i.e. the College Application system and SIS do not want student name changes made from the LMS (since name changes made at the college should only be handled by staff with the appropriate permissions to do so). In this case real-time data synchronization is required. Knowing the objectives helps rule out any concerns over sending data from the LMS to other systems and focuses the analysis on how data will flow from either the Application System or the SIS into other systems, such as the LMS.
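
The one-way flow in this example can be written down as a small rule table. The following is a minimal sketch, not YOUnite configuration: the set names and the function are hypothetical, and it assumes the two truth systems also forward name changes to each other, which the example does not state.

```python
# Systems allowed to originate student name changes, per the example.
TRUTH_SOURCES = {"College Application", "SIS"}

# Systems that hold student names and may receive changes.
# Assumption: the two truth systems also receive each other's changes.
RECEIVERS = {"College Application", "SIS", "LMS"}


def targets_for_name_change(origin: str) -> set[str]:
    """Return the systems that should receive a student name change.

    A change made in a system that does not hold the truth (here, the LMS)
    is not propagated anywhere.
    """
    if origin not in TRUTH_SOURCES:
        return set()
    return RECEIVERS - {origin}


print(sorted(targets_for_name_change("SIS")))
print(sorted(targets_for_name_change("LMS")))
```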

Another Example: College researchers are concerned about student success and want all of the data anonymized and fed into a graph database so they can start building AI algorithms to understand how to increase student success.

Think in Terms of REST

Asking use-case questions in terms of RESTful operations (HTTP GET, PUT, POST, and DELETE), following REST principles, can help keep analysis focused. Ultimately, YOUnite breaks transactions down into RESTful operations, and if you know which operations to avoid then a lot of time can be saved.

Example: The College Application system never wants to delete a student once they have been added to the system. Since this is the case, analysis for the DELETE request can be ignored with this application.
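
One way to keep track of this during analysis is a per-system list of excluded operations. The sketch below is only an analysis aid with hypothetical names; the single exclusion shown is the DELETE case from the example above.

```python
ALL_OPERATIONS = {"GET", "PUT", "POST", "DELETE"}

# Operations a system never wants performed on the student domain.
# Only the DELETE entry comes from the example; others are added as
# the analysis uncovers them.
EXCLUDED = {
    "College Application": {"DELETE"},
}


def operations_to_analyze(system: str) -> list[str]:
    """Return the operations that still need analysis for a system."""
    return sorted(ALL_OPERATIONS - EXCLUDED.get(system, set()))


print(operations_to_analyze("College Application"))
```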

The YOUnite Process is a Multi-Dimensional Cross-Cutting Concern

The following two areas must be analyzed:

The needs of performing specific operations within each system
Attributes stored in those systems and their data elements

Analyze both areas for each of the required HTTP operations (GET, PUT, POST, DELETE) in a RESTful context.

This analysis uncovers most of the challenges and metadata needed (metadata is data about data—it is not part of the actual data record but is required to properly store the data record).

Example: Incoming freshmen at a college need to take an assessment test to determine which English and Math courses they should be placed into. The assessment system holds raw test scores, and the SIS wants to combine the assessment scores with past college and high school course scores from the student’s transcripts and, from there, create its own score. In other words, the SIS wants the assessment test scores but does not store them; it only uses them as an input when creating a course placement ranking.
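
The point of this example is that an adaptor may consume an element without its system ever storing it. A minimal sketch follows; the page does not describe how the SIS computes its score, so the equal weighting, the function name, and the record layout are all placeholders.

```python
def placement_score(assessment: float, transcript_scores: list[float],
                    assessment_weight: float = 0.5) -> float:
    """Combine a raw assessment score with the mean transcript score.

    The weighting is a placeholder, not the formula any real SIS uses.
    """
    if not transcript_scores:
        return assessment
    transcript_mean = sum(transcript_scores) / len(transcript_scores)
    return (assessment_weight * assessment
            + (1 - assessment_weight) * transcript_mean)


# The SIS keeps only the derived value, not the raw assessment score.
sis_record = {"student_id": "s-001"}
sis_record["placement_score"] = placement_score(80.0, [70.0, 90.0])
print(sis_record)
```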

Adaptors are software located within a system that shares data throughout the YOUnite Data Fabric and acts as the connection point between that system and the Data Fabric. In the example above, adaptors are DI custom software that connects the application (e.g. SIS, Assessment, etc.) to YOUnite. They map data domains (and metadata) to operations in the application and follow protocols about data transformation and data governance i.e. who can see/update what. YOUnite provides fine-grained data governance controls between groups inside an organization.

It is easiest to think in the following terms and build "Data Domain Worksheets" as follows:

DELETE or GET or POST Entity → {adaptor1, adaptor2…adaptorN}

PUT Entity?attribute=key&value=value → {adaptor1, adaptor2…adaptorN}

Ultimately, the data architects create a worksheet that contains the required attributes to complete an operation for a given entity for a given adaptor.
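
A worksheet of this shape can be kept as two small mappings: one from an operation on an entity to the adaptors involved, and one from an operation, entity, and adaptor to the attributes required. This is a sketch of the bookkeeping only; the adaptor names and attributes are hypothetical.

```python
from collections import defaultdict

# (operation, entity) -> adaptors that must handle it
worksheet: dict[tuple[str, str], set[str]] = defaultdict(set)

# (operation, entity, adaptor) -> attributes required to complete it
required_attributes: dict[tuple[str, str, str], set[str]] = {}


def add_row(operation: str, entity: str, adaptor: str,
            attributes: list[str]) -> None:
    """Record one worksheet row."""
    worksheet[(operation, entity)].add(adaptor)
    required_attributes[(operation, entity, adaptor)] = set(attributes)


# Hypothetical rows for a student entity.
add_row("POST", "student", "sis-adaptor", ["first_name", "last_name", "email"])
add_row("POST", "student", "lms-adaptor", ["first_name", "last_name", "email"])
add_row("PUT", "student", "lms-adaptor", ["email"])

print(sorted(worksheet[("POST", "student")]))
```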

Data Cataloging and Discovery

Before source systems are brought online with YOUnite, the adaptors must run a discovery process to find records and attempt to match them to existing records in the YOUnite catalog.
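
The page does not describe how matching is performed, so the following is only a sketch of the general idea, assuming records are matched on a normalized email address. The function names and record layout are hypothetical.

```python
def normalize(email: str) -> str:
    """Normalize an email address for comparison."""
    return email.strip().lower()


def match_records(catalog: dict[str, str], source_records: list[dict]):
    """Split source records into matched and unmatched.

    catalog maps a normalized email to an existing catalog record id.
    Returns (matched, unmatched): matched maps source id -> catalog id,
    unmatched lists the source ids that had no catalog match.
    """
    matched, unmatched = {}, []
    for rec in source_records:
        catalog_id = catalog.get(normalize(rec["email"]))
        if catalog_id is None:
            unmatched.append(rec["id"])
        else:
            matched[rec["id"]] = catalog_id
    return matched, unmatched


catalog = {"ada@example.edu": "cat-1"}
records = [
    {"id": "sis-7", "email": " Ada@Example.edu "},
    {"id": "sis-8", "email": "bob@example.edu"},
]
matched, unmatched = match_records(catalog, records)
print(matched, unmatched)
```

Unmatched records would typically be candidates for new catalog entries, while matched records link the source system's copy to the existing one.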

Just Because Data Domains Can Be Modeled as Multi-Dimensional Doesn’t Mean They Should Be

The JSON modeling tool with YOUnite is very powerful in that it allows a data architect to create very complex inter-dependencies between data domains, which should be avoided. When designing data domains, relational database principles should be followed. The following points illustrate a couple of pitfalls to avoid when building structurally-complex data domains:

If a data domain has nested levels of objects and arrays, it’s typically a good candidate for being broken out into multiple domains.
Arrays inside of a domain can create governance issues where one zone may not have governance over an entire array. If this is a possibility, the array should probably be broken out into another data domain where governance can be managed.

To summarize, by following sound relational database principles we create a federated data fabric with data records that are easier to manage and to apply governance to.
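
Breaking an array out of a domain amounts to the same normalization step used in relational design: the parent keeps its scalar elements and each array item becomes a record in its own domain with a reference back to the parent. A minimal sketch, using a hypothetical student record with an addresses array:

```python
nested_student = {
    "id": "s-001",
    "name": "Ada Example",
    "addresses": [
        {"type": "home", "city": "Springfield"},
        {"type": "campus", "city": "Shelbyville"},
    ],
}


def split_out_array(record: dict, array_key: str, parent_ref: str):
    """Return (flat_parent, child_records) with the array broken out.

    Each child record carries parent_ref pointing at the parent's id, so
    governance can be applied to the child domain separately.
    """
    parent = {k: v for k, v in record.items() if k != array_key}
    children = [{parent_ref: record["id"], **item}
                for item in record[array_key]]
    return parent, children


student, addresses = split_out_array(nested_student, "addresses", "student_id")
print(student)
print(addresses)
```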

If an HTTP Operation Is Not Required for an Adaptor, Don’t Analyze It

Example: There is never a situation where the analysts for the College Application system want YOUnite to create (POST) a new student; they need to maintain control of that process. There is no need to analyze the required elements for a POST /student for the College Application system.

Generally Speaking, All Changes to a Data Record Should Generate a Change Event to All Adaptors Interested in That Data Domain

If an application tied to an adaptor has a well-written RESTful interface, it will allow you to register a callback for changes. If not, then you will need to discover a way to detect changes.

Additionally, all new and deleted resources should generate a notification (this is a YOUnite feature).

Example: A college course catalogue system would not get a notification that a student has been deleted from the system but several other systems would, such as the College Application system and the college SIS.

Note: If data synchronization is happening outside of YOUnite there is a good possibility that these synchronizations won’t be detected by YOUnite and the benefits of unified data governance and data event notifications won’t be realized. There is a possibility that it can work against the entire process. Even worse, cycles of repeated unnecessary (possibly infinite) updates can be started. For information on Data Governance and developing an Array Advisory Practice to be communicated to adaptor developers for how to handle updated arrays, see Data Domains: Array.
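
To illustrate how an update cycle arises and one common way to stop it, the sketch below has two systems that each forward every change to the other. Suppressing the event when the incoming value already matches the stored value ends the cycle after one round trip. This is a general technique shown under simplified assumptions, not a description of how YOUnite handles the problem.

```python
def apply_update(store: dict, key: str, value: str) -> bool:
    """Apply an update; return True only if the stored value changed."""
    if store.get(key) == value:
        return False  # no-op update: emit no change event
    store[key] = value
    return True


system_a: dict = {}
system_b: dict = {}
events = 0

# Each pending item is (target store, other store, key, value).
pending = [(system_a, system_b, "email", "ada@example.edu")]
while pending:
    target, other, key, value = pending.pop()
    if apply_update(target, key, value):
        events += 1
        # A real change is forwarded to the other system.
        pending.append((other, target, key, value))

print(events)
```

Without the equality check in apply_update, every forwarded update would count as a change and the loop would never drain.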

If Data Elements Are Used by Only One System, Then Don’t Normalize Them Unless They Are Used Inside Another Data Domain

The job of the data analyst is to create as little work as possible. A single element added to a federated data domain has an exponential effect on the complexity of the overall system.

Example: A college system uses an Ed Planning system that tracks meetings between the student and college faculty and staff. Other systems may use the Ed Planning data, but if no other systems in the organization use the scheduling data, then the scheduling data can be ignored with respect to modeling student, faculty, or college data domains.
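
A quick way to apply this rule is to count how many systems use each element and keep only the shared ones as candidates for a federated data domain. The sketch below uses hypothetical element names and does not cover the exception in the heading, where a single-system element is referenced from inside another data domain.

```python
from collections import defaultdict

# (system, element) pairs gathered during analysis; values are hypothetical.
usage = [
    ("Ed Planning", "meeting.scheduled_at"),
    ("Ed Planning", "student.email"),
    ("SIS", "student.email"),
    ("LMS", "student.email"),
]


def shared_elements(pairs: list[tuple[str, str]]) -> list[str]:
    """Return the elements used by more than one system."""
    systems_by_element: dict[str, set[str]] = defaultdict(set)
    for system, element in pairs:
        systems_by_element[element].add(system)
    return sorted(e for e, s in systems_by_element.items() if len(s) > 1)


print(shared_elements(usage))
```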

The Process is Iterative

Start small and gradually connect more applications and services in the organization to the YOUnite Data Fabric.

A Couple of Additional Points

The YOUnite adaptor might need to read and manipulate metadata to complete transactions.
When building a YOUnite worksheet you also need a reference data worksheet. This is data that changes infrequently (e.g. States, Countries, etc.) but is commonly cross-referenced by other domains (e.g. customers).
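
As a small illustration of the second point, reference data can be held as a lookup table that other domains are checked against. The state codes are a sample subset and the customer records are hypothetical.

```python
# Reference data: changes rarely, cross-referenced by other domains.
REFERENCE_STATES = {"CA": "California", "NY": "New York"}  # sample subset

customers = [
    {"id": "c-1", "state": "CA"},
    {"id": "c-2", "state": "ZZ"},  # not a known reference value
]


def invalid_state_refs(records: list[dict]) -> list[str]:
    """Return ids of records whose state is not in the reference data."""
    return [r["id"] for r in records
            if r.get("state") not in REFERENCE_STATES]


print(invalid_state_refs(customers))
```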