You are new to Predix Studio and attempting to build your first Studio app. You might have tested some queries using the Indexer, and maybe even set up some charts using sample data. But how do you start working with your own data? How do you link your data sources to Predix Studio?
The Data Management Workbench (DMW) is Predix Studio's fully integrated, intelligent data schema discovery and modeling feature. By using an AI engine to quickly learn about your data, the DMW makes it easy to create models and set up the data ingestion pipeline.
This guide uses traffic data from the Chicago Data Portal to illustrate how to use the DMW. You will use the DMW to perform the following tasks.
Create a project.
Use the AI engine to learn about your data.
Create and publish a visualization model.
Create adapters to ingest the data.
Dependencies:
Access to Predix Studio
The Project page is used to create and manage Data Management projects. This section shows how to create a project, update the project's options and description, review the log records created for the current project session, and optionally delete the project entirely.
In this phase, you will create a new project to manage the Chicago data.
From the Data Integration menu, choose Data Management Workbench.
Click New Project, then enter com.chicago in the Project Name field.
Click Insert, and verify that you see the following page.
Here's the Project page breakdown:
The menu on the left shows the current phase (Project, in this example) and the next available phase.
Options: This XML code specifies how the AI engine learns from your data sources. This will be relevant in the Explore phase of the project. Check out Project Options for more details on configuration.
Description: A brief description of your project.
Project Log: Records of actions completed.
For more details about this phase including configuring options, check out the Project Section documentation on Predix.io.
To learn more about a data set and generate a model, the source data files must be added to the project. This section shows you how to list the available sources and then add them to your project.
Here you will add the Chicago data files to the DMW and then move them into your project.
Step 1: In the left navigation menu, choose Source.
Source page breakdown:
The phase timeline at the top left shows the status of your project; the orange label indicates that the current phase (Source) is in progress.
Data Sources: Sources that have been added to your project.
Available Data Sources: Sources in the workbench that can be added to your project.
Step 2: Download the sample CSV files and drag and drop them into the Available Data Sources grid.
You should see them upload and appear in the grid.
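If you prefer to script the download, the Chicago Data Portal exposes each dataset as a CSV export. The sketch below is a minimal example; the dataset IDs are placeholders, so look up the real IDs for the traffic, congestion, towed-vehicle, and parking-permit datasets on data.cityofchicago.org.

```python
# Minimal sketch: fetch Chicago Data Portal datasets as CSVs.
# The Socrata dataset IDs below are placeholders (hypothetical);
# find the real ones on https://data.cityofchicago.org.
import urllib.request

DATASETS = {
    "ChicagoTraffic.csv": "xxxx-xxxx",      # placeholder dataset ID
    "ChicagoCongestion.csv": "yyyy-yyyy",   # placeholder dataset ID
}

for filename, dataset_id in DATASETS.items():
    url = f"https://data.cityofchicago.org/resource/{dataset_id}.csv"
    urllib.request.urlretrieve(url, filename)
    print(f"Saved {filename}")
```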
Step 3: Click the green arrows next to each Chicago data CSV to move them into your project.
You might notice that the Source label in the status timeline has changed from orange to green. That means you have successfully added your sources to the project.
To learn more about configuring sources such as HTTP sources, check out the guide on Adding Data Sources to a Project.
This section shows you how to trigger the engine to "learn" about your data from the data itself. The engine learns about entities and connections, verifies the keys, and then trains its classifier. It then presents what it has learned in the form of a canonical model.
In this phase, the DMW explores the structure of the Chicago data you added. The AI engine will create a model for you to start working with.
In the left navigation menu, choose Explore.
Explore page breakdown:
Entities: These are the entities that your data sets represent.
Learned Unique Keys: Learned unique keys from your data sets.
Data Source Fields: The fields of your data sources, with their column indexes.
Learned Connections: The engine's deductions about the most likely connections between entities.
Learned Features: Types of data each source field represents.
Click Learn Entities.
The Learn Entities label turns yellow, indicating the AI engine is processing. This may take a few moments.
After the AI engine finishes processing, the label turns green. The engine learned some unique keys from the traffic data. The Learned Unique Keys and Data Source Fields data grids should be populated.
Click Verify Keys and wait for the engine to test the new keys.
The engine tests how likely it is that the learned keys are unique within each entity. The % Confidence column in the Learned Unique Keys grid should now be populated.
Note: For the entity ChicagoCongestion, the regionId field uniquely identifies the entity with 100% confidence, but other fields (and combinations of fields) may also identify this entity.
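To build intuition for the % Confidence score, here is a rough sketch of one way key uniqueness can be estimated. This is not necessarily the engine's actual algorithm, and it assumes the CSV column is named regionId as it appears in the workbench.

```python
# Rough sketch of key-uniqueness scoring (illustrative, not the DMW's
# actual algorithm): how close is a candidate key to identifying every
# row uniquely?
import pandas as pd

def key_confidence(df: pd.DataFrame, columns: list) -> float:
    """Fraction of distinct candidate-key values relative to total rows."""
    return df[columns].drop_duplicates().shape[0] / len(df)

congestion = pd.read_csv("ChicagoCongestion.csv")
# A true unique key scores 1.0 (100% confidence).
print(key_confidence(congestion, ["regionId"]))
```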
Click Learn Connections to discover how the entities connect with each other.
The engine deduces most likely connections (relationships) between entities. The Learned Connections and Learned Features data grids should be populated.
Take note of the field relationship in the first row (id > regionId). We will discuss whether this connection is valid in the Visualize phase.
Click Train Classifier and wait for the engine to store its recent learnings.
The classifier is used in the Ingest phase to map data source fields with the fields of the model you are creating.
If you want to stop a step in progress, such as Learn Entities, you can click Stop Learning.
To delete your learning data and restart the learning process, you can click Clear Learning. (This displays a confirmation prompt.)
To learn more, check out the documentation on the Explore Section.
This section shows you how to view, explore, validate, and manipulate the generated canonical model and then save it for future use.
In this phase, you see a visual model of what the AI engine learned in the Explore phase, and you will verify the relationships between the entities.
In the left navigation menu, choose Visualize.
Visualize page breakdown:
The main view displays a visual representation of your model. Each point represents an entity, and the lines between them represent connections, if any exist.
Save: Saves your changes to the model.
Undo: Undoes changes made to the model.
You can click and drag the entities in the model to rearrange the view.
Step 1: Click the discovered connection between ChicagoCongestion and ChicagoTraffic.
Step 2: Click the single connection to expand the discovered connection details.
It looks like the engine found the regionId values of ChicagoCongestion contained in the id values of ChicagoTraffic.
Is this a useful connection or an unfortunate coincidence? Here, a local table row id holds a set of values ([1, 2, 3, ...]) that happens to match the values of a data field in another entity ([1, 2, 3, ...]). In this case, the connection is a coincidence and is undesired.
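You can sanity-check a learned connection yourself with a simple containment test: if nearly every regionId value also appears in the other entity's id column, a candidate connection will surface, even when the overlap is just two small integer ranges coinciding. A rough sketch, assuming the CSV columns carry the same names shown in the workbench (again, an illustration rather than the engine's actual algorithm):

```python
# Sketch of the containment test behind a learned connection: a candidate
# relationship appears when one entity's values are (mostly) a subset of
# another's. Auto-increment row ids (1, 2, 3, ...) satisfy this trivially,
# which is why the id > regionId connection is a false positive.
import pandas as pd

congestion = pd.read_csv("ChicagoCongestion.csv")
traffic = pd.read_csv("ChicagoTraffic.csv")

region_ids = set(congestion["regionId"].dropna())
row_ids = set(traffic["id"].dropna())

overlap = len(region_ids & row_ids) / len(region_ids)
print(f"{overlap:.0%} of regionId values also appear in ChicagoTraffic.id")
```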
Step 3: Click the detailed connection data between ChicagoParkingPermits and ChicagoCongestion.
It looks like the engine found another undesired (false-positive) connection, between ChicagoCongestion regionId and ChicagoParkingPermits id.
Step 4: Remove each of the discovered connections. (They are all undesired because the entities are independent of each other and there is no hierarchy.)
a. Right-click each connection to remove it.
b. Click the grey Save button at the bottom right of the screen.
c. Click Confirm Save; you will see confirmation notifications at the top right.
To get a better handle on viewing and modifying your models visually, check out the guide on Graph Tools.
After you have finished basic model manipulation by examining and trimming the model in the Explore and Visualize phases, you can define the entities as local data types in the Indexer and publish them. You can still modify the models before publishing.
Each entity and its fields are housed inside a package. In this phase, you rename the packages appropriately and then publish them in a new model.
Step 1: In the left navigation menu, choose Publish.
Publish page breakdown:
Packages: The list of packages, one per entity.
Objects: The fields within the selected package.
Now rename each package to reflect the entity it contains.
Step 2: Click Package1 from the Packages list, then click the edit (pencil) icon to edit the name.
Step 3: Rename Package1 to StreetNames.
Step 4: Click the save icon to save your changes.
Step 5: Repeat the previous steps to rename the remaining packages:
Package2 = Congestion
Package3 = VehiclesTowed
Package4 = ParkingPermits
Package5 = Traffic
Step 6: Click the grey Publish All button on the bottom-right side of the page to publish all the packages into the Chicago model.
Step 7: In the Model Selection dialog, click the blue Create New Model button.
Step 8: Enter chicago
into the Model Name field, and then click Create Model and Publish Packages.
You have now created and saved a new model (chicago) based on what was learned in the process.
To learn more about modifying objects and models, check out the documentation on the Publish Section.
This section walks through how to map the source data to newly created or existing models, and then create and execute adapters to ingest the source data.
Step 1: In the left navigation menu, choose Ingest.
You can use this page to map the source data fields to the model you created.
Step 2: Click the add (+) button next to the Analyze All button.
Step 3: In the Model Selection dialog, select your model.
Step 4: Click Set Selected Models to close the dialog.
Step 5: Click Analyze All.
The Data Mappings grid should now contain the mappings between the source data fields (normalized and labeled under Source Field) and the model fields (under Target Field, in the format objectname_fieldname).
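Conceptually, the result is a lookup from each normalized source field to a model field in the objectname_fieldname format. The sketch below is purely illustrative, with hypothetical field names:

```python
# Illustrative only: the shape of a source-to-target field mapping.
# Field names are hypothetical; yours come from the Analyze All step.
field_mapping = {
    "region_id": "Congestion_regionId",
    "speed": "Congestion_currentSpeed",
    "tow_date": "VehiclesTowed_towDate",
}
```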
For more information on this process, check out the guide on Ingest Data.
Now you can run the adapter to ingest the source data using the Chicago model.
Step 6: Click the grey Run Adapter button at the bottom left corner of the page. An orange warning appears at the top right, saying that you need to create storage (the indices) before the data can be ingested.
Step 7: Click Manage Models in the top right corner.
Step 8: Click the Chicago model, and then click Create Storage.
Step 9: Click the Run Adapter button again to ingest the data. You should see all green badges in the progress bar at the top.
Now that you've ingested the data, you can test a query using the Indexer. The following example shows a query for all vehicles towed in Chicago and the result.
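The aggregation guides listed below suggest the Indexer accepts Elasticsearch-style query bodies. As a hedged sketch (not verified against your Studio version), a match-all query over the towed-vehicles data might look like the following; you would paste the printed JSON into the Indexer's query window.

```python
# Hypothetical Elasticsearch-style query body for "all vehicles towed",
# assuming the Indexer accepts this query style. The index you query and
# the exact syntax may differ in your Studio version.
import json

query = {
    "query": {"match_all": {}},  # match every towed-vehicle document
    "size": 10,                  # return the first 10 hits
}
print(json.dumps(query, indent=2))
```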
Related guides:
How to Use Significant Terms Aggregation With Director Service Debugger (Beta 10.4)
How to Use Statistical Metrics Aggregations With Director Service Debugger (Beta 10.4)
How to Use Percentiles Aggregation With Director Service Debugger (Beta 10.4)
How to Use Terms Aggregation With Director Service Debugger (Beta 10.4)
How to Use Cardinality Aggregation With Director Service Debugger (Beta 10.4)
How to Use Histogram Aggregation With Director Service Debugger (Beta 10.4)
How to Use the Date Histogram Aggregation With Director Service Debugger (Beta 10.4)
How to Use the Range Aggregation With Director Service Debugger (Beta 10.4)
How to Use the IP Range Aggregation With Director Service Debugger (Beta 10.4)
How to Use the Date Range Aggregation With Director Service Debugger (Beta 10.4)