
| boxes to fill in | type of box | what is in the drop-down menu? | what is this? which data set is it attached to? what kind of information goes there? how do we know what to choose? |
|---|---|---|---|
| Entity name | free-form | enter, a name, often matching an entity in SEAD that we have not yet created an entity for in this project | |
| Type | drop-down menu | Data (derived) SQL Query Fixed Values CSV File Excel File (Pandas) Excel File (OpenPyXL) | from where are we going to extract the data to transform? |
| Source Entity (only present for certain Types above) | drop-down menu | (drop down contents change based on Type) | the incoming data set |
| System ID | pre-set | don’t enter, it fills in self | |
| Public ID | free-form | the name of the ID in SEAD’s system, so for sites, enter site_id | |
| Business Keys | free-form | the list of unique set of columns in the incoming data that are needed to uniquely define each row of data for the data set | |
| Columns | free-form | ||
| Depends On | free-form | used to define dependencies. In some datasets every site has to have a defined site type. the “before you can do X you need to know Y” sorts of information note: if you chose type “derived” then the dependency is already defined on what you derived it from If this step is done wrong, the validation will fail | |
| Drop Duplicates | check box | this step is how you get a complete list of each unique catagory of data. If we are making a list of all sites in a dataset, How needed this is will depend on the incoming dataset. The list of columns works like Excel’s drop duplicates | |
| Check Functional Dependency | check box | This button asks the program to confirm that if one considers only the columns entered in the “drop duplicates” column that three aren’t examples of (for example) a single site with multipile coordinate systems | |
| Drop Empty Rows | check box | useful if there were blank rows | |
| Deduplication Columns | free-form | define the columns needed to get a unique set (example site name plus national site id ) | |
| Colums to Check for Empty Values | free-form | use for columns so important that that it doesn’t mater if data exists in other columns | |
| SQL Query (only present if Type = SQL Query) | free-form | (note: use the ) | write a query that extracts all the pertanate columns from the data set for the table you are working on |
| Data (derived) we already have data in an entity we have already defined in Shape Shifter | |||
| SQL Query use if you want to use an SQL query to extract data from a data source | |||
| Fixed Values | |||
| CSV File | |||
| Excel File (Pandas) | |||
| Excel File (OpenPyXL) |
ETL = extracts data, transforms it, and loads it
So for SQL we write a query to extract all of the site related data(for example)
Fixed values: use if you have data on paper, and it needs entering to be machine readable, or it is a data type that doesn’t exist in the incoming dataset, but is required in SEAD
Next: 2.1.2 Foreign Keys Tab