diff --git a/aws/projects/edl/dataset-types.md b/aws/projects/edl/dataset-types.md
new file mode 100644
index 00000000..218a9178
--- /dev/null
+++ b/aws/projects/edl/dataset-types.md
@@ -0,0 +1,10 @@
+# Dataset Types
+
+CMS is authoritative for its volume, which is also called the type and is treated as a namespace (for example geo, decennial, econ, or mixed). CMS is responsible for all of the datasets underneath it. These become IRE datasets. An IRE dataset is a DMS dataset. It may be wholly sourced from CMS (external data, for example), or it may be a subset of the columns or records of a non-IRE dataset (something made internal, like BR). For example, if BR were wanted in IRE as a dataset but without the SSN column, that would be a different dataset, still called br. CODS maintains these datasets.
+
+DMS is authoritative for its own volumes, and those cannot share a name (type or type_id) with one in CMS. To distinguish them, DMS volumes carry a prefix of edl- and have different type_id values (as these go into the POSIX group). In the example above, the full BR dataset would be edl-econ/br. A single namespace cannot be authoritative in both systems. The data owner maintains these datasets.
+
+The path on a file system, in S3, or in any other store must preserve these types:
+
+/data/{type}/{group}/{instance}
+
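A minimal sketch of how the naming rules in this file could be checked in code, assuming Python. The function names `authority` and `dataset_path`, the `/data` root default, and the instance value `2020` are hypothetical and do not come from the document. Inferring the authoritative system from the edl- prefix alone is also an assumption, as is treating the dataset name (br) as the `{group}` segment.

```python
from pathlib import PurePosixPath

# Prefix the document says DMS-owned types carry to keep them distinct from CMS types.
EDL_PREFIX = "edl-"


def authority(type_name: str) -> str:
    """Guess which system is authoritative for a type (assumption: prefix decides)."""
    return "DMS" if type_name.startswith(EDL_PREFIX) else "CMS"


def dataset_path(type_name: str, group: str, instance: str,
                 root: str = "/data") -> PurePosixPath:
    """Build /data/{type}/{group}/{instance}, keeping the type as its own segment."""
    for part in (type_name, group, instance):
        # Reject empty segments and anything that could escape or split a segment.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path segment: {part!r}")
    return PurePosixPath(root) / type_name / group / instance


if __name__ == "__main__":
    # The full BR dataset from the example, under the DMS-owned type.
    print(dataset_path("edl-econ", "br", "2020"))  # /data/edl-econ/br/2020
    print(authority("edl-econ"), authority("econ"))  # DMS CMS
```

Because the type stays a separate path segment, econ/br and edl-econ/br never collide even though both datasets are called br. The same segments could serve as an S3 key prefix, though the document does not say how keys are actually laid out.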