From 6e53a2487f9be3f66e6c4f2564b26c84e90682bd Mon Sep 17 00:00:00 2001
From: badra001
Date: Fri, 11 Jul 2025 15:31:40 -0400
Subject: [PATCH] add dataset types snippet

---
 aws/projects/edl/dataset-types.md | 10 ++++++++++
 1 file changed, 10 insertions(+)
 create mode 100644 aws/projects/edl/dataset-types.md

diff --git a/aws/projects/edl/dataset-types.md b/aws/projects/edl/dataset-types.md
new file mode 100644
index 00000000..218a9178
--- /dev/null
+++ b/aws/projects/edl/dataset-types.md
@@ -0,0 +1,10 @@
+# Dataset Types
+
+CMS is authoritative for each of its volumes. A volume is also called the type, and is considered a namespace (geo, decennial, econ, mixed, etc.). CMS is responsible for all of the datasets underneath it. These become IRE datasets. An IRE dataset is a DMS dataset. It may be wholly sourced from CMS (external data, for example), or it may be a subset of the columns or records of a non-IRE dataset (say, something made internal like BR). Suppose BR is wanted in IRE as a dataset, but without the SSN column. That would be a different dataset, still called br. CODS maintains these datasets.
+
+DMS is authoritative for its volumes, and those cannot have the same names (type or type_id) as any in CMS. So these carry an edl- prefix to distinguish them, and they have different type_id values (as these go into the POSIX group). In the example above, the full BR dataset would be edl-econ/br. A single namespace cannot be authoritative in both systems. The data owner maintains these datasets.
+
+The path on a file system, in S3, or in any other store must preserve these types:
+
+/data/{type}/{group}/{instance}
+
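
The naming rule and path layout described in the note can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's tooling: the function names `qualified_type` and `dataset_path`, the `"CMS"`/`"DMS"` authority strings, and the instance value `current` are all hypothetical, and the sketch assumes the dataset name (`br`) occupies the `{group}` slot, as `edl-econ/br` suggests.

```python
from pathlib import PurePosixPath

DMS_PREFIX = "edl-"  # prefix the note gives for DMS-owned types


def qualified_type(base_type: str, authority: str) -> str:
    """Return the type name as it would appear in a path (hypothetical helper)."""
    if authority == "CMS":
        # A CMS type must not look like a DMS one; namespaces cannot overlap.
        if base_type.startswith(DMS_PREFIX):
            raise ValueError(f"CMS type {base_type!r} must not use the {DMS_PREFIX!r} prefix")
        return base_type
    if authority == "DMS":
        # Add the prefix only if it is not already present.
        return base_type if base_type.startswith(DMS_PREFIX) else DMS_PREFIX + base_type
    raise ValueError(f"unknown authority: {authority!r}")


def dataset_path(type_name: str, group: str, instance: str, root: str = "/data") -> str:
    """Build /data/{type}/{group}/{instance}, rejecting empty or nested parts."""
    for part in (type_name, group, instance):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return str(PurePosixPath(root) / type_name / group / instance)


if __name__ == "__main__":
    # Full BR dataset owned by DMS; "current" is a made-up instance name.
    print(dataset_path(qualified_type("econ", "DMS"), "br", "current"))
    # IRE subset under the CMS-owned econ type, still called br.
    print(dataset_path(qualified_type("econ", "CMS"), "br", "current"))
```

Keeping the prefix logic in one helper means the same `{type}` string can be reused for a file-system path and for an S3 key, which is one way to satisfy the requirement that every store preserve the type.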