Creating a Data Package

This vignette will show how to create and edit Data Packages.

Creating a Data Package

The new_datapackage() function creates a new Data Package:

> library(datapackage)
> dir <- tempfile()
> dp <- new_datapackage(dir, name = "example", 
+   title = "An Example Data Package")
> dp
[example] An Example Data Package

Location: </tmp/RtmpQEaJp2/file41e6ac10d5a>
<NO RESOURCES>

This returns an editabledatapackage object. Any change to the Data Package is saved immediately to the datapackage.json file, and properties are read from that file each time they are requested. It is therefore possible to edit the datapackage.json file manually while working with the Data Package in R.

> list.files(dir)
[1] "datapackage.json"

Using methods such as dp_title() and dp_description() the properties of the Data Package can be modified.

> dp_description(dp) <- "This is a description of the Data Package"

The dp_description<-() method also accepts a character vector of length > 1. This makes it easy to read the contents of the description from a file, as it can be difficult to write long descriptions directly in R code. It is possible to use markdown in the description.

dp_description(dp) <- readLines("description.md")
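As a minimal base R sketch of what readLines() hands over in the call above: it returns one string per line of the file. The file name and its contents below are made up for illustration, and joining the lines with newlines is only an assumption about how a multi-line description becomes one text; this vignette does not state how the package combines them.

```r
# Write a small, made-up markdown description to a temporary file
desc_file <- tempfile(fileext = ".md")
writeLines(c("# Example", "", "A *longer* description",
  "spanning several lines."), desc_file)

# readLines() returns a character vector with one element per line
desc <- readLines(desc_file)
length(desc)

# Assumption: the lines are combined into a single text using newlines
desc_text <- paste(desc, collapse = "\n")
cat(desc_text, "\n")
```

A vector like desc is what would be assigned with dp_description(dp) <- desc.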

The following methods are currently (at the time of writing this vignette) supported:

dp_title<-()
dp_contributors<-() and dp_add_contributor<-()
dp_description<-()
dp_id<-()
dp_name<-()
dp_created<-()
dp_keywords<-()
dp_property<-(): this function also allows custom properties.

For an up-to-date list, run the following:

methods(class = "datapackage") |> (\(x) x[grep("<-", x)])()
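The same idiom can be tried on a class that ships with base R. The sketch below uses the data.frame class purely as a stand-in, so it runs without the datapackage package; the methods it lists are unrelated to Data Packages.

```r
# All S3 methods registered for data frames (a base R class, used here
# only as a stand-in for "datapackage")
m <- as.character(methods(class = "data.frame"))

# Keep the replacement ("setter") methods: their names contain "<-".
# fixed = TRUE matches the literal string instead of a regular expression.
setters <- grep("<-", m, fixed = TRUE, value = TRUE)
setters
```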

Below is an example of adding a contributor to the package:

> dp_add_contributor(dp) <- new_contributor("Jane Doe", role = "author",
+   email = "j.doe@organisation.org")

Adding a dataset to the datapackage

In this example we will save the iris dataset to a new datapackage.

> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In order to store a new dataset in a Data Package we need to do two things. First, we need to create a new Data Resource in the package. Second, using the specification of the Data Resource we need to save the actual dataset at the location specified in the Data Resource.

It is possible to edit the datapackage.json file to create the new Data Resource. The package also has a function dp_generate_dataresource() to generate a skeleton Data Resource for a given dataset:

> res <- dp_generate_dataresource(iris, "iris")

Again, the Data Resource can be further modified using methods such as dp_title() and dp_property():

> dp_title(res) <- "The Iris dataset"

Let’s add the resource to the Data Package.

> dp_resources(dp) <- res

In this case the Data Package does not yet contain Data Resources. Should the Data Package contain Data Resources with the same name, these will be overwritten by the new Data Resource.
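The replace-by-name behaviour described above can be pictured with a plain list of resources. This is only a sketch of the semantics, not the package's implementation; upsert_resource() is a hypothetical helper and the resources are reduced to a name and a title.

```r
# Hypothetical helper: add `res` to a list of resources, replacing any
# existing entry that has the same name
upsert_resource <- function(resources, res) {
  existing <- vapply(resources, function(r) r$name, character(1))
  idx <- match(res$name, existing)
  if (is.na(idx)) {
    # No resource with this name yet: append
    resources[[length(resources) + 1L]] <- res
  } else {
    # Same name already present: overwrite in place
    resources[[idx]] <- res
  }
  resources
}

resources <- list()
resources <- upsert_resource(resources, list(name = "iris", title = "first"))
resources <- upsert_resource(resources, list(name = "chickwts", title = "other"))
resources <- upsert_resource(resources, list(name = "iris", title = "second"))

# Still two resources; the "iris" entry now carries the newer title
vapply(resources, function(r) r$title, character(1))
```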

We are now ready to write the dataset. For this we can use the dp_write_data() method:

> dp_write_data(dp, resource_name = "iris", data = iris)

When some of the fields in the Data Resource have categories that are stored in a separate Data Resource, this function will by default also write the category lists associated with the Data Resource.

> readLines(file.path(dir, "iris.csv"), n = 10) |> writeLines()
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,1
4.9,3,1.4,0.2,1
4.7,3.2,1.3,0.2,1
4.6,3.1,1.5,0.2,1
5,3.6,1.4,0.2,1
5.4,3.9,1.7,0.4,1
4.6,3.4,1.4,0.3,1
5,3.4,1.5,0.2,1
4.4,2.9,1.4,0.2,1

And of course we can open the Data Package and read the data back in:

> dp2 <- open_datapackage(dir)
> iris2 <- dp2 |> dp_resource("iris") |> dp_get_data(convert_categories = "to_factor")
> all.equal(iris, iris2, check.attributes = FALSE)
[1] TRUE
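To make the round trip less of a black box, here is a base R sketch of the same idea: the factor is written as integer codes and turned back into a factor using a value/label table. It illustrates the principle only and makes no claim about how dp_write_data() or dp_get_data() work internally; the temporary file and variable names are chosen for this example.

```r
data(iris)
csv_file <- tempfile(fileext = ".csv")

# Keep the labels in a separate value/label table
categories <- data.frame(
  value = seq_along(levels(iris$Species)),
  label = levels(iris$Species)
)

# Store the factor as its integer codes
encoded <- iris
encoded$Species <- as.integer(iris$Species)
write.csv(encoded, csv_file, row.names = FALSE)

# Read the codes back in and map them to their labels again
decoded <- read.csv(csv_file)
decoded$Species <- factor(decoded$Species,
  levels = categories$value, labels = categories$label)

all.equal(iris, decoded)
```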

More on categories

By default dp_generate_dataresource() will generate categories properties for factor fields:

> data(chickwts)
> res <- dp_generate_dataresource(chickwts, "chickwts") 
> dp_resources(dp) <- res
> (feed_name <- dp_resource(dp, "chickwts") |> 
+   dp_field("feed") |> dp_property("categories"))
[[1]]
[[1]]$value
[1] 1

[[1]]$label
[1] "casein"


[[2]]
[[2]]$value
[1] 2

[[2]]$label
[1] "horsebean"


[[3]]
[[3]]$value
[1] 3

[[3]]$label
[1] "linseed"


[[4]]
[[4]]$value
[1] 4

[[4]]$label
[1] "meatmeal"


[[5]]
[[5]]$value
[1] 5

[[5]]$label
[1] "soybean"


[[6]]
[[6]]$value
[1] 6

[[6]]$label
[1] "sunflower"

Here, the list of categories is stored directly in the categories property. It is also possible to store the list of categories in a separate Data Resource:

> res <- dp_generate_dataresource(chickwts, "chickwts", 
+   categories_type = "resource") 
> dp_resources(dp) <- res
> (feed_name <- dp_resource(dp, "chickwts") |> 
+   dp_field("feed") |> dp_property("categories"))
$resource
[1] "feed-categories"

Here the categories property points to a Data Resource. dp_write_data() will automatically create this resource by default when writing the data:

> dp_write_data(dp_resource(dp, "chickwts"), data = chickwts, write_categories = TRUE)
> list.files(dir)
[1] "chickwts.csv"        "datapackage.json"    "feed-categories.csv"
[4] "iris.csv"           
> dp_resource(dp, "feed-categories") |> dp_get_data()
  value     label
1     1    casein
2     2 horsebean
3     3   linseed
4     4  meatmeal
5     5   soybean
6     6 sunflower
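A value/label table like the one above can be derived from the factor levels in base R. The sketch below assumes nothing about the package beyond the numbering shown in the output (sequential integers starting at 1, in level order); the variable names are illustrative.

```r
data(chickwts)

# One row per factor level, numbered in level order
feed_levels <- levels(chickwts$feed)
categories <- data.frame(
  value = seq_along(feed_levels),
  label = feed_levels
)
categories

# The same information as a list of value/label pairs, the shape used
# when categories are stored directly in the field
categories_list <- Map(
  function(v, l) list(value = v, label = l),
  categories$value, categories$label
)
str(categories_list[[2]])
```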

By default the package will generate a list of categories for factor variables. The levels will be numbered using sequential integers starting from 1. The example below shows how different codes can be used.

In order to write the correct codes, we first have to generate and save the dataset that contains these codes. In the example below we do this using R, but it is of course also possible to generate the CSV using other methods (e.g. manual editing):

> codelist <- data.frame(
+   value = c(101, 102, 103, 202, 203, 204),
+   label = c("casein", "horsebean", "linseed", "meatmeal", 
+     "soybean", "sunflower")
+ )
> res <- dp_generate_dataresource(codelist, "feed-categories")
> res
[feed-categories] 

Selected properties:
path     :"feed-categories.csv"
format   :"csv"
mediatype:"text/csv"
encoding :"utf-8"
schema   :Table Schema [2] "value" "label"
> dp_resources(dp) <- res
> codelistres <- dp |> dp_resource("feed-categories")
> dp_write_data(codelistres, data = codelist, write_categories = FALSE)

This creates the correct CSV file:

> readLines(file.path(dir, "feed-categories.csv")) |> writeLines()
"value","label"
101,"casein"
102,"horsebean"
103,"linseed"
202,"meatmeal"
203,"soybean"
204,"sunflower"

When we now write the dataset to file, it will use this code list, as long as we don’t overwrite it. Hence the write_categories = FALSE argument:

> dp_write_data(dp, resource_name = "chickwts", data = chickwts, write_categories = FALSE)

We can see that the correct codes are used in the CSV-file:

> readLines(file.path(dir, "chickwts.csv"), n = 10) |> writeLines()
"weight","feed"
179,102
160,102
136,102
227,102
217,102
168,102
108,102
124,102
143,102
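The recoding visible in this output can be reproduced in base R by looking up each label in the code list. This is a sketch of the mapping only, not of how dp_write_data() is implemented; the code list is repeated here so that the block runs on its own.

```r
data(chickwts)

# The custom code list, repeated so the block is self-contained
codelist <- data.frame(
  value = c(101, 102, 103, 202, 203, 204),
  label = c("casein", "horsebean", "linseed", "meatmeal",
    "soybean", "sunflower")
)

# Look up the code for each observation by matching on the label
codes <- codelist$value[match(as.character(chickwts$feed), codelist$label)]

# The first rows of chickwts are all "horsebean"
head(data.frame(weight = chickwts$weight, feed = codes), 9)
```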

Editing an existing Data Package

Editing of existing Data Packages is also possible. Use the readonly = FALSE argument when opening the Data Package:

> edit <- open_datapackage(dir, readonly = FALSE)
> dp_id(edit) <- "iris_chkwts"
> dp_created(edit) <- Sys.time() |> as.Date()

Showing the complete datapackage.json file after all of the edits in this vignette:

> readLines(file.path(dir, "datapackage.json")) |> writeLines()
{
  "name": "example",
  "title": "An Example Data Package",
  "resources": [
    {
      "name": "iris",
      "path": "iris.csv",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "schema": {
        "fields": [
          {
            "name": "Sepal.Length",
            "type": "number"
          },
          {
            "name": "Sepal.Width",
            "type": "number"
          },
          {
            "name": "Petal.Length",
            "type": "number"
          },
          {
            "name": "Petal.Width",
            "type": "number"
          },
          {
            "name": "Species",
            "type": "integer",
            "categories": [
              {
                "value": 1,
                "label": "setosa"
              },
              {
                "value": 2,
                "label": "versicolor"
              },
              {
                "value": 3,
                "label": "virginica"
              }
            ]
          }
        ]
      },
      "title": "The Iris dataset"
    },
    {
      "name": "chickwts",
      "path": "chickwts.csv",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "schema": {
        "fields": [
          {
            "name": "weight",
            "type": "number"
          },
          {
            "name": "feed",
            "type": "integer",
            "categories": {
              "resource": "feed-categories"
            }
          }
        ]
      }
    },
    {
      "name": "feed-categories",
      "path": "feed-categories.csv",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "schema": {
        "fields": [
          {
            "name": "value",
            "type": "number"
          },
          {
            "name": "label",
            "type": "string"
          }
        ]
      }
    }
  ],
  "description": "This is a description of the Data Package",
  "contributors": [
    {
      "title": "Jane Doe",
      "role": "author",
      "email": "j.doe@organisation.org"
    }
  ],
  "id": "iris_chkwts",
  "created": "2025-03-13"
}