This vignette shows how to create and edit Data Packages.
The new_datapackage() function creates a new Data Package:
> library(datapackage)
> dir <- tempfile()
> dp <- new_datapackage(dir, name = "example",
+ title = "An Example Data Package")
> dp
[example] An Example Data Package
Location: </tmp/RtmpQEaJp2/file41e6ac10d5a>
<NO RESOURCES>
This returns an editabledatapackage. This means that any changes to the Data Package are immediately saved to the datapackage.json file, and that properties are read from that file whenever they are accessed. It is therefore possible to edit the datapackage.json file manually while working with the Data Package in R.
Methods such as dp_title() and dp_description() can be used to modify the properties of the Data Package.
The dp_description<-() method also accepts a character vector of length > 1. This makes it easy to read the contents of the description from a file, as it can be difficult to write long descriptions directly in R code. It is possible to use markdown in the description.
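As a minimal sketch of the setters described above: the new title, the description text and the temporary description file are illustrative assumptions, not part of the original example.

```r
library(datapackage)

# Create a fresh, editable Data Package in a temporary directory
dir <- tempfile()
dp <- new_datapackage(dir, name = "example",
  title = "An Example Data Package")

# Change the title; the change is saved to datapackage.json immediately
dp_title(dp) <- "A Renamed Example Data Package"

# A long description is easier to maintain in a separate text file.
# 'descr_file' is a hypothetical file created here only for illustration.
descr_file <- tempfile(fileext = ".md")
writeLines(c("This is a description of the Data Package.", "",
  "It can span *multiple* lines and use markdown."), descr_file)

# readLines() returns one element per line, so this passes a character
# vector of length > 1 to the setter
dp_description(dp) <- readLines(descr_file)

dp_title(dp)
dp_description(dp)
```

Keeping the description in its own file also makes it possible to edit it with any text editor and re-run only the last assignment.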
The following methods are currently (at the time of writing this vignette) supported:
dp_title<-()
dp_contributors<-() and dp_add_contributor<-()
dp_description<-()
dp_id<-()
dp_name<-()
dp_created<-()
dp_keywords<-()
dp_property<-(): this function also allows custom properties.
For an up-to-date list, run the following:
> methods(class = "datapackage") |> (\(x) x[grep("<-", x)])()
Below is an example of adding a contributor to the package.
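The original code for this step is not shown here. The sketch below is one possible way to do it that relies only on dp_property<-() from the list above. It assumes that dp_property<-() accepts a nested list for the contributors property; the contributor details are illustrative. dp_add_contributor<-() is presumably the more convenient route, but the form of the value it expects is not shown in this text, so it is not used here.

```r
library(datapackage)

dir <- tempfile()
dp <- new_datapackage(dir, name = "example",
  title = "An Example Data Package")

# Assumption: the 'contributors' property can be set as a list with one
# element per contributor, each itself a list of fields.
dp_property(dp, "contributors") <- list(
  list(title = "Jane Doe", role = "author",
    email = "j.doe@organisation.org")
)

# Read the property back from datapackage.json
dp_property(dp, "contributors")
```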
In this example we will save the iris dataset to a new Data Package.
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
In order to store a new dataset in a Data Package we need to do two things. First, we need to create a new Data Resource in the package. Second, using the specification of the Data Resource we need to save the actual dataset at the location specified in the Data Resource.
It is possible to edit the datapackage.json file to create the new Data Resource. The package also has a function dp_generate_dataresource() to generate a skeleton Data Resource for a given dataset:
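A minimal sketch of this step, using the same call form that appears later in this vignette for chickwts (the dataset first, then the resource name):

```r
library(datapackage)

data(iris)

# Generate a skeleton Data Resource describing the iris data frame.
# Nothing is written to disk at this point.
res <- dp_generate_dataresource(iris, "iris")
res

# The generated path is derived from the resource name
dp_property(res, "path")
```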
Again, these can be further modified using methods such as dp_title() and dp_property():
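For instance, a title can be set on the generated resource before it is added to the package. This is a sketch; the title matches the one in the datapackage.json shown at the end of this vignette.

```r
library(datapackage)

data(iris)
res <- dp_generate_dataresource(iris, "iris")

# Set a human readable title on the Data Resource
dp_title(res) <- "The Iris dataset"

# Properties can be read back with the matching getter or with the
# generic dp_property()
dp_title(res)
dp_property(res, "title")
```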
Let’s add the resources to the Data Package.
In this case the Data Package does not yet contain Data Resources. Should the Data Package contain Data Resources with the same name, these will be overwritten by the new Data Resource.
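Adding the resource can be sketched as follows, using the dp_resources<-() assignment that also appears further on in this vignette. The second assignment illustrates the overwrite behaviour described above.

```r
library(datapackage)

dir <- tempfile()
dp <- new_datapackage(dir, name = "example",
  title = "An Example Data Package")

data(iris)
res <- dp_generate_dataresource(iris, "iris")
dp_title(res) <- "The Iris dataset"

# Add the Data Resource to the Data Package; datapackage.json is updated
dp_resources(dp) <- res
dp

# Assigning a resource with the same name again replaces the existing
# one instead of adding a second copy
dp_resources(dp) <- res
```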
We are now ready to write the dataset. For this we can use the dp_write_data() method:
When some of the fields in the Data Resource have categories that are stored in a separate Data Resource, this function will by default also write the category lists associated with the Data Resource.
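A sketch of the write step, assuming the package and resource are set up as in the preceding steps. The data = argument is used in the same way as in the chickwts example later in this vignette.

```r
library(datapackage)

dir <- tempfile()
dp <- new_datapackage(dir, name = "example",
  title = "An Example Data Package")

data(iris)
dp_resources(dp) <- dp_generate_dataresource(iris, "iris")

# Write the data to the location given by the resource's path property
# (iris.csv inside the Data Package directory)
dp_write_data(dp_resource(dp, "iris"), data = iris)

list.files(dir)
```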
> readLines(file.path(dir, "iris.csv"), n = 10) |> writeLines()
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,1
4.9,3,1.4,0.2,1
4.7,3.2,1.3,0.2,1
4.6,3.1,1.5,0.2,1
5,3.6,1.4,0.2,1
5.4,3.9,1.7,0.4,1
4.6,3.4,1.4,0.3,1
5,3.4,1.5,0.2,1
4.4,2.9,1.4,0.2,1
And of course we can open the Data Package and read the data back in:
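Reading the data back can be sketched as below, using open_datapackage() and dp_get_data() as they appear elsewhere in this vignette. The setup lines only recreate the package so that the example is self-contained.

```r
library(datapackage)

# Setup: create a package containing the iris data
dir <- tempfile()
dp <- new_datapackage(dir, name = "example",
  title = "An Example Data Package")
data(iris)
dp_resources(dp) <- dp_generate_dataresource(iris, "iris")
dp_write_data(dp_resource(dp, "iris"), data = iris)

# Open the Data Package from disk and read the iris resource
dp2 <- open_datapackage(dir)
dta <- dp2 |> dp_resource("iris") |> dp_get_data()
head(dta)
```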
By default dp_generate_dataresource() will generate categories properties for factor fields:
> data(chickwts)
> res <- dp_generate_dataresource(chickwts, "chickwts")
> dp_resources(dp) <- res
> (feed_name <- dp_resource(dp, "chickwts") |>
+ dp_field("feed") |> dp_property("categories"))
[[1]]
[[1]]$value
[1] 1
[[1]]$label
[1] "casein"
[[2]]
[[2]]$value
[1] 2
[[2]]$label
[1] "horsebean"
[[3]]
[[3]]$value
[1] 3
[[3]]$label
[1] "linseed"
[[4]]
[[4]]$value
[1] 4
[[4]]$label
[1] "meatmeal"
[[5]]
[[5]]$value
[1] 5
[[5]]$label
[1] "soybean"
[[6]]
[[6]]$value
[1] 6
[[6]]$label
[1] "sunflower"
Here, the list of categories is stored directly in the categories property. It is also possible to store the list of categories in a Data Resource:
> res <- dp_generate_dataresource(chickwts, "chickwts",
+ categories_type = "resource")
> dp_resources(dp) <- res
> (feed_name <- dp_resource(dp, "chickwts") |>
+ dp_field("feed") |> dp_property("categories"))
$resource
[1] "feed-categories"
Here the categories property points to a Data Resource. dp_write_data() will automatically create this resource by default when writing the data:
> dp_write_data(dp_resource(dp, "chickwts"), data = chickwts, write_categories = TRUE)
> list.files(dir)
[1] "chickwts.csv" "datapackage.json" "feed-categories.csv"
[4] "iris.csv"
> dp_resource(dp, "feed-categories") |> dp_get_data()
  value     label
1     1    casein
2     2 horsebean
3     3   linseed
4     4  meatmeal
5     5   soybean
6     6 sunflower
By default the package will generate a list of categories for factor variables. The levels will be numbered using sequential integers starting from 1. The example below shows how different codes can be used.
In order to write the correct codes we first have to generate and save the dataset containing the correct codes. In the example below we do this using R, but it is of course also possible to generate the CSV using other methods (e.g. manual editing):
> codelist <- data.frame(
+ value = c(101, 102, 103, 202, 203, 204),
+ label = c("casein", "horsebean", "linseed", "meatmeal",
+ "soybean", "sunflower")
+ )
> res <- dp_generate_dataresource(codelist, "feed-categories")
> res
[feed-categories]
Selected properties:
path :"feed-categories.csv"
format :"csv"
mediatype:"text/csv"
encoding :"utf-8"
schema :Table Schema [2] "value" "label"
> dp_resources(dp) <- res
> codelistres <- dp |> dp_resource("feed-categories")
> dp_write_data(codelistres, data = codelist, write_categories = FALSE)
This creates the correct CSV-file:
> readLines(file.path(dir, "feed-categories.csv")) |> writeLines()
"value","label"
101,"casein"
102,"horsebean"
103,"linseed"
202,"meatmeal"
203,"soybean"
204,"sunflower"
When we now write the dataset to file, it will use this code list as long as we do not overwrite it. This is why write_categories = FALSE is passed to dp_write_data().
We can see that the correct codes are used in the CSV-file:
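The whole custom-code workflow can be sketched end to end as follows. It repeats the calls shown above in a fresh temporary package so that it runs on its own. The final dp_write_data() call and the inspection of chickwts.csv are reconstructions of the steps described in the text, not the original code.

```r
library(datapackage)

dir <- tempfile()
dp <- new_datapackage(dir, name = "example",
  title = "An Example Data Package")

# Resource for the data, with the categories kept in a separate resource
data(chickwts)
dp_resources(dp) <- dp_generate_dataresource(chickwts, "chickwts",
  categories_type = "resource")

# Custom code list, saved under the name the categories property refers to
codelist <- data.frame(
  value = c(101, 102, 103, 202, 203, 204),
  label = c("casein", "horsebean", "linseed", "meatmeal",
    "soybean", "sunflower")
)
dp_resources(dp) <- dp_generate_dataresource(codelist, "feed-categories")
dp_write_data(dp_resource(dp, "feed-categories"), data = codelist,
  write_categories = FALSE)

# Write the data itself without regenerating (and so overwriting) the
# custom code list
dp_write_data(dp_resource(dp, "chickwts"), data = chickwts,
  write_categories = FALSE)

# Inspect the first lines of the written CSV file
readLines(file.path(dir, "chickwts.csv"), n = 10) |> writeLines()
```

With write_categories = TRUE in the last call, feed-categories.csv would be regenerated with the default sequential codes, which is what this example avoids.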
Editing of existing Data Packages is also possible. Use the readonly = FALSE argument when opening the Data Package:
> edit <- open_datapackage(dir, readonly = FALSE)
> dp_id(edit) <- "iris_chkwts"
> dp_created(edit) <- Sys.time() |> as.Date()
The complete datapackage.json file after all of the edits in this vignette:
> readLines(file.path(dir, "datapackage.json")) |> writeLines()
{
"name": "example",
"title": "An Example Data Package",
"resources": [
{
"name": "iris",
"path": "iris.csv",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "Sepal.Length",
"type": "number"
},
{
"name": "Sepal.Width",
"type": "number"
},
{
"name": "Petal.Length",
"type": "number"
},
{
"name": "Petal.Width",
"type": "number"
},
{
"name": "Species",
"type": "integer",
"categories": [
{
"value": 1,
"label": "setosa"
},
{
"value": 2,
"label": "versicolor"
},
{
"value": 3,
"label": "virginica"
}
]
}
]
},
"title": "The Iris dataset"
},
{
"name": "chickwts",
"path": "chickwts.csv",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "weight",
"type": "number"
},
{
"name": "feed",
"type": "integer",
"categories": {
"resource": "feed-categories"
}
}
]
}
},
{
"name": "feed-categories",
"path": "feed-categories.csv",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "value",
"type": "number"
},
{
"name": "label",
"type": "string"
}
]
}
}
],
"description": "This is a description of the Data Package",
"contributors": [
{
"title": "Jane Doe",
"role": "author",
"email": "j.doe@organisation.org"
}
],
"id": "iris_chkwts",
"created": "2025-03-13"
}