<!--
%\VignetteEngine{simplermarkdown::mdweave_to_html}
%\VignetteIndexEntry{Creating a Data Package}
-->

---
title: Creating a Data Package
author: Jan van der Laan
css: "style.css"
---

This vignette will show how to create and edit Data Packages.

## Creating a Data Package

The `new_datapackage()` function creates a new Data Package:

```{.R #n1}
library(datapackage)

dir <- tempfile()
dp <- new_datapackage(dir, name = "example", 
  title = "An Example Data Package")
dp
```
This returns an `editabledatapackage`: any change to the Data Package is
immediately saved to the `datapackage.json` file, and properties are read from
that file whenever they are accessed. It is, therefore, possible to manually
edit the `datapackage.json` file while working in R with the Data Package.

```{.R #n2}
list.files(dir)
```

The properties of the Data Package can be modified using methods such as
`dp_title()` and `dp_description()`:

```{.R #n3}
dp_description(dp) <- "This is a description of the Data Package"
```

The `dp_description<-()` method also accepts a character vector of length > 1.
This makes it easy to read the description from a file, since long descriptions
are awkward to write directly in R code. Markdown can be used in the
description.

```{.R #n4 eval=FALSE}
dp_description(dp) <- readLines("description.md")
```
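
As a small self-contained illustration in base R (the file contents are made up
for this sketch), `readLines()` returns one element per line, which is the kind
of character vector that can be assigned to the description:

```{.R #s1}
# Write a short markdown description to a temporary file (hypothetical content)
desc_file <- tempfile(fileext = ".md")
writeLines(c("# Example", "", "A *longer* description written in markdown."),
  desc_file)

# readLines() returns a character vector with one element per line
desc_lines <- readLines(desc_file)
length(desc_lines)
desc_lines

unlink(desc_file)
```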

The following methods are currently (at the time of writing this vignette) supported:

- `dp_title<-()`
- `dp_contributors<-()` and `dp_add_contributor<-()`
- `dp_description<-()`
- `dp_id<-()`
- `dp_name<-()`
- `dp_created<-()`
- `dp_keywords<-()`
- `dp_property<-()`: this function also allows setting custom properties.

For an up-to-date list, run the following:

```{.R #n5}
methods(class = "datapackage") |> (\(x) x[grep("<-", x)])()
```
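
The anonymous function in this pipeline only filters a character vector. A
minimal sketch with a hand-written vector of names (not the actual output of
`methods()`) shows the idea. In this sketch `fixed = TRUE` is added so that
`<-` is matched literally:

```{.R #s2}
# Hypothetical method names; the real list comes from methods()
method_names <- c("dp_title", "dp_title<-", "dp_name", "dp_name<-")

# Keep only the replacement functions, i.e. names containing "<-"
setters <- method_names |> (\(x) x[grep("<-", x, fixed = TRUE)])()
setters
```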

Below is an example of adding a contributor to the package:

```{.R #n6}
dp_add_contributor(dp) <- new_contributor("Jane Doe", role = "author",
  email = "j.doe@organisation.org")
```


## Adding a dataset to the Data Package

In this example we will save the `iris` dataset to the Data Package created above.

```{.R #a1}
data(iris)
head(iris)
```

In order to store a new dataset in a Data Package we need to do two things.
First, we need to create a new Data Resource in the package. Second, using the
specification of the Data Resource we need to save the actual dataset at the
location specified in the Data Resource.
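
To make the two steps concrete, here is a hand-rolled sketch in base R. It does
not use the package and is not how the package implements these steps. The
property names (`name`, `path`, `format`, `schema`, `fields`) follow the Data
Package specification, while the helper `type_of()` and its type mapping are
simplifying assumptions made only for this illustration:

```{.R #s3}
# Step 1: describe the dataset with a hand-written descriptor (a plain list).
# type_of() is a hypothetical helper with a deliberately simple type mapping.
type_of <- function(x) {
  if (is.integer(x)) "integer"
  else if (is.numeric(x)) "number"
  else "string"
}
sketch_resource <- list(
  name = "iris",
  path = "iris.csv",
  format = "csv",
  schema = list(fields = lapply(names(iris), function(nm)
    list(name = nm, type = type_of(iris[[nm]]))))
)
str(sketch_resource$schema$fields[[1]])

# Step 2: write the data to the location given by the descriptor
sketch_dir <- tempfile()
dir.create(sketch_dir)
write.csv(iris, file.path(sketch_dir, sketch_resource$path), row.names = FALSE)
list.files(sketch_dir)

unlink(sketch_dir, recursive = TRUE)
```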

It is possible to edit the `datapackage.json` file to create the new
Data Resource. The package also has a function `dp_generate_dataresource()` to
generate a skeleton Data Resource for a given dataset:

```{.R #a10}
res <- dp_generate_dataresource(iris, "iris") 
```

Again, the Data Resource can be modified further using methods such as
`dp_title()` and `dp_property()`:

```{.R #a30}
dp_title(res) <- "The Iris dataset"
```

Let's add the resource to the Data Package.

```{.R #a40}
dp_resources(dp) <- res
```

In this case the Data Package does not yet contain Data Resources. Should the
Data Package contain Data Resources with the same name, these will be overwritten
by the new Data Resource.
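
The overwrite-by-name behaviour can be pictured with an ordinary list of
resources. This is only an illustration of the semantics in base R, using a
hypothetical helper `add_or_replace()`; it is not how the package stores
resources internally:

```{.R #s4}
# Hypothetical helper: add a resource, replacing any resource with the same name
add_or_replace <- function(resources, new) {
  existing <- vapply(resources, function(r) r$name, character(1))
  keep <- resources[existing != new$name]
  c(keep, list(new))
}

sketch_resources <- list(list(name = "iris", title = "Old title"))
sketch_resources <- add_or_replace(sketch_resources,
  list(name = "iris", title = "New title"))
sketch_resources <- add_or_replace(sketch_resources,
  list(name = "chickwts", title = "Chicks"))

# Still only one "iris" resource, now with the new title
vapply(sketch_resources, function(r) r$name, character(1))
sketch_resources[[1]]$title
```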

We are now ready to write the dataset. For this we can use the `dp_write_data()`
method:

```{.R #a50}
dp_write_data(dp, resource_name = "iris", data = iris)
```

When some of the fields in the Data Resource have categories that are stored in
a separate Data Resource, this function will by default also write any
category lists associated with the Data Resource.

```{.R #a60}
readLines(file.path(dir, "iris.csv"), n = 10) |> writeLines()
```

And of course we can open the Data Package and read the data back in:

```{.R #a70}
dp2 <- open_datapackage(dir)
iris2 <- dp2 |> dp_resource("iris") |> dp_get_data(convert_categories = "to_factor")
all.equal(iris, iris2, check.attributes = FALSE)
```


## More on categories

By default `dp_generate_dataresource()` will generate `categories` properties for
factor fields:

```{.R #c00}
data(chickwts)

res <- dp_generate_dataresource(chickwts, "chickwts") 
dp_resources(dp) <- res

(feed_name <- dp_resource(dp, "chickwts") |> 
  dp_field("feed") |> dp_property("categories"))
```

Here, the list of categories is stored directly in the `categories` property. It
is also possible to store the list of categories in a separate Data Resource:

```{.R #c01}
res <- dp_generate_dataresource(chickwts, "chickwts", 
  categories_type = "resource") 
dp_resources(dp) <- res

(feed_name <- dp_resource(dp, "chickwts") |> 
  dp_field("feed") |> dp_property("categories"))
```
Here the `categories` property points to a Data Resource. By default,
`dp_write_data()` will automatically create this resource when writing the data:

```{.R #c02}
dp_write_data(dp_resource(dp, "chickwts"), data = chickwts, write_categories = TRUE)
list.files(dir)

dp_resource(dp, "feed-categories") |> dp_get_data()
```

By default the package will generate a list of categories for factor variables.
The levels will be numbered using sequential integers starting from 1. The
example below shows how different codes can be used. 
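
Before that, for reference, the default numbering described above can be
reproduced in base R. This is a sketch of the idea rather than the package's
own code; the column names `value` and `label` are chosen to match the code
list used in this vignette:

```{.R #s5}
data(chickwts)

# Levels of the factor, in the order R stores them
feed_levels <- levels(chickwts$feed)

# Sequential integer codes starting from 1, paired with the labels
default_codes <- data.frame(
  value = seq_along(feed_levels),
  label = feed_levels
)
default_codes

# The integer codes of the factor index into this table
head(as.integer(chickwts$feed))
```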

In order to write the correct codes, we first have to generate and save the
dataset with the correct codes. In the example below we do this using R, but it
is of course also possible to generate the CSV file by other means (e.g. manual
editing):
```{.R #c10}
codelist <- data.frame(
  value = c(101, 102, 103, 202, 203, 204),
  label = c("casein", "horsebean", "linseed", "meatmeal", 
    "soybean", "sunflower")
)
res <- dp_generate_dataresource(codelist, "feed-categories")
res
dp_resources(dp) <- res

codelistres <- dp |> dp_resource("feed-categories")
dp_write_data(codelistres, data = codelist, write_categories = FALSE)
```
This creates the correct CSV file:

```{.R #c20}
readLines(file.path(dir, "feed-categories.csv")) |> writeLines()
```

When we now write the dataset to file, it will use this code list, as long as
we do not overwrite it. Hence the `write_categories = FALSE` argument:

```{.R #c30}
dp_write_data(dp, resource_name = "chickwts", data = chickwts, write_categories = FALSE)
```

We can see that the correct codes are used in the CSV-file:
```{.R #c40}
readLines(file.path(dir, "chickwts.csv"), n = 10) |> writeLines()
```
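
Conceptually, writing with a custom code list amounts to a label-to-code
lookup. A base-R sketch of that lookup follows; it is an illustration only and
not the package's actual implementation. The code list is repeated under a
different name so that the block stands on its own:

```{.R #s6}
data(chickwts)

# Same codes and labels as the code list used in this vignette
custom_codes <- data.frame(
  value = c(101, 102, 103, 202, 203, 204),
  label = c("casein", "horsebean", "linseed", "meatmeal",
    "soybean", "sunflower")
)

# Look up the code for each observation by matching on the label
feed_codes <- custom_codes$value[match(as.character(chickwts$feed),
  custom_codes$label)]

# Every observation gets a code, and only codes from the list are used
table(feed_codes)
```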


## Editing an existing Data Package

Editing of existing Data Packages is also possible. Use the `readonly = FALSE`
argument when opening the Data Package:

```{.R #e00}
edit <- open_datapackage(dir, readonly = FALSE)
 
dp_id(edit) <- "iris_chkwts"
dp_created(edit) <- Sys.time() |> as.Date()
```

The complete `datapackage.json` file after all of the edits in this vignette
looks as follows:
```{.R #e10}
readLines(file.path(dir, "datapackage.json")) |> writeLines()
```


```{.R #cleanup echo=FALSE results=FALSE}
file.remove(list.files(dir, full.names = TRUE)) 
file.remove(dir)
```