
Parses the free-text observations trait (id_trait = 13) of the given individuals, splits multi-observation entries into atomic phrases, and matches each phrase against a regex ontology to derive standardized rows for the mortality_risk_flag trait (multi-valued) and the dawkins_index trait (single-valued). The original observations trait is not modified.

Usage

standardize_observations(
  individual_ids,
  ontology = NULL,
  add_data = FALSE,
  dry_run = TRUE,
  mortality_trait_name = "mortality_risk_flag",
  dawkins_trait_id = 15L,
  obs_trait_id = 13L,
  flag1_trait_id = 19L,
  con = NULL
)

Arguments

individual_ids

Integer vector of individual IDs.

ontology

A data frame with columns trait, std_value, pattern, or a path to a CSV with those columns. Defaults to the package ontology.

add_data

Logical. If TRUE, upsert the derived rows into the database. Default FALSE.

dry_run

Logical. Only used when add_data = TRUE: if TRUE, preview the upsert without committing changes. Default TRUE.

mortality_trait_name

Name of the categorical trait that receives mortality risk tokens. Default "mortality_risk_flag".

dawkins_trait_id

Trait ID of the dawkins_index trait. Default 15L.

obs_trait_id

Trait ID of the free-text observations source. Default 13L.

flag1_trait_id

Trait ID of flag1_rainfor (single-letter alive-stem condition code). Default 19L. Codes are decoded with the OpenForis mapping (.default_observation_flags) and appended to the mortality rows derived from free text. Rows are de-duplicated per (id_n, id_sub_plots, std_value); the source_phrases column records whether a row came from text, from a flag, or both.

con

Database connection. Defaults to call.mydb().
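The de-duplication described under flag1_trait_id can be pictured with a minimal base R sketch. It is not the package's implementation: the data, the "flag1:b" label and the " | " separator are assumptions made for illustration only.

```r
# Hypothetical rows derived from free text and from a decoded flag code.
from_text <- data.frame(
  id_n = c(1L, 1L), id_sub_plots = c(10L, 10L),
  std_value = c("broken_stem", "leaning"),
  source_phrases = c("broken stem", "leaning"),
  stringsAsFactors = FALSE
)
from_flag <- data.frame(
  id_n = 1L, id_sub_plots = 10L,
  std_value = "broken_stem",
  source_phrases = "flag1:b",  # illustrative label, not the package's format
  stringsAsFactors = FALSE
)

combined <- rbind(from_text, from_flag)

# Keep one row per (id_n, id_sub_plots, std_value) and concatenate the
# provenance so a row can show that it came from text, a flag, or both.
dedup <- aggregate(
  source_phrases ~ id_n + id_sub_plots + std_value,
  data = combined,
  FUN = function(x) paste(unique(x), collapse = " | ")
)
```

Here the "broken_stem" token appears once in dedup, with both of its sources recorded in source_phrases.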

Value

A tibble with one row per (id_n, census_name, trait, std_value):

id_n

Individual ID.

id_table_liste_plots

Plot ID.

id_sub_plots

Census subplot ID (used for DB linking).

plot_name, tag

Plot name and stem tag.

census_name, census_date

Census label and date.

trait

Target trait — "mortality_risk_flag" or "dawkins_index".

std_value

Standardized token.

source_phrases

The raw phrase(s) that triggered the match.

full_observation

The full original observations string.

skip_existing

Logical — TRUE for dawkins rows whose individual x census already has a dawkins value in the DB (these rows are skipped on write).

Additionally, the attribute "unresolved" on the returned tibble holds a tibble of phrases (with counts) that matched no pattern — useful for growing the ontology.
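As a rough sketch of how the "unresolved" attribute might be inspected, the snippet below attaches a mock table to a stand-in result and sorts it by frequency. The column names phrase and n are assumptions for this example; check the actual attribute for its real column names.

```r
# Stand-in for a result; a real call needs a database connection.
res <- data.frame(id_n = 1L, std_value = "leaning", stringsAsFactors = FALSE)

# Hypothetical shape of the "unresolved" table (column names are assumed).
attr(res, "unresolved") <- data.frame(
  phrase = c("liana in crown", "termites", "bark damage"),
  n      = c(12L, 3L, 7L),
  stringsAsFactors = FALSE
)

unresolved <- attr(res, "unresolved")

# Most frequent unmatched phrases first: candidates for new ontology patterns.
top <- unresolved[order(-unresolved$n), ]
```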

Details

Existing dawkins_index measurements are never overwritten. Derived dawkins values for individual x census combinations that already have a dawkins value in the DB are excluded from the DB write; they remain in the returned tibble, flagged with skip_existing = TRUE.
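To see which derived rows are eligible for a write, the returned tibble can be split on skip_existing. The sketch below uses a mock data frame with invented values in place of a real result.

```r
# Mock result with only the columns needed here; values are invented.
res <- data.frame(
  id_n = c(1L, 1L, 2L),
  trait = c("dawkins_index", "mortality_risk_flag", "dawkins_index"),
  std_value = c("3", "leaning", "4"),
  skip_existing = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Rows that remain candidates for the upsert.
to_write <- res[!res$skip_existing, ]

# Dawkins rows that are reported but left out of the write.
skipped <- res[res$skip_existing, ]
```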

Ontology

By default the function reads system.file("ontology", "observations_ontology.csv", package = "CafriplotsR"). Columns expected: trait, std_value, pattern. Patterns are case-insensitive Perl regexes. Provide ontology (a data frame or a path) to override.
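A custom ontology only needs the three columns listed above. The sketch below builds one and matches phrases with case-insensitive Perl regexes in base R. The tokens, the patterns, the ";" or "," splitting rule and the helpers split_phrases() and match_phrase() are all hypothetical; the package's own splitting and matching logic may differ.

```r
# Hypothetical ontology rows: tokens and patterns are illustrative only.
ontology <- data.frame(
  trait     = c("mortality_risk_flag", "mortality_risk_flag", "dawkins_index"),
  std_value = c("broken_stem", "leaning", "3"),
  pattern   = c("broken\\s+(stem|trunk)", "\\bleaning\\b", "dawkins\\s*3"),
  stringsAsFactors = FALSE
)

# Assumed splitting rule: atomic phrases are separated by ';' or ','.
split_phrases <- function(x) {
  parts <- trimws(strsplit(x, "[;,]")[[1]])
  parts[nzchar(parts)]
}

# Return the ontology rows whose pattern matches a phrase
# (case-insensitive Perl regex).
match_phrase <- function(phrase, ontology) {
  hit <- vapply(
    ontology$pattern,
    function(p) grepl(p, phrase, perl = TRUE, ignore.case = TRUE),
    logical(1)
  )
  ontology[hit, c("trait", "std_value"), drop = FALSE]
}

phrases <- split_phrases("Broken stem; LEANING, liana in crown")
matches <- lapply(phrases, match_phrase, ontology = ontology)
names(matches) <- phrases
```

A data frame like this could then be passed as the ontology argument. Phrases with zero matching rows, such as "liana in crown" here, correspond to what the function reports as unresolved.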

Examples

if (FALSE) { # \dontrun{
con <- call.mydb()
plots <- query_plots(method = "1ha-IRD", extract_individuals = TRUE,
                     country = "CAMEROON")
res <- standardize_observations(individual_ids = plots$individuals$id_n,
                                con = con)
attr(res, "unresolved")

# Preview the upsert
standardize_observations(individual_ids = plots$individuals$id_n,
                         add_data = TRUE, dry_run = TRUE, con = con)

# Commit
standardize_observations(individual_ids = plots$individuals$id_n,
                         add_data = TRUE, dry_run = FALSE, con = con)
} # }