Новый столбец набора данных, основанный на определенных символах из других столбцов (в R)

#r #database #if-statement

Вопрос:

в моем наборе данных я хочу создать новый столбец, который будет зависеть от символов из двух других столбцов. Если в файле longDesciptions.desc.en_US есть текстовое поле, И в то же время внешний код начинается с цифры 1, добавьте значение A в новый столбец. Если в файле longDesciptions.desc.en_US нет словосочетания, И в то же время внешний код начинается с цифры 1, добавьте значение B в новый столбец. В противном случае оставьте его пустым или НЕТ.

 df lt;- structure(list(X.OPERATOR. = c(" Clear and Delete", NA, NA, NA,  NA, "lt;pgt;Je voornaamste taken:lt;/pgt;"), externalCode = c("Job Profile.GUID",  "1008141", "1008168", "1008170", "1008170", NA), longDesciptions.sectionId = c("sectionId",  "199624017", "200226564", "200226592", "200226594", NA), longDesciptions.sectionType = c("sectionType",  "LONGDESCRIPTION", "LONGDESCRIPTION", "LONGDESCRIPTION", "LONGDESCRIPTION",  NA), longDesciptions.desc.en_US = c("US English", "Class: 06, Plage: C, Function code:",  "Class: 03", "Class: 03", "lt;pgt;Als Legal Counsel maak je deel uit van het departement Secretariaat-Generaal. Je ondersteunt zowel de secretaris-generaal en de directie alsook de verschillende entiteiten van Elia groep, zowel op nationaal als internationaal niveau.lt;/pgt;",  NA), longDesciptions.desc.defaultValue = c("Default Value", "Class: 06, Plage: C, Function code:",  "Class: 03", "Class: 03", NA, NA), longDesciptions.desc.en_GB = c("English (United Kingdom)",  "Class: 06, Plage: C, Function code:", "Class: 03", "Class: 03",  NA, NA), longDesciptions.desc.de_DE = c("German (Germany)", NA,  NA, NA, NA, NA), longDesciptions.desc.fr_FR = c("French (France)",  "Classe: 06, Plage: C, Code de la fonction:", "Classe: 03", "Classe: 03",  NA, NA), longDesciptions.desc.nl_NL = c("Dutch (Netherlands)",  "Klasse: 06, Plage: C, Functiecode:", "Klasse: 03", "Klasse: 03",  NA, NA), longDesciptions.status = c("status(Valid Values : A/I A for Active I for Inactive )",  "A", "A", "A", NA, NA), longDesciptions.externalCode = c("externalCode",  "1035137", "1035330", "1035330", NA, NA), longDesciptions.subModule = c("subModule",  NA, NA, NA, NA, NA), NA. = c(NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_), NA..1 = c(NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_ ), NA..2 = c(NA_character_, NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_), NA..3 = c(NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_),   NA..4 = c(NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_), NA..5 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..6 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..7 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..8 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..9 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..10 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..11 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..12 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..13 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..14 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..15 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..16 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..17 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..18 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..19 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..20 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..21 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..22 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..23 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..24 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..25 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..26 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..27 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..28 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..29 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..30 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..31 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  ), NA..32 = c(NA_character_, NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_), NA..33 = c(NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_,   NA_character_), NA..34 = c(NA_character_, NA_character_,   NA_character_, NA_character_, NA_character_, NA_character_  )), class = "data.frame", row.names = c(NA, -6L))   

Я пробовал этот код, но он не работает:

 df2[,49] lt;- NA # names(df2)[49] lt;- "JobDescrip"   for (i in 1 : nrow(df2)) {  if (df2$externalCode[i] == '^1' amp;amp; df2$longDesciptions.sectionId[i]==  '^P') {  df2[i,49] lt;- "A"  }   if (df2$externalCode[i] == '^1') {  df2[i,49] lt;- "B"  }   else {  df2[i,49] lt;- ""  } }  Error in if (df2$externalCode[i] == "^1" amp;amp; df2$longDesciptions.sectionId[i] == :   missing value where TRUE/FALSE needed   

Я знаю, что этот тип вопроса задавался много раз, но я не мог найти решение, приемлемое для моих данных. Любая помощь будет признательна!

Ответ №1:

Вот tidyverse подход, который вы могли бы рассмотреть. Я бы подумал о других векторизованных подходах вместо цикла.

В этом случае вы можете использовать mutate from dplyr для добавления нового столбца и case_when вместо нескольких if операторов для добавления логики. Если первая оценка ложна, то проверяется вторая оценка и так далее.

Если вы используете grepl , вы можете проверить, содержит ли строка «Plage» (вы можете рассмотреть альтернативы для других шаблонов регулярных выражений). Использование substr может просматривать определенные символы в строке.

 library(dplyr)  df %gt;%  mutate(job_descrip = case_when(  grepl("Plage", longDesciptions.desc.en_US) amp; substr(externalCode, 1, 1) == "1" ~ "A",  substr(externalCode, 1, 1) == "1" ~ "B",  TRUE ~ NA_character_  ))