How to split a string at regular intervals in R?

#r #split

Question:

I have a long string that I want to split at regular intervals, say every 10 words:

 x <- "Hrothgar, king of the Danes, or Scyldings, builds a great mead-hall, or palace, in which he hopes to feast his liegemen and to give them presents. The joy of king and retainers is, however, of short duration. Grendel, the monster, is seized with hateful jealousy. He cannot brook the sounds of joyance that reach him down in his fen-dwelling near the hall. Oft and anon he goes to the joyous building, bent on direful mischief. Thane after thane is ruthlessly carried off and devoured, while no one is found strong enough and bold enough to cope with the monster. For twelve years he persecutes Hrothgar and his vassals."
  

With strsplit I can split the sentence into individual words:

 x1 <- unlist(strsplit(x, " "))
  

Using paste I can glue the words back together 10 at a time:

 paste(x1[1:10], collapse = " ")
paste(x1[11:20], collapse = " ")
...
paste(x1[101:110], collapse = " ")
  

But this is tedious, so I tried lapply and seq:

 lapply(x1, function(x) paste(x[seq(1,100,10)], collapse = " "))
  

but the result is not what I want. I want something like this:

 [1] "Hrothgar, king of the Danes, or Scyldings, builds a great"
[2] "mead-hall, or palace, in which he hopes to feast his"
[3] "liegemen and to give them presents. The joy of king"
[4] "and retainers is, however, of short duration. Grendel, the monster,"
[5] "is seized with hateful jealousy. He cannot brook the sounds"
...
[10] "twelve years he persecutes Hrothgar and his vassals. NA NA"
  

I am open to any solution, but would be especially grateful for a base R one.

Answer 1:

Another option using only base R. It uses a regex to capture (\1) groups of 10 words (runs of alphanumeric characters that may contain a hyphen, anchored by a word boundary \b) together with the punctuation that follows them, and appends a distinctive marker string ("XXX" here) to the end of each group. The result can then be split on that marker. Putting a space before the marker in the strsplit pattern avoids a trailing space at the end of each piece:

 unlist(strsplit(gsub("(((\\w|-)+\\b[ ,.]*){10})", "\\1XXX", x), " XXX"))

# [1] "Hrothgar, king of the Danes, or Scyldings, builds a great"          
# [2] "mead-hall, or palace, in which he hopes to feast his"               
# [3] "liegemen and to give them presents. The joy of king"                
# [4] "and retainers is, however, of short duration. Grendel, the monster,"
# [5] "is seized with hateful jealousy. He cannot brook the sounds"        
# [6] "of joyance that reach him down in his fen-dwelling near"            
# [7] "the hall. Oft and anon he goes to the joyous"                       
# [8] "building, bent on direful mischief. Thane after thane is ruthlessly"
# [9] "carried off and devoured, while no one is found strong"             
#[10] "enough and bold enough to cope with the monster. For"               
#[11] "twelve years he persecutes Hrothgar and his vassals."     
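
To see what the marker does at each step, here is a minimal sketch of the same idea on a shorter string. The input s, the group size of 3, and the variable names marked and pieces are made up for illustration and are not part of the answer above.

```r
# Hypothetical short input; groups of 3 words instead of 10
s <- "one two three four five six seven"

# Step 1: append the marker "XXX" after every complete group of 3 words.
# The trailing space of each group is captured, so it ends up before "XXX".
marked <- gsub("(((\\w|-)+\\b[ ,.]*){3})", "\\1XXX", s)
print(marked)

# Step 2: split on the marker together with the space in front of it,
# so that no piece keeps a trailing space
pieces <- unlist(strsplit(marked, " XXX"))
print(pieces)
```

The last word is left without a marker because it does not complete a group of 3, so it simply becomes the final, shorter piece. Note that the approach assumes the marker string never occurs in the text itself.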
  

Comments:

1. Very nice. Playing with your pattern, I also found this stringr solution: str_extract_all(x, "((\\w|-)+\\b[ ,.]*){1,10}")

2. stringr functions can indeed help, but then it is no longer a base R solution 😉

Answer 2:

You could create a sequence of start positions and paste together the words from x1:

 sapply(seq(1, length(x1), 10), function(i) 
       paste0(x1[i:min(i + 9, length(x1))], collapse = " "))

# [1] "Hrothgar, king of the Danes, or Scyldings, builds a great"          
# [2] "mead-hall, or palace, in which he hopes to feast his"               
# [3] "liegemen and to give them presents. The joy of king"                
# [4] "and retainers is, however, of short duration. Grendel, the monster,"
# [5] "is seized with hateful jealousy. He cannot brook the sounds"        
# [6] "of joyance that reach him down in his fen-dwelling near"            
# [7] "the hall. Oft and anon he goes to the joyous"                       
# [8] "building, bent on direful mischief. Thane after thane is ruthlessly"
# [9] "carried off and devoured, while no one is found strong"             
#[10] "enough and bold enough to cope with the monster. For"               
#[11] "twelve years he persecutes Hrothgar and his vassals."        
  

Comments:

1. Great solution. Could you elaborate on this part: x1[i:min(i + 9, length(x1))] ?

2. x1[i:(i + 9)] selects 10 words at a time. In the last iteration i + 9 would be 110, but there are only 108 words in x1. So I use min(i + 9, length(x1)) to take the smaller of i + 9 and length(x1).
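
The clamping described in the comment above can be tried on a small made-up vector. This is only an illustrative sketch: the vector w, the chunk size of 3, and the names starts and chunks are assumptions, not part of the answer.

```r
# Hypothetical vector of 8 "words", split into chunks of 3
w <- letters[1:8]
starts <- seq(1, length(w), 3)   # start positions: 1 4 7

# Without clamping, the last range runs past the end and yields NA
unclamped <- w[7:(7 + 2)]        # "g" "h" NA
print(unclamped)

# With min(), the last chunk stops at the final word instead
chunks <- sapply(starts, function(i)
  paste(w[i:min(i + 2, length(w))], collapse = " "))
print(chunks)                    # "a b c" "d e f" "g h"
```

The unclamped version is also what would produce the trailing "NA NA" shown in the question's desired output, since paste turns NA into the string "NA".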

Answer 3:

You can use gregexpr with regmatches and quantify the words with {1,10}, so that each match covers a run of one to ten words:

 trimws(regmatches(x, gregexpr("([^[:space:]]+\\s*){1,10}", x))[[1]])
# [1] "Hrothgar, king of the Danes, or Scyldings, builds a great"          
# [2] "mead-hall, or palace, in which he hopes to feast his"               
# [3] "liegemen and to give them presents. The joy of king"                
# [4] "and retainers is, however, of short duration. Grendel, the monster,"
# [5] "is seized with hateful jealousy. He cannot brook the sounds"        
# [6] "of joyance that reach him down in his fen-dwelling near"            
# [7] "the hall. Oft and anon he goes to the joyous"                       
# [8] "building, bent on direful mischief. Thane after thane is ruthlessly"
# [9] "carried off and devoured, while no one is found strong"             
#[10] "enough and bold enough to cope with the monster. For"               
#[11] "twelve years he persecutes Hrothgar and his vassals."               
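
If the chunk size needs to vary, the same pattern could be wrapped in a small helper. The function name split_words and its arguments s and n are hypothetical, chosen here only for illustration; this is a sketch of one way to parameterize the call above, not a definitive implementation.

```r
# Hypothetical helper: split string s into chunks of at most n words
split_words <- function(s, n) {
  # Build the pattern "([^[:space:]]+\s*){1,n}" with n filled in
  pattern <- sprintf("([^[:space:]]+\\s*){1,%d}", n)
  # Extract all matches, then drop the whitespace left at the end of each
  trimws(regmatches(s, gregexpr(pattern, s))[[1]])
}

res <- split_words("one two three four five six seven", 3)
print(res)   # "one two three" "four five six" "seven"
```

Because a "word" here is any run of non-whitespace characters, punctuation and hyphens stay attached to their words without special handling.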
  

Answer 4:

Hope this helps:

 sapply(
  unname(split(
    y <- unlist(strsplit(x, " ")),
    ceiling(seq_along(y) / 10)
  )),
  paste,
  collapse = " "
)
  

which gives

  [1] "Hrothgar, king of the Danes, or Scyldings, builds a great"
 [2] "mead-hall, or palace, in which he hopes to feast his"
 [3] "liegemen and to give them presents. The joy of king"
 [4] "and retainers is, however, of short duration. Grendel, the monster,"
 [5] "is seized with hateful jealousy. He cannot brook the sounds"
 [6] "of joyance that reach him down in his fen-dwelling near"
 [7] "the hall. Oft and anon he goes to the joyous"
 [8] "building, bent on direful mischief. Thane after thane is ruthlessly"
 [9] "carried off and devoured, while no one is found strong"
[10] "enough and bold enough to cope with the monster. For"
[11] "twelve years he persecutes Hrothgar and his vassals."
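
The key step in this answer is the grouping vector built with ceiling(seq_along(y) / 10). A minimal sketch on a made-up vector with groups of 3 shows what it looks like. The values in y and the names g and res are illustrative assumptions.

```r
# Hypothetical vector of 7 "words", grouped in threes
y <- c("a", "b", "c", "d", "e", "f", "g")

# Positions 1..7 divided by 3 and rounded up give one group id per word
g <- ceiling(seq_along(y) / 3)
print(g)     # 1 1 1 2 2 2 3

# split() collects the words of each group; paste() glues each group together
res <- sapply(unname(split(y, g)), paste, collapse = " ")
print(res)   # "a b c" "d e f" "g"
```

Since split() only ever sees positions that exist, the last group is shorter and no NA padding appears.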
  

Answer 5:

Using stringr:

 library(stringr)
N = length(strsplit(x, ' ')[[1]]) 
start = seq.int(1, N, 10)
end = start + 9
end[length(end)] = N
word(x, start, end)

# [1] "Hrothgar, king of the Danes, or Scyldings, builds a great"          
# [2] "mead-hall, or palace, in which he hopes to feast his"               
# [3] "liegemen and to give them presents. The joy of king"                
# [4] "and retainers is, however, of short duration. Grendel, the monster,"
# [5] "is seized with hateful jealousy. He cannot brook the sounds"        
# [6] "of joyance that reach him down in his fen-dwelling near"            
# [7] "the hall. Oft and anon he goes to the joyous"                       
# [8] "building, bent on direful mischief. Thane after thane is ruthlessly"
# [9] "carried off and devoured, while no one is found strong"             
# [10] "enough and bold enough to cope with the monster. For"               
# [11] "twelve years he persecutes Hrothgar and his vassals." 
  

Comments:

1. Thanks. lengths(strsplit(x, ' ')) works too. Is this really stringr (you use strsplit, not str_split)?

2. word is from stringr. The rest is base R.