Inserting a string into the middle of a URL in R

#r #rvest

Question:

I'm using rvest to scrape an IMDB list and want to reach the full cast and crew page for each title. Unfortunately, IMDB now serves a summary page when you click a title, so the links I collect lead to the wrong page.

This is the page I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt

This is the page I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl

Note the /fullcredits added to the URL.

How can I insert /fullcredits into the middle of the URL I've built?

    #install.packages("rvest")
    #install.packages("dplyr")

    library(rvest) # web scraping package
    library(dplyr) # piping

    link = "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
    credits = "fullcredits/"
    page = read_html(link)

    name <- page %>% rvest::html_nodes(".lister-item-header a") %>% rvest::html_text()
    movie_link = page %>% rvest::html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste("https://www.imdb.com", ., sep="")

Answer #1:

Here is one option: take the dirname and basename of each link, replace the substring "ttls_li_tt" in the basename with the new one ("tt_ql_cl"), and join the parts back together with file.path, inserting "fullcredits" between them.

    library(stringr)
    movie_link2 <- file.path(dirname(movie_link), "fullcredits",
      str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))

Output:

    > head(movie_link2)
    [1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
    [2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
    [3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
    [4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
    [5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
    [6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
    > tail(movie_link2)
    [1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
    [2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
    [3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
    [4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
    [5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
    [6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
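To see why this works, it can help to look at the intermediate pieces for a single link. The sketch below is only an illustration: `one_link` is a hypothetical name standing in for one element of `movie_link`, and base R's `sub(..., fixed = TRUE)` is swapped in for `str_replace` so that no extra package is needed.

```r
# Hypothetical single element of movie_link (URL taken from the question)
one_link <- "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"

# dirname() keeps everything before the last "/", basename() everything after it
head_part <- dirname(one_link)   # the title URL without its trailing slash
tail_part <- basename(one_link)  # the query string that follows the last "/"

# Swap the referrer tag; fixed = TRUE treats the pattern as a literal string
new_tail <- sub("ttls_li_tt", "tt_ql_cl", tail_part, fixed = TRUE)

# file.path() joins the pieces with "/", which puts "fullcredits" in the middle
one_result <- file.path(head_part, "fullcredits", new_tail)
print(one_result)
```

Because `dirname`, `basename`, `sub` and `file.path` are all vectorized, the same three steps apply unchanged to the whole `movie_link` vector.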

Answer #2:

Another way: strip the query string from each link, then append the new path and query.

    df1 = gsub("\\?.*", "", movie_link)  # drop the query string; the trailing "/" stays
    df = paste0(df1, 'fullcredits/?ref_=tt_ql_cl')
    df

     [1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
     [3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
     [5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
     [7] "https://www.imdb.com/title/tt0137523/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0108052/fullcredits/?ref_=tt_ql_cl"
     [9] "https://www.imdb.com/title/tt0118749/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0105236/fullcredits/?ref_=tt_ql_cl"
    [11] "https://www.imdb.com/title/tt0111161/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0073195/fullcredits/?ref_=tt_ql_cl"
    [13] "https://www.imdb.com/title/tt0075314/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0119488/fullcredits/?ref_=tt_ql_cl"
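The strip-and-append idea can also be done in a single substitution with a capture group. This is a rough sketch rather than part of either answer: `add_fullcredits` and `example_links` are hypothetical names, and the pattern assumes every link contains `/title/tt` followed by digits, as the links in the question do.

```r
# Hypothetical helper (not from rvest or any package): keep the "/title/tt<digits>"
# part via the capture group and replace whatever follows it.
add_fullcredits <- function(urls) {
  sub("(/title/tt[0-9]+)/.*$", "\\1/fullcredits/?ref_=tt_ql_cl", urls)
}

# Two links in the shape produced by the question's code
example_links <- c(
  "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt",
  "https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt"
)

credit_links <- add_fullcredits(example_links)
print(credit_links)
```

A link that does not match the pattern is returned unchanged, so it may be worth checking the result with `grepl("fullcredits", credit_links)` before requesting the pages.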