#r #rvest
Question:
I am using rvest to scrape an IMDB list and want to reach the full cast and crew listing for each title. Unfortunately, IMDB now serves a summary page when you click a title, so the links I collect point to the wrong page.
This is the page I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt
This is the page I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl
Note the added /fullcredits in the URL.
How can I insert /fullcredits into the middle of the URL I build?
    # install.packages("rvest")
    # install.packages("dplyr")
    library(rvest)  # web scraping package
    library(dplyr)  # piping

    link = "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
    credits = "fullcredits/"
    page = read_html(link)

    name <- page %>%
      rvest::html_nodes(".lister-item-header a") %>%
      rvest::html_text()

    movie_link = page %>%
      rvest::html_nodes(".lister-item-header a") %>%
      html_attr("href") %>%
      paste("https://www.imdb.com", ., sep = "")
Answer 1:
Here is one option: take the dirname and basename of each link, replace the substring in the basename with the new one ("tt_ql_cl"), and join the two parts back together with file.path, inserting "fullcredits" between them.
    library(stringr)

    # dirname() keeps the title path, basename() keeps the "?ref_=..." part
    movie_link2 <- file.path(
      dirname(movie_link),
      "fullcredits",
      str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl")
    )
Output:
    > head(movie_link2)
    [1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
    [2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
    [3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
    [4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
    [5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
    [6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
    > tail(movie_link2)
    [1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
    [2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
    [3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
    [4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
    [5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
    [6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
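To see why this works, it may help to look at the pieces for a single link. The sketch below uses only base R and the one example URL from the question; the variable names are illustrative, and `sub()` with `fixed = TRUE` stands in for `str_replace()` so that no extra package is needed.

```r
# Single example link taken from the question
url <- "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"

dir_part  <- dirname(url)   # everything before the last path component
base_part <- basename(url)  # last component, here the query string

# Swap the referrer tag; fixed = TRUE treats the pattern as a literal string
new_base <- sub("ttls_li_tt", "tt_ql_cl", base_part, fixed = TRUE)

# Reassemble with "fullcredits" inserted between the two parts
rebuilt <- file.path(dir_part, "fullcredits", new_base)
print(rebuilt)
```

Because `dirname()`, `basename()`, `sub()` and `file.path()` are all vectorised, the same steps apply unchanged to the whole `movie_link` vector.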
Answer 2:
Another way: strip the query string from each link and append the new ending.
    # Drop everything from the "?" onward (the backslash must be doubled in an R string)
    df1 = gsub("\\?.*", "", movie_link)
    # Append the full-credits path and the new referrer tag
    df = paste0(df1, 'fullcredits/?ref_=tt_ql_cl')
    df
     [1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
     [3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
     [5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
     [7] "https://www.imdb.com/title/tt0137523/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0108052/fullcredits/?ref_=tt_ql_cl"
     [9] "https://www.imdb.com/title/tt0118749/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0105236/fullcredits/?ref_=tt_ql_cl"
    [11] "https://www.imdb.com/title/tt0111161/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0073195/fullcredits/?ref_=tt_ql_cl"
    [13] "https://www.imdb.com/title/tt0075314/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0119488/fullcredits/?ref_=tt_ql_cl"
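If the links do not always end in a trailing slash before the query string, the same idea could be wrapped in a small helper that normalises the slash first. This is only a sketch in base R: `insert_fullcredits` and its `ref` argument are hypothetical names introduced here, not part of rvest or of the answer above.

```r
# Hypothetical helper: strip the query string, normalise the trailing slash,
# then append the full-credits path with the chosen referrer tag.
insert_fullcredits <- function(url, ref = "tt_ql_cl") {
  base <- sub("\\?.*$", "", url)  # drop "?..." if present
  base <- sub("/+$", "", base)    # remove any trailing slashes
  paste0(base, "/fullcredits/?ref_=", ref)
}

# Example links: with a query string, and without a trailing slash
links <- c(
  "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt",
  "https://www.imdb.com/title/tt0068646"
)
result <- insert_fullcredits(links)
print(result)
```

Since `sub()` and `paste0()` are vectorised, `insert_fullcredits(movie_link)` would process the whole vector in one call.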