Here I trigger the crawl by calling an API, so I wrote the serializer first:
from rest_framework import serializers

class IgCommentsSerializer(serializers.Serializer):
    post = serializers.CharField(max_length=1000)   # comment text
    poster = serializers.CharField(max_length=200)  # commenter's username
Start with the basic Selenium WebDriver setup. To scrape all of the comments you have to log in to IG first, then jump to the post page you want:
import time

from rest_framework.response import Response
from rest_framework.views import APIView
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class IgComments(APIView):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.path = 'path to chromedriver'
        self.sbaccount = 'account'
        self.sbpd = 'password'

    def post(self, request):
        options = Options()
        options.add_argument("--headless")  # run the browser in the background only
        # Selenium 3 style: positional driver path and find_element_by_* locators
        driver = webdriver.Chrome(self.path, options=options)
        driver.implicitly_wait(3)
        driver.get('https://www.instagram.com/')
        time.sleep(2)
        account = driver.find_elements_by_name('username')[0]
        pd = driver.find_elements_by_name('password')[0]
        # Log in
        account.send_keys(self.sbaccount)
        pd.send_keys(self.sbpd)
        driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]/button').click()
        time.sleep(3)
        driver.get('https://www.instagram.com/p/CYXqAMuBX0e/')  # jump straight to the post
        # XPath of the "load more comments" button
        more_xpath = '//*[@id="react-root"]/section/main/div/div[1]/article/div/div[2]/div/div[2]/div[1]/ul/li/div/button/div'
        time.sleep(2)
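The fixed time.sleep calls above always wait the full duration, whether the page was ready after 0.2 seconds or still not ready after 3. Below is a minimal sketch of the alternative: poll a condition and continue as soon as it holds. Note that wait_until is a hypothetical helper I made up for illustration, not part of Selenium (Selenium ships its own WebDriverWait for this purpose), and the timeout and interval defaults are arbitrary assumptions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if it never shows up.
    Hypothetical helper for illustration; not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With the driver from above, something like wait_until(lambda: driver.find_elements_by_name('username')) would return the matching elements as soon as the login form exists, because find_elements_by_name gives back an empty (falsy) list while nothing matches.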
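IG only loads 12 comments at a time (if I remember correctly XD), so this is where Selenium helps by automatically clicking the "load more comments" button. To get every comment, a while loop keeps clicking until there is no button left to click; after that, all of the comment text can be scraped in one pass: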
        # ...continued from the code above...
        while True:
            try:
                time.sleep(2)
                driver.find_element_by_xpath(more_xpath).click()
                print('next page')
            except Exception:
                # No "load more" button left, so every comment is loaded
                print('last page')
                break
        crawl_comments = []
        comments = driver.find_element_by_class_name("XQXOT").find_elements_by_class_name("Mr508")
        for n, c in enumerate(comments, start=1):
            poster = c.find_element_by_css_selector('h3._6lAjh span').text
            # The XPath is absolute, so the n-th comment is addressed by its index
            post_xpath = f'//*[@id="react-root"]/section/main/div/div[1]/article/div/div[2]/div/div[2]/div[1]/ul/ul[{n}]/div/li/div/div/div[2]/span'
            time.sleep(2)
            post = c.find_element_by_xpath(post_xpath).text
            crawl_comments.append({'poster': poster, 'post': post})
        driver.quit()  # close the browser once scraping is done
        ser = IgCommentsSerializer(crawl_comments, many=True)
        return Response(ser.data)
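The "click until the button is gone" loop above can be tried out without a browser. The following is only a sketch under my own assumptions: click_until_gone, FakePage and FakeButton are hypothetical names, the fake page loads 12 comments per click (the number I remembered above, not something verified), and the max_clicks cap is an extra safeguard that the original loop does not have.

```python
def click_until_gone(find_button, max_clicks=1000):
    """Click the button returned by `find_button` until it can no longer be found.

    `find_button` is any callable that returns an object with a click() method
    and raises an exception once the button is absent.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            find_button().click()
        except Exception:
            break  # button is gone: everything is loaded
        clicks += 1
    return clicks


class FakeButton:
    """Stand-in for the "load more comments" button."""

    def __init__(self, page):
        self.page = page

    def click(self):
        # Assumed batch size of 12 comments per click
        self.page.loaded = min(self.page.total, self.page.loaded + 12)


class FakePage:
    """Stand-in for a post page holding `total` comments."""

    def __init__(self, total):
        self.total = total
        self.loaded = min(12, total)

    def find_more_button(self):
        if self.loaded >= self.total:
            raise LookupError("no more button")
        return FakeButton(self)


page = FakePage(total=40)
clicks = click_until_gone(page.find_more_button)
print(clicks, page.loaded)
```

In the real view, find_button would be something like lambda: driver.find_element_by_xpath(more_xpath), which raises once the button disappears from the page.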
