Wednesday, May 24, 2017

Learning Python as I Go - Web Crawler - Lesson 3 - BeautifulSoup

BeautifulSoup

BeautifulSoup is a very handy library for extracting information from web pages. Python also has Selenium, a browser automation tool, which can do the same job.

import requests
from bs4 import BeautifulSoup
resp = requests.get('target URL')
soup = BeautifulSoup(resp.text, 'html.parser')
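To experiment without hitting a real site, a literal HTML string can stand in for resp.text, since BeautifulSoup parses any string of markup. The snippet below is only a sketch: the markup and the variable names html, page_title and heading are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for resp.text
html = """
<html>
  <head><title>Demo page</title></head>
  <body><h1>Hello</h1><p>First paragraph</p></body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Tag names can be used as attributes to reach the first matching tag
page_title = soup.title.text
heading = soup.h1.text
print(page_title)
print(heading)
```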

Getting the first tag

soup.find('tagname')

or

soup.tagname

Both forms return the first matching tag.
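A small sketch of the two forms side by side, using made-up markup. It also shows a case worth knowing about: when nothing matches, find returns None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> tags and no <table>
html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, 'html.parser')

first_by_find = soup.find('p')   # explicit method call
first_by_attr = soup.p           # attribute shorthand for the same lookup
print(first_by_find.text)
print(first_by_find is first_by_attr)

# No <table> exists here, so the result is None
missing = soup.find('table')
print(missing is None)
```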

Getting a child tag under a tag

print(soup.tag.child_tag.text)  # general form; tag and child_tag are placeholders

print(soup.div.a.text)

That is, get the text of the first a under the first div.
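The chained form can be tried on a made-up page like the one below. This is a sketch under the assumption that the first div really contains an a; if an intermediate tag is missing, the lookup yields None and the next .text raises AttributeError, so a guard is shown as well.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <div> blocks
html = """
<div><a href="/first">First link</a></div>
<div><a href="/second">Second link</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# First <a> inside the first <div>
link = soup.div.a
link_text = link.text
link_href = link['href']   # attributes are read like dictionary keys
print(link_text, link_href)

# The first <div> has no <span>, so this lookup gives None
span = soup.div.span
span_text = span.text if span is not None else ''
print(repr(span_text))
```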

Getting all tags

main_titles = soup.find_all('h4')
for title in main_titles:
    print(title.a.text)

find_all returns every matching tag, which can then be processed one by one in a loop.
A class condition can also be added in any of the following ways:

soup.find_all('h4', 'card-title')

soup.find_all('h4', {'class': 'card-title'})

soup.find_all('h4', class_='card-title')

Each of the calls above selects the tags that are h4 and have the class card-title.
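The three spellings can be compared on a small made-up page. In this sketch the second h4 carries two classes, to illustrate that a class filter matches when any one of the tag's classes equals the given value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two card titles and one unrelated <h4>
html = """
<h4 class="card-title"><a href="/a">Post A</a></h4>
<h4 class="card-title featured"><a href="/b">Post B</a></h4>
<h4 class="other"><a href="/c">Post C</a></h4>
"""
soup = BeautifulSoup(html, 'html.parser')

by_positional = soup.find_all('h4', 'card-title')
by_dict = soup.find_all('h4', {'class': 'card-title'})
by_keyword = soup.find_all('h4', class_='card-title')  # class is a reserved word, hence class_

titles = [h4.a.text for h4 in by_keyword]
print(titles)
print(len(by_positional), len(by_dict), len(by_keyword))
```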

Getting an element by id

soup.find(id='mac-p')
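A quick sketch of the id lookup on made-up markup. The id value mac-p comes from the example above; the surrounding HTML and the second id are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two elements that have ids
html = '<p id="intro">Intro</p><p id="mac-p">MacBook text</p>'
soup = BeautifulSoup(html, 'html.parser')

element = soup.find(id='mac-p')       # no tag name given, only the id condition
element_name = element.name
element_text = element.text
print(element_name, element_text)

absent = soup.find(id='no-such-id')   # nothing matches, so this is None
print(absent is None)
```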

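If the page uses a custom attribute such as "data-index" and you want to filter on it, the keyword-argument form does not work: the name contains a "-", so data-index='123' is not valid Python syntax. The only option is the standard dictionary form.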

soup.find('', {'data-index': '123'})

The tag name is left empty, and only the search condition is specified.
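As a variation on the call above, the dictionary can also be passed through the attrs keyword, which avoids the empty tag name. The sketch below uses made-up markup; passing True as the value, to match any element that has the attribute at all, is standard BeautifulSoup behaviour but goes beyond what this post covers.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using a custom data-index attribute
html = """
<ul>
  <li data-index="122">Item 122</li>
  <li data-index="123">Item 123</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# soup.find(data-index='123') would be a SyntaxError, so use a dictionary
item = soup.find(attrs={'data-index': '123'})
item_text = item.text
print(item_text)

# True matches every element that carries the attribute, whatever its value
all_indexed = soup.find_all(attrs={'data-index': True})
indexes = [li['data-index'] for li in all_indexed]
print(indexes)
```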

Getting all the text on the page

divs = soup.find_all('div', 'content')
for div in divs:
    # 1 - use .text (includes many newlines and spaces)
    print(div.text)
    # 2 - locate each piece by tag
    print(div.h6.text.strip(), div.h4.a.text.strip(), div.p.text.strip())
    # 3 - use .stripped_strings
    print([s for s in div.stripped_strings])
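To see how the approaches differ, here is a sketch on a made-up content block with deliberately messy whitespace. The get_text call with a separator and strip=True is not used in this post; it is included as one more standard way to get a single clean string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra whitespace and line breaks
html = """
<div class="content">
  <h6>News</h6>
  <h4><a href="/post">  A post title </a></h4>
  <p>
    Summary text
  </p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', 'content')

raw = div.text                          # keeps the newlines and indentation
pieces = list(div.stripped_strings)     # one trimmed string per text node
joined = div.get_text(' ', strip=True)  # trimmed pieces joined by a space

print(repr(raw))
print(pieces)
print(joined)
```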

藤原栗子工作室
