Thursday, May 25, 2017

Python Learn-as-I-Go Notes - Crawler (Web Scraping) - Lesson 4 - Scraping Tables

Python Crawler

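Today's lesson covers how to scrape data out of a table. The approach is much like handling a GridView back when I was writing ASP.NET. The fundamentals really never change!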

Suppose the web page looks like this:

No.	Item	Price	Link
1	Chinese	1200	http://123.com
2	English	1800	http://123.com
3	Math	1500	http://123.com
4	Physics and Chemistry	2000	http://123.com

First, as before, connect to the target URL with requests.get, then hand the response to BeautifulSoup for parsing!

import requests
from bs4 import BeautifulSoup

resp = requests.get('<target URL>')
soup = BeautifulSoup(resp.text, 'html.parser')
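The fetch-and-parse step can be wrapped up a little more defensively. This is only a sketch: fetch_soup is a helper name I made up, and the 10-second timeout is an arbitrary assumption, not something from the lesson.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    # The timeout value is an assumption; without one, requests can wait forever.
    resp = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")
```

Calling fetch_soup('<target URL>') would then give back the same kind of soup object used in the rest of this post.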

A tr is the same thing as a row, so grab the tr elements first.

rows = soup.find('table', 'table').tbody.find_all('tr')

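Then loop over every tr and pull out the price. The price td is in the third column, which is index 2 because indexes start at 0.
(Note: so far the only place I have run into indexes starting at 1 is generol @@)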

for row in rows:
  price = row.find_all('td')[2].text

Basically, that is all it takes to get the price.
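To try the steps so far without hitting a real site, here is a minimal sketch that parses an inline HTML string. The markup is my own guess at what a page behind the sample table might look like (a table with class "table", a thead and a tbody); a real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the sample table; the real page may differ.
HTML = """
<table class="table">
  <thead><tr><th>No.</th><th>Item</th><th>Price</th><th>Link</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Chinese</td><td>1200</td><td><a href="http://123.com">http://123.com</a></td></tr>
    <tr><td>2</td><td>English</td><td>1800</td><td><a href="http://123.com">http://123.com</a></td></tr>
    <tr><td>3</td><td>Math</td><td>1500</td><td><a href="http://123.com">http://123.com</a></td></tr>
    <tr><td>4</td><td>Physics and Chemistry</td><td>2000</td><td><a href="http://123.com">http://123.com</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The second positional argument of find() filters by CSS class.
rows = soup.find("table", "table").tbody.find_all("tr")

# The price is the third td in each row, so index 2.
price_texts = [row.find_all("td")[2].text for row in rows]
print(price_texts)  # ['1200', '1800', '1500', '2000']
```

One thing to watch: html.parser does not add a tbody on its own, so .tbody only works when the page source actually contains one.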

If you want to work out the average course price, first declare a list:
prices = []

Then append to it inside the loop:

for row in rows:
  price = row.find_all('td')[2].text
  prices.append(int(price))

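Total amount:

sum(prices)

Number of courses:

len(prices)

Dividing the first by the second gives the average: sum(prices) / len(prices)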

Summing a list in Python is really convenient!
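Putting the averaging together, a small sketch might look like the following. average_price is a name I made up, and stripping whitespace and thousands separators before int() is my own assumption about what scraped text can look like; the sample table needs neither.

```python
def average_price(price_texts):
    """Average a list of scraped price strings (hypothetical helper)."""
    # Assumption: scraped text may carry whitespace or commas such as "1,200".
    prices = [int(text.strip().replace(",", "")) for text in price_texts]
    if not prices:
        return None  # avoid ZeroDivisionError when no rows were found
    return sum(prices) / len(prices)


print(average_price(["1200", "1800", "1500", "2000"]))  # 1625.0
```

Returning None for an empty list is just one possible choice; raising an error would work as well if an empty table should count as a failure.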

Another approach is to locate elements through the parent, child, and sibling relationships between tags.
table
  tr
    td
    td
    td (price)
    td (link)
      a

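Starting from the "a" tag, we can find its parent, the link td, and then that td's sibling, the price td.
So the approach becomes: first locate the "a" tags.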

links = soup.find_all('a')

Then use each "a" to find its parent and that parent's sibling:

for link in links:
  price = link.parent.previous_sibling.text

.parent (the father) and .previous_sibling (the older brother): this is the same approach used when handling some web pages.
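One caveat with .previous_sibling: when the HTML is pretty-printed, the node right before a td is often a whitespace text node rather than the previous td. The sketch below uses markup I made up to show that case, and swaps in find_previous_sibling('td'), which skips text nodes. It is an alternative to the code above, not what the lesson used.

```python
from bs4 import BeautifulSoup

# Hypothetical row, pretty-printed so whitespace text nodes sit between the tds.
HTML = """
<table class="table">
  <tr>
    <td>1</td>
    <td>Chinese</td>
    <td>1200</td>
    <td><a href="http://123.com">http://123.com</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
link = soup.find("a")

# Here .previous_sibling is the whitespace between two tds, not the price td.
print(repr(link.parent.previous_sibling))

# find_previous_sibling("td") skips text nodes and returns the nearest td tag.
price = link.parent.find_previous_sibling("td").text
print(price)  # 1200
```

If the page has no whitespace between the tds, both versions give the same result.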

If you want to print out all of the table data, the approach is the same.

rows = soup.find('table', 'table').find_all('tr')  # get all the tr elements first

for row in rows:
  # another way to get all the tds
  # all_tds = [td for td in row.children]
  all_tds = row.find_all('td')  # get all the tds
  print(all_tds[0].text)  # read values by index; the other columns work the same way
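Here is a runnable sketch of printing the whole table. The markup is again my own assumption; in particular I assume the header row uses th cells, so find_all('td') returns an empty list for it and the row has to be skipped before indexing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the header row uses th cells, the data rows use td.
HTML = """
<table class="table">
  <thead><tr><th>No.</th><th>Item</th><th>Price</th><th>Link</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Chinese</td><td>1200</td><td><a href="http://123.com">http://123.com</a></td></tr>
    <tr><td>2</td><td>English</td><td>1800</td><td><a href="http://123.com">http://123.com</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

table_rows = []
for row in soup.find("table", "table").find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue  # header row: no td cells, so indexing would raise IndexError
    table_rows.append([td.get_text(strip=True) for td in cells])

for cells in table_rows:
    print(cells)
```

Because this version calls find_all('tr') on the table itself instead of on tbody, it also picks up the header row, which is why the empty check is there.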

Of course, if a link is sometimes missing, that will raise an exception, so the code needs a guard against it!

rows = soup.find('table', 'table').find_all('tr')  # get all the tr elements first
for row in rows:
  all_tds = row.find_all('td')  # get all the tds
  if 'href' in all_tds[3].a.attrs:  # check whether href exists
    href = all_tds[3].a['href']
  else:
    href = None
  print(all_tds[0].text)  # read values by index; the other columns work the same way
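The check above still assumes the td contains an "a" tag at all. A slightly more defensive sketch, under my own assumption that the tag itself can also be missing, might look like this (extract_href is a made-up helper name):

```python
from bs4 import BeautifulSoup

# Hypothetical rows: a normal link, an <a> without href, and no <a> at all.
HTML = """
<table class="table">
  <tr><td>1</td><td>Chinese</td><td>1200</td><td><a href="http://123.com">http://123.com</a></td></tr>
  <tr><td>2</td><td>English</td><td>1800</td><td><a>coming soon</a></td></tr>
  <tr><td>3</td><td>Math</td><td>1500</td><td></td></tr>
</table>
"""


def extract_href(cell):
    """Return the href of the first <a> inside a td, or None (hypothetical helper)."""
    link = cell.find("a")
    if link is None:
        return None  # no <a> tag in this cell
    return link.get("href")  # Tag.get returns None when the attribute is absent


soup = BeautifulSoup(HTML, "html.parser")
hrefs = [extract_href(row.find_all("td")[3])
         for row in soup.find("table", "table").find_all("tr")]
print(hrefs)  # ['http://123.com', None, None]
```

Using link.get('href') instead of link['href'] means a missing attribute gives None rather than a KeyError.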

Another approach is to use stripped_strings!

rows = soup.find('table', 'table').find_all('tr')  # get all the tr elements first

for row in rows:
  print([s for s in row.stripped_strings])

[s for s in subsets] is equivalent to:
ss = []
for s in subsets:
  ss.append(s)
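To see that equivalence in action, here is a small sketch on a single made-up row. The markup and variable names are my own; the extra spaces around the first cell are there on purpose to show what stripped_strings strips.

```python
from bs4 import BeautifulSoup

# Hypothetical row with extra whitespace around and between the cells.
HTML = """<table class="table">
  <tr>
    <td> 1 </td>
    <td>Chinese</td>
    <td>1200</td>
    <td><a href="http://123.com">http://123.com</a></td>
  </tr>
</table>"""

row = BeautifulSoup(HTML, "html.parser").find("tr")

# List comprehension form, as in the post.
as_comprehension = [s for s in row.stripped_strings]

# The same thing written as an explicit loop.
as_loop = []
for s in row.stripped_strings:
    as_loop.append(s)

# list() gives the same result without writing a loop at all.
as_list = list(row.stripped_strings)

print(as_comprehension)  # ['1', 'Chinese', '1200', 'http://123.com']
```

stripped_strings drops the whitespace-only text between the tds and trims each remaining string, so there is no need to filter or strip by hand.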

藤原栗子工作室
