hub / github.com/lining0806/PythonSpiderNotes / New_Page_Info

Function New_Page_Info

NewsSpider/NewsSpider.py:23–35

Regex (slow) or XPath (fast)

(new_page)


21      return mypage_Info
22
23  def New_Page_Info(new_page):
24      '''Regex(slowly) or Xpath(fast)'''
25      # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)\.html".*?>(.*?)</a></td>', new_page, re.S)
26      # # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)">(.*?)</a></td>', new_page, re.S) # bugs
27      # results = []
28      # for url, item in new_page_Info:
29      #     results.append((item, url+".html"))
30      # return results
31      dom = etree.HTML(new_page)
32      new_items = dom.xpath('//tr/td/a/text()')
33      new_urls = dom.xpath('//tr/td/a/@href')
34      assert(len(new_items) == len(new_urls))
35      return zip(new_items, new_urls)
36
37  def Spider(url):
38      i = 0
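
The listing collects link texts and hrefs with two independent XPath queries and then asserts that the two lists have the same length. That assertion can fail when an anchor wraps its text in nested markup (for example `<a href="x"><b>Title</b></a>`), because `text()` only returns the anchor's direct text nodes. A minimal sketch of pairing text and href per anchor is shown here. It is not the repository's code: it uses the standard library's `html.parser` in place of lxml so that it runs without third-party packages, the names `LinkCollector` and `extract_links` are hypothetical, and unlike the `//tr/td/a` path it accepts any anchor nested anywhere inside a `<td>`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for <a> elements found inside <td> cells."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs, in document order
        self._td_depth = 0   # how many <td> elements are currently open
        self._in_a = False   # True while reading the body of a tracked <a>
        self._href = None    # href of the <a> currently being read
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._td_depth += 1
        elif tag == "a" and self._td_depth > 0:
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is also delivered here.
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._td_depth = max(0, self._td_depth - 1)
        elif tag == "a" and self._in_a:
            # Keep only anchors that actually carry an href.
            if self._href is not None:
                self.links.append(("".join(self._text).strip(), self._href))
            self._in_a = False


def extract_links(page):
    """Return a list of (text, href) tuples, one per anchor in a table cell."""
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links


sample = (
    '<table><tr>'
    '<td><a href="/a.html">Alpha</a></td>'
    '<td><a href="/b.html"><b>Beta</b></a></td>'
    '</tr></table>'
)
print(extract_links(sample))
# prints [('Alpha', '/a.html'), ('Beta', '/b.html')]
```

Because each pair is built from a single anchor, the text and the URL cannot drift out of step, so no length assertion is needed.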

Callers 1

Spider · Function · 0.85

Calls

no outgoing calls

Tested by

no test coverage detected
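
Since no tests are detected, a small pytest-style check could serve as a starting point. The sketch below is an assumption-laden illustration, not part of the repository: it exercises the commented-out regex variant from the docstring's "Regex" option, rebuilt as a hypothetical stand-in named `new_page_info_regex`, and it uses a made-up HTML fragment rather than a real news page. Only the pattern itself is copied from the source comment.

```python
import re

# Pattern copied from the commented-out regex line in New_Page_Info.
LINK_RE = re.compile(
    r'<td class=".*?">.*?<a href="(.*?)\.html".*?>(.*?)</a></td>', re.S
)


def new_page_info_regex(new_page):
    """Hypothetical stand-in for the regex variant: returns [(item, url), ...]."""
    results = []
    for url, item in LINK_RE.findall(new_page):
        # The pattern captures the URL without its ".html" suffix, so restore it.
        results.append((item, url + ".html"))
    return results


def test_new_page_info_regex():
    # Made-up fragment shaped like the table cells the pattern expects.
    page = (
        '<tr><td class="t"><a href="/news/1.html" target="_blank">First</a></td></tr>'
        '<tr><td class="t"><a href="/news/2.html">Second</a></td></tr>'
    )
    assert new_page_info_regex(page) == [
        ("First", "/news/1.html"),
        ("Second", "/news/2.html"),
    ]
    # A page without matching cells yields an empty list.
    assert new_page_info_regex("<p>no table here</p>") == []
```

A comparable test for the XPath path would need lxml installed and could reuse the same fragment.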