Tag: beautifulsoup

How can I get href links from HTML using Python?: 

    import urllib2
    website = "WEBSITE"
    openwebsite = urllib2.urlopen(website)
    html = openwebsite.read()
    print html

So far so good. But I only want the href links from the plain-text HTML. How can I solve this?
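
One possible approach, sketched here for Python 3 with the bs4 package installed (both are assumptions; the question itself uses Python 2 and urllib2): parse the HTML and collect the href attribute of every <a> tag. The helper name extract_hrefs and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually carry an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Hypothetical sample markup standing in for the downloaded page.
sample = (
    '<p><a href="http://example.com/a">A</a> '
    '<a name="x">no link</a> '
    '<a href="/b">B</a></p>'
)
links = extract_hrefs(sample)
print(links)
```

The string returned by read() on the opened URL could be passed to extract_hrefs in place of the sample.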

UnicodeEncodeError: 'ascii' codec can't encode character in a special name: My Python (version 2.7) script runs fine getting some company names from local HTML files, but for some specific country names it gives the error "UnicodeEncodeError: 'ascii' codec can't encode character". It fails in particular when this company name comes up:

    Company name: Kühlfix Kälteanlagen Ing.Gerhard Doczekal & Co. KG

The link cannot be processed:

    Traceback (most recent call last):
      File "C:\Python27\Process2.py", line 261, in <module>
        flog.write("\nCompany Name: "+str(pCompanyName))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)

The error is raised in this part of the code:

    if companyAlreadyKnown == 0:
        for hit in soup2.findAll("h1"):
            print "Company Name: "+hit.text
            pCompanyName = hit.text
            flog.write("\nCompany Name: "+str(pCompanyName))
            companyObj.setCompanyName(pCompanyName)
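
A minimal sketch of one common way around this error, assuming the log file can be opened through io.open with an explicit UTF-8 encoding (available on both Python 2.7 and Python 3): write the Unicode text directly instead of passing it through str(), which is where the implicit ASCII encoding happens on Python 2. The log path and the sample name below are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical stand-in for hit.text, which BeautifulSoup returns as Unicode.
company_name = u"K\xfchlfix K\xe4lteanlagen"

# Hypothetical log location; any writable path would do.
log_path = os.path.join(tempfile.mkdtemp(), "companies.log")

# Opening with an explicit encoding lets write() accept Unicode text directly.
with io.open(log_path, "a", encoding="utf-8") as flog:
    flog.write(u"\nCompany Name: " + company_name)

# Read the file back with the same encoding to confirm the round trip.
with io.open(log_path, encoding="utf-8") as flog:
    logged = flog.read()
```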

BeautifulSoup returns an empty list when searching by a compound class name: BeautifulSoup returns an empty list when I search for a compound class name using a regular expression. Example:

    import re
    from bs4 import BeautifulSoup
    bs = """ <a class="name-single name692" href="www.example.com">Example Text</a> """
    bsObj = BeautifulSoup(bs)
    # this returns the class
    found_elements = bsObj.find_all("a", class_= re.compile("^(name-single.*)$"))
    # this returns an empty list
    found_elements = bsObj.find_all("a", class_= re.compile("^(name-single name\d*)$"))

I need the class selection to be very precise. Any ideas?
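
A hedged sketch of one way to get a precise match, assuming Python 3 and bs4: class is a multi-valued attribute, so depending on the bs4 version a regular expression passed as class_ may be tested against each class separately rather than against the whole attribute string. A filter function that inspects the list of classes avoids that ambiguity. The function name has_exact_classes and the second anchor in the sample are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<a class="name-single name692" href="www.example.com">Example Text</a>'
    '<a class="name-single other" href="#">Other</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Mirrors the second half of the pattern used in the question.
NAME_NUM = re.compile(r"^name\d*$")


def has_exact_classes(tag):
    """True for <a> tags whose classes are exactly: name-single, name<digits>."""
    classes = tag.get("class", [])
    return (
        tag.name == "a"
        and len(classes) == 2
        and classes[0] == "name-single"
        and NAME_NUM.match(classes[1]) is not None
    )


# find_all accepts a function and calls it once per tag.
found_elements = soup.find_all(has_exact_classes)
print([a.get_text() for a in found_elements])
```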

Beautiful Soup findAll doesn't find them all: I'm trying to parse a website and get some information with BeautifulSoup.findAll, but it doesn't find them all. I'm using Python 3. The code is this:

    #!/usr/bin/python3
    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    page = urlopen ("http://mangafox.me/directory/")
    # print (page.read ())
    soup = BeautifulSoup (page.read ())
    manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)
    for manga in manga_img:
        print (manga['href'])

It only prints half of them…

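Remove a tag using BeautifulSoup but keep its contents: Currently I have code that does something like this:

    soup = BeautifulSoup(value)
    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.extract()
    soup.renderContents()

Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?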
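
A minimal sketch, assuming Python 3 and bs4 (the question's code looks like BeautifulSoup 3): in bs4, Tag.unwrap() removes a tag while leaving its children in place, which is different from extract(), which removes the tag together with everything inside it. The contents of VALID_TAGS and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist; the question does not show the real one.
VALID_TAGS = {"p", "em", "strong"}


def strip_invalid_tags(html):
    """Drop tags that are not whitelisted but keep the text and tags inside them."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list built up front, so the tree can be
    # modified safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            tag.unwrap()  # replace the tag with its own children
    return str(soup)


cleaned = strip_invalid_tags("<p>Hello <span>big <em>wide</em></span> <b>world</b></p>")
print(cleaned)
```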

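Python and BeautifulSoup encoding issues: I'm writing a crawler in Python using BeautifulSoup, and everything was going swimmingly until I ran into this site: http://www.elnorte.ec/ I'm getting the contents with the requests library:

    r = requests.get('http://www.elnorte.ec/')
    content = r.content

If I print the content variable at this point, all the Spanish special characters seem to be fine. However, once I feed the content variable to BeautifulSoup, it all gets messed up:

    soup = BeautifulSoup(content)
    print(soup)
    …
    <a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&month=08&day=27&modid=203" title="1009 artÃculos en este dÃa">
    …

It is apparently garbling all the Spanish special characters (accents and so on). I've tried content.decode('utf-8') and content.decode('latin-1'), and I've also tried fiddling with the fromEncoding parameter of BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice. Any pointers would be much appreciated.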
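
A hedged sketch of the mechanism, assuming Python 3 and bs4, and assuming the page bytes really are UTF-8 (the question does not confirm which encoding the site actually serves). Output like "artÃculos" is what UTF-8 bytes look like when decoded as Latin-1. In bs4 the keyword argument is spelled from_encoding; fromEncoding was the BeautifulSoup 3 spelling. The sample bytes stand in for r.content.

```python
from bs4 import BeautifulSoup

expected = u"1009 art\xedculos en este d\xeda"

# Stand-in for r.content: the raw, undecoded bytes of the page.
raw = (u'<a title="' + expected + u'">x</a>').encode("utf-8")

# Decoding UTF-8 bytes with the wrong codec reproduces the garbled text.
wrong = raw.decode("latin-1")

# Passing the bytes together with the encoding avoids the wrong guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
title = soup.a["title"]
```

If the text still prints garbled after this, the terminal or log encoding would be the next thing to check, since the parsed value itself is then correct.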

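BeautifulSoup getting href: I have the following soup:

    <a href="some_url">next</a> <span class="class">…</span>

From this I want to extract the href, "some_url". I can do it if I only have one tag, but here there are two tags. I can also get the text 'next', but that's not what I want. Also, is there a good description of the API somewhere with examples? I'm using the standard documentation, but I'm looking for something a little more organized.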
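
A small sketch, assuming Python 3 and bs4: find() returns only the first tag that matches the requested name, so the neighbouring <span> does not get in the way. The "..." inside the span is a placeholder for the content elided in the question.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="some_url">next</a> <span class="class">...</span>',
    "html.parser",
)

# find() returns the first matching Tag (or None if there is none).
link = soup.find("a")
href = link["href"]

# .get() returns None instead of raising KeyError when the attribute is missing.
span_href = soup.find("span").get("href")

print(href)
```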

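Extracting an attribute value with BeautifulSoup: I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:

    import urllib
    f = urllib.urlopen("http://58.68.130.147")
    s = f.read()
    f.close()
    from BeautifulSoup import BeautifulStoneSoup
    soup = BeautifulStoneSoup(s)
    inputTag = soup.findAll(attrs={"name" : "stainfo"})
    output = inputTag['value']
    print str(output)

I get a TypeError: list indices must be integers, not str. From the BeautifulSoup documentation I understand that strings should not be a problem here, but I am no expert and I may have misunderstood. Any suggestion is greatly appreciated! Thanks in advance.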
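
A minimal sketch, assuming Python 3 and bs4 with the built-in HTML parser (the question uses BeautifulSoup 3 and BeautifulStoneSoup): findAll returns a list-like ResultSet, which is why indexing it with 'value' raises the TypeError. Using find(), or indexing the list first, yields a single Tag whose attributes can be read like dictionary keys. The sample form is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = (
    '<form><input name="stainfo" value="42 online"/>'
    '<input name="other" value="x"/></form>'
)
soup = BeautifulSoup(html, "html.parser")

# The attrs dict is needed here because the keyword "name" already means
# "tag name" in find() and find_all().
input_tag = soup.find("input", attrs={"name": "stainfo"})
output = input_tag["value"] if input_tag is not None else None

# Equivalent with find_all: take the first element of the result list.
same_output = soup.find_all(attrs={"name": "stainfo"})[0]["value"]

print(output)
```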

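BeautifulSoup grab visible webpage text: Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want to get the body text (the article) and maybe even a few tab names here and there. I have tried the suggestion in this question, but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out the arguments I need for the function findAll() in order to get just the visible text on a webpage. So, how should I find all visible text, excluding scripts, comments, CSS, etc.?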
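
One possible approach, sketched under the assumption of Python 3 and bs4: walk every text node in the tree and skip comments as well as text whose parent element is never rendered. The set of parent names treated as hidden is a judgment call rather than a fixed rule, and this sketch does not account for text hidden through CSS. The function name visible_text and the sample page are made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Parents whose text is not rendered; "[document]" covers the doctype.
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def visible_text(html):
    """Join the text nodes a browser would normally display."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for node in soup.descendants:
        if not isinstance(node, NavigableString):
            continue  # tags themselves carry no text
        if isinstance(node, Comment):
            continue  # HTML comments are never shown
        if node.parent.name in HIDDEN_PARENTS:
            continue
        text = node.strip()
        if text:
            chunks.append(text)
    return " ".join(chunks)


sample = """<html><head><title>T</title><style>p{color:red}</style></head>
<body><!-- hidden note --><h1>Headline</h1><p>First <b>bold</b> paragraph.</p>
<script>var x = 1;</script></body></html>"""
text = visible_text(sample)
print(text)
```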

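Downloading a picture via urllib and Python: So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few similar programs on here that do something similar, but nothing quite like what I need. The one I found most similar is right here ( http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images ). I tried using this code:

    >>> import urllib
    >>> image = urllib.URLopener()
    >>> image.retrieve("../../../comics/00000001.jpg","00000001.jpg")
    ('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>)

I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I know how to handle the rest: essentially just use a for loop, split the string into "00000000" and "jpg", and increment the "00000000" up to the largest number, which I would have to determine somehow. Any recommendations on the best way to do this, or on how to download the file correctly? Thanks!

Edit 6/15/10: Here is the completed script; it saves the files to any directory you choose. For some odd reason the files weren't downloading, and then they just did. Any suggestions on how to clean it up would be much appreciated. I'm currently working out how to find out how many comics exist on the site so that I can get just the latest one, rather than having the program quit after a certain number of exceptions are raised.

    import urllib
    import os
    comicCounter=len(os.listdir('/file'))+1 # reads the number of files in the folder to start downloading at the next comic
    errorCount=0
    def download_comic(url,comicName): […]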
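
A hedged sketch for Python 3 (an assumption; the question uses the Python 2 urllib module): urllib.request.urlretrieve saves the resource at a URL to a local path. Two details seem relevant to the symptoms described: the URL needs to be absolute rather than a relative path such as "../../../comics/00000001.jpg", and a bare file name is written relative to the current working directory, so passing a full destination path makes the result easier to find. This is not the elided body of the question's download_comic; the three-argument signature and the directory names are made up. The demonstration uses a local file:// URL so that it runs without network access.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_comic(url, dest_dir, filename):
    """Save the resource at an absolute URL as dest_dir/filename."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, filename)
    urlretrieve(url, dest_path)
    return dest_path


# Offline demonstration: a local file plays the role of the remote image.
source = Path(tempfile.mkdtemp()) / "00000001.jpg"
source.write_bytes(b"fake image bytes")

comics_dir = os.path.join(tempfile.mkdtemp(), "comics")
saved_path = download_comic(source.as_uri(), comics_dir, "00000001.jpg")
print(os.path.basename(saved_path))
```

With a real site, the first argument would be the full http:// or https:// address of the image.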
