Using BeautifulSoup to grab CDATA
One thing to be careful of when using BeautifulSoup to grab CDATA is not to use the lxml parser. By default, lxml's parser strips CDATA sections from the tree and replaces them with their plain text content, so there is no CData node left to search for. Learn more here: https://groups.google.com/forum/?fromgroups=#!topic/beautifulsoup/whLj3jMRq7g
>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
aaaaaaaaaaaaa
]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>>
That's great, but what if I have multiple CDATA sections? How do I get them all?
Try soup.findAll()
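To make that reply concrete, here is a minimal sketch, assuming the same "html.parser" setup as in the post. find_all() (the current spelling of findAll()) takes the same isinstance filter and returns every CData node rather than only the first one. The sample document and the names xml, cdata_nodes and cdata_texts are made up for illustration, and string= is the newer name for the text= argument used above.

```python
from bs4 import BeautifulSoup, CData

# Hypothetical sample with two CDATA sections
xml = """<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
first
]]></bar>
<bar><![CDATA[
second
]]></bar>
</foo>"""

# Use html.parser, not lxml, so the CDATA sections survive as CData nodes
soup = BeautifulSoup(xml, "html.parser")

# The callable is given each string node; keep only the CData ones
cdata_nodes = soup.find_all(string=lambda node: isinstance(node, CData))

# CData is a str subclass, so strip() trims the surrounding newlines
cdata_texts = [node.strip() for node in cdata_nodes]
print(cdata_texts)
```

The result is a list in document order, so you can loop over it or index into it as needed.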