using BeautifulSoup to grab CData

December 30, 2016

One thing to be careful of when using BeautifulSoup to grab CDATA: do not use the lxml parser. By default, lxml's parser strips CDATA sections from the tree and replaces them with their plain text content, so there is no CData node left to find. Learn more here: https://groups.google.com/forum/?fromgroups=#!topic/beautifulsoup/whLj3jMRq7g

>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(string=lambda node: isinstance(node, bs4.CData)).strip()
'aaaaaaaaaaaaa'
>>>
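
When a document holds more than one CDATA section, the same filter should work with find_all() instead of find(). The sketch below is only an illustration: the helper name extract_cdata and the sample markup are made up for this example, and it assumes a bs4 version that accepts the string= argument together with the built-in "html.parser".

```python
from bs4 import BeautifulSoup, CData

# Hypothetical sample document with two CDATA sections.
xml = """<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[ first ]]></bar>
    <bar><![CDATA[ second ]]></bar>
</foo>"""


def extract_cdata(markup):
    """Return the stripped text of every CDATA section in markup."""
    # html.parser keeps CDATA sections as CData nodes in the tree.
    soup = BeautifulSoup(markup, "html.parser")
    # The filter is called with each string node; keep only CData ones.
    sections = soup.find_all(string=lambda node: isinstance(node, CData))
    return [str(section).strip() for section in sections]


print(extract_cdata(xml))
```

find_all() returns the matches in document order, so the list follows the order of the sections in the markup.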

Comments

subbhu, April 12, 2017 at 5:48 AM
That's great, but what if I have multiple CDATA sections? How do I get them all?
Philip, April 12, 2017 at 7:41 PM
Try soup.findAll() with the same filter.