        return j

class BeautifulSoup(BeautifulStoneSoup):

    """This parser knows the following facts about HTML:

    * Some tags have no closing tag and should be interpreted as being
      closed as soon as they are encountered.

    * The text inside some tags (e.g. 'script') may contain tags which
      are not really part of the document and which should be parsed
      as text, not tags. If you want to parse the text as tags, you can
      always fetch it and parse it explicitly.

    * Tag nesting rules:

      Most tags can't be nested at all. For instance, the occurrence of
      a <p> tag should implicitly close the previous <p> tag.

       <p>Para1<p>Para2
        should be transformed into:
       <p>Para1</p><p>Para2

      Some tags can be nested arbitrarily. For instance, the occurrence
      of a <blockquote> tag should _not_ implicitly close the previous
      <blockquote> tag.

       Alice said: <blockquote>Bob said: <blockquote>Blah
        should NOT be transformed into:
       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

      Some tags can be nested, but the nesting is reset by the
      interposition of other tags. For instance, a <tr> tag should
      implicitly close the previous <tr> tag within the same <table>,
      but not close a <tr> tag in another table.

       <table><tr>Blah<tr>Blah
        should be transformed into:
       <table><tr>Blah</tr><tr>Blah
        but,
       <tr>Blah<table><tr>Blah
        should NOT be transformed into
       <tr>Blah<table></tr><tr>Blah

    Differing assumptions about tag nesting rules are a major source
    of problems with the BeautifulSoup class. If BeautifulSoup is not
    treating as nestable a tag your page author treats as nestable,
    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
    BeautifulStoneSoup before writing your own subclass."""

    def __init__(self, *args, **kwargs):
        # Convert smart quotes to HTML entities unless the caller chose otherwise.
        if 'smartQuotesTo' not in kwargs:
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        # This class always parses its input as HTML.
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)

    # Tags that never take a closing tag; they are closed as soon as
    # they are encountered.
    SELF_CLOSING_TAGS = buildTagMap(None,
                                    ('br', 'hr', 'input', 'img', 'meta',
                                     'spacer', 'link', 'frame', 'base', 'col'))
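
The nesting rules described in the class docstring can be illustrated with a small standalone sketch. This is not the library's implementation: the names `NESTABLE`, `RESET_BY`, `tags_to_close` and `open_tag` are hypothetical, and the tables contain only the tags used in the docstring's examples. The sketch tracks a stack of open tags and decides which of them a newly seen start tag implicitly closes.

```python
# Illustrative sketch only; names and tables below are assumptions,
# not part of BeautifulSoup.

# Taken from SELF_CLOSING_TAGS in the listing above.
SELF_CLOSING = {"br", "hr", "input", "img", "meta",
                "spacer", "link", "frame", "base", "col"}

# Hypothetical: tags that may nest arbitrarily.
NESTABLE = {"blockquote"}

# Hypothetical: for each tag, the ancestors that reset its nesting.
RESET_BY = {"tr": {"table"}}


def tags_to_close(open_tags, new_tag):
    """Return the open tags (innermost first) that new_tag implicitly closes."""
    if new_tag in NESTABLE:
        return []
    resetters = RESET_BY.get(new_tag, set())
    # Walk from the innermost open tag outwards.
    for depth in range(len(open_tags) - 1, -1, -1):
        tag = open_tags[depth]
        if tag == new_tag:
            # Close everything from the innermost tag up to the match.
            return open_tags[depth:][::-1]
        if tag in resetters:
            # A resetting ancestor sits in between; close nothing.
            return []
    return []


def open_tag(open_tags, new_tag):
    """Return (new stack of open tags, tags implicitly closed)."""
    closed = tags_to_close(open_tags, new_tag)
    remaining = open_tags[:len(open_tags) - len(closed)]
    if new_tag not in SELF_CLOSING:
        remaining = remaining + [new_tag]
    return remaining, closed


if __name__ == "__main__":
    print(open_tag(["p"], "p"))             # second <p> closes the first
    print(open_tag(["table", "tr"], "tr"))  # <tr> closes the previous <tr>
    print(open_tag(["tr", "table"], "tr"))  # inner table resets the nesting
```

The stack-walking approach keeps the three cases from the docstring (not nestable, arbitrarily nestable, nestable with reset) in one loop; a real parser would also need to emit the closing tags and handle explicit end tags, which this sketch leaves out.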