A lax Web news feed parser
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 

419 lines
10 KiB

Minimal feed:
input: >
<rss version="2.0"/>
output:
format: rss
version: '2.0'
Channel GUID:
input: >
<rss>
<channel>
<guid isPermalink="false">http://example.com/</guid>
</channel>
</rss>
output:
format: rss
id: 'http://example.com/'
Channel GUID with whitespace:
input: >
<rss>
<channel>
<guid isPermalink="false">
http://example.com/
</guid>
</channel>
</rss>
output:
format: rss
id: 'http://example.com/'
Root GUID: # Any elements on the RSS2 root element should be ignored
input: >
<rss>
<channel/>
<guid isPermalink="false">http://example.com/</guid>
</rss>
output:
format: rss
Bogus GUID before good:
input: >
<rss>
<channel>
<guid/>
<guid isPermalink="false">http://example.com/</guid>
</channel>
</rss>
output:
format: rss
id: 'http://example.com/'
Schedule interval 1:
input: >
<rss><channel>
<ttl>60</ttl>
</channel></rss>
output:
format: rss
sched:
interval: PT60M
Schedule interval 2:
input: >
<rss><channel>
<ttl>bogus</ttl>
<ttl>60</ttl>
</channel></rss>
output:
format: rss
sched:
interval: PT60M
Schedule interval 3:
input: >
<rss><channel>
<ttl>
0120
</ttl>
</channel></rss>
output:
format: rss
sched:
interval: PT120M
Schedule interval 4:
input: >
<rss><channel>
<ttl>0</ttl>
</channel></rss>
output:
format: rss
Skip days:
input: >
<rss><channel>
<skipDays>
<day>MONDAY</day>
<day>sunday</day>
<day>TuE </day>
<day>bogus</day>
</skipDays>
</channel></rss>
output:
format: rss
sched:
skip: 1124073472 # 0b1000011000000000000000000000000
Skip all days:
input: >
<rss><channel>
<skipDays>
<day>mon</day>
<day>tue</day>
<day>wed</day>
<day>thu</day>
<day>fri</day>
<day>sat</day>
<day>sun</day>
</skipDays>
</channel></rss>
output:
format: rss
sched:
expired: true
skip: 2130706432 # 0b1111111000000000000000000000000
Skip hours:
input: >
<rss><channel>
<skipHours>
<hour>24</hour>
<hour>04</hour>
<hour> 9</hour>
<hour>bogus</hour>
<hour>25</hour>
</skipHours>
</channel></rss>
output:
format: rss
sched:
skip: 529 # 0b000000000000001000010001
Skip all hours:
input: >
<rss><channel>
<skipHours>
<hour>00</hour>
<hour>01</hour>
<hour>02</hour>
<hour>03</hour>
<hour>04</hour>
<hour>05</hour>
<hour>06</hour>
<hour>07</hour>
<hour>08</hour>
<hour>09</hour>
<hour>10</hour>
<hour>11</hour>
<hour>12</hour>
<hour>13</hour>
<hour>14</hour>
<hour>15</hour>
<hour>16</hour>
<hour>17</hour>
<hour>18</hour>
<hour>19</hour>
<hour>20</hour>
<hour>21</hour>
<hour>22</hour>
<hour>23</hour>
</skipHours>
</channel></rss>
output:
format: rss
sched:
expired: true
skip: 16777215 # 0b111111111111111111111111
Skip all days and hours:
input: >
<rss><channel>
<skipDays>
<day>mon</day>
<day>tue</day>
<day>wed</day>
<day>thu</day>
<day>fri</day>
<day>sat</day>
<day>sun</day>
</skipDays>
<skipHours>
<hour>00</hour>
<hour>01</hour>
<hour>02</hour>
<hour>03</hour>
<hour>04</hour>
<hour>05</hour>
<hour>06</hour>
<hour>07</hour>
<hour>08</hour>
<hour>09</hour>
<hour>10</hour>
<hour>11</hour>
<hour>12</hour>
<hour>13</hour>
<hour>14</hour>
<hour>15</hour>
<hour>16</hour>
<hour>17</hour>
<hour>18</hour>
<hour>19</hour>
<hour>20</hour>
<hour>21</hour>
<hour>22</hour>
<hour>23</hour>
</skipHours>
</channel></rss>
output:
format: rss
sched:
expired: true
skip: 2147483647 # 0b1111111111111111111111111111111
Feed language:
input: >
<rss><channel>
<language>ja</language>
</channel></rss>
output:
format: rss
lang: ja
Feed link:
input: >
<rss><channel>
<link/>
<link>http://[example.net]/</link>
<link>http://example.com/</link>
</channel></rss>
output:
format: rss
link: 'http://example.com/'
Feed link via GUID 1:
input: >
<rss><channel>
<guid>http://example.com/</guid>
</channel></rss>
output:
format: rss
id: 'http://example.com/'
link: 'http://example.com/'
Feed link via GUID 2:
input: >
<rss><channel>
<guid isPermalink='true'>http://example.com/</guid>
</channel></rss>
output:
format: rss
id: 'http://example.com/'
link: 'http://example.com/'
GUID not a link:
input: >
<rss><channel>
<guid isPermalink='maybe not'>http://example.com/</guid>
</channel></rss>
output:
format: rss
id: 'http://example.com/'
Explicit link preferred:
input: >
<rss><channel>
<guid>http://example.net/</guid>
<link>http://example.com/</link>
</channel></rss>
output:
format: rss
id: 'http://example.net/'
link: 'http://example.com/'
GUID not a url:
input: >
<rss><channel>
<guid>http://[example.com]/</guid>
</channel></rss>
output:
format: rss
id: 'http://[example.com]/'
Relative feed link:
input: >
<rss><channel xml:base="http://example.com/path/">
<link>/</link>
</channel></rss>
output:
format: rss
link: ['/', 'http://example.com/path/']
Feed title 1:
input: >
<rss><channel>
<title> Loose text</title>
</channel></rss>
output:
format: rss
title:
loose: 'Loose text'
Feed title 2:
input: >
<rss><channel>
<title xml:base="https://example.com/"> Loose text</title>
</channel></rss>
output:
format: rss
title:
loose: 'Loose text'
htmlBase: 'https://example.com/'
Feed summary:
input: >
<rss><channel>
<description>Loose text</description>
</channel></rss>
output:
format: rss
summary:
loose: 'Loose text'
Feed publication date:
input: >
<rss><channel>
<pubDate>2020-03-03T00:00:00Z</pubDate>
</channel></rss>
output:
format: rss
dateModified: '2020-03-03T00:00:00Z'
Feed build date:
input: >
<rss><channel>
<lastBuildDate>2020-03-03T00:00:00Z</lastBuildDate>
</channel></rss>
output:
format: rss
dateModified: '2020-03-03T00:00:00Z'
Multiple dates 1:
input: >
<rss><channel>
<pubDate>2020-03-03T00:00:00Z</pubDate>
<pubDate>2020-03-03T00:00:00-04:00</pubDate>
<lastBuildDate>2020-03-03T00:00:00Z</lastBuildDate>
</channel></rss>
output:
format: rss
dateModified: '2020-03-03T00:00:00-04:00'
Multiple dates 2:
input: >
<rss><channel>
<pubDate>2020-03-03T00:00:00Z</pubDate>
<lastBuildDate>2020-03-03T00:00:00-04:00</lastBuildDate>
<pubDate>2020-03-03T00:00:00Z</pubDate>
</channel></rss>
output:
format: rss
dateModified: '2020-03-03T00:00:00-04:00'
Channel image:
input: >
<rss><channel>
<image>
<url>http://example.com/</url>
</image>
</channel></rss>
output:
format: rss
image: 'http://example.com/'
Category:
input: >
<rss><channel>
<category>Category the first </category>
<category domain="ook eek">Category the second </category>
<category/>
</channel></rss>
output:
format: rss
categories:
- name: 'Category the first'
- name: 'Category the second'
domain: 'ook eek'
Authors, editors, and webmasters:
input: >
<rss><channel>
<author>jane.doe@example.com (Jane Doe)</author>
<author>john.doe@example.com</author>
<managingEditor>Larry</managingEditor>
<webMaster>Curly</webMaster>
</channel></rss>
output:
format: rss
people:
- name: 'Jane Doe'
mail: 'jane.doe@example.com'
role: author
- name: 'john.doe@example.com'
mail: 'john.doe@example.com'
role: author
- name: Larry
role: editor
- name: Curly
role: webmaster