PicoFeed content scraper integration #60
While not part of the API, extant versions of NextCloud News do support full-text scraping of articles in the Web interface. We should support this in the backend as well, especially as it will be an exposed feature in the v2 API.
One wrinkle: feeds are not owned, so whose setting is authoritative when the feed is shared between users? Contention is probably unlikely, but it is possible.
Ideally, changing whether a feed is full-content for one user should not affect other users. Since feeds are deduplicated, there's probably only one way to handle this correctly: subscriptions which request scraped content would be attached to a separate, scraped copy of the feed, and the Database::articleMark() method would need to take these multiple feed IDs and times into consideration to mark the correct articles (see the sketch after the list below).

Options for how to represent and handle scraping preferences are, I believe, as follows:
1. Scraped content creates a separate feed when needed, as above
   - Pros:
   - Cons:
2. Content is scraped if any subscription requests it; stored as a separate full-content column
   - Pros:
   - Cons:
3. Content is always scraped; the setting only changes which column is returned
   - Pros:
   - Cons:
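For the separate-feed approach, the complication noted above for Database::articleMark() might be handled roughly as follows. This is a minimal sketch, not the actual Arsse implementation: the table names (`feeds`, `articles`) and columns (`url`, `marked_read`, `edited_date`) are invented for illustration.

```php
<?php
// Hypothetical sketch only: none of these table or column names are taken
// from the real Arsse schema. A deduplicated feed may exist as two physical
// rows (a scraped and a non-scraped copy sharing one source URL), so marking
// must resolve every copy of the feed before updating articles.
function articleMark(PDO $db, int $feedId, string $editedDate): void {
    // Collect the IDs of all feed rows which share this feed's source URL
    $q = $db->prepare(
        "SELECT id FROM feeds WHERE url = (SELECT url FROM feeds WHERE id = ?)"
    );
    $q->execute([$feedId]);
    $feedIds = $q->fetchAll(PDO::FETCH_COLUMN);

    // Because each copy may have been fetched at different times, article IDs
    // differ between copies; the edition timestamp is the common reference
    // point for deciding which articles to mark
    $in = implode(",", array_fill(0, count($feedIds), "?"));
    $u = $db->prepare(
        "UPDATE articles SET marked_read = 1
         WHERE feed IN ($in) AND edited_date <= ?"
    );
    $u->execute(array_merge($feedIds, [$editedDate]));
}
```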
Content scraping has now been exposed for Miniflux using the second strategy above, as of 86897af0b3. If scraping was manually enabled previously, it will remain enabled. Subscriptions which have scraping enabled will see scraped content and will also be able to search it; those which do not will not.
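As a rough illustration of what the second strategy implies at query time: scraped content is stored alongside the original, and the subscription's scrape flag selects which column is served. This is a hedged sketch with invented table and column names, not the actual Arsse query.

```php
<?php
// Hypothetical sketch of the second strategy: the article row carries both
// the original content and a scraped full-content column, and the
// per-subscription scrape flag decides which one is returned (and searched).
// Table and column names are invented, not the real Arsse schema.
function articleContent(PDO $db, int $subscriptionId, int $articleId): ?string {
    $q = $db->prepare(
        "SELECT CASE
                    WHEN s.scrape = 1 AND a.content_scraped IS NOT NULL
                    THEN a.content_scraped
                    ELSE a.content
                END AS content
         FROM articles AS a
         JOIN subscriptions AS s ON s.feed = a.feed
         WHERE s.id = ? AND a.id = ?"
    );
    $q->execute([$subscriptionId, $articleId]);
    $content = $q->fetchColumn();
    return $content === false ? null : $content;
}
```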