While not part of the API, extant versions of NextCloud news do support full-text scraping of articles in the Web interface. We should support this in the backend as well, especially as it will be an exposed feature in the v2 API.
One wrinkle: feeds are not owned, so whose setting is authoritative when the feed is shared between users? Contention is probably unlikely, but it is possible.
Ideally changing whether a feed is full-content for one user should not affect other users. Since feeds are deduplicated, there's probably only one way to handle this correctly:
1. Add scraping to a feed's uniqueness constraint
2. If a feed has only one subscription, simply change the boolean
3. Otherwise create a new feed ID with the same metadata and different scrape setting (if it doesn't already exist)
4. Subscriptions probably would need to have multiple associations with different feed IDs at different times, both to keep the association between subscriptions and article marks and to avoid existing articles in a feed flooding a subscription when the setting is changed
5. The `Database::articleMark()` method would need to take these multiple feed IDs and times into consideration to mark the correct articles
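The steps above can be sketched roughly as follows. This is only an illustration of the bookkeeping, not the actual schema or code: all names (`feeds`, `subs`, `feed_id`, `set_scraping`) are hypothetical, and the real database would track associations in tables rather than in-memory dictionaries.

```python
import time

# Hypothetical in-memory stand-ins for database tables:
# feeds maps the deduplication key (url, scrape) to a feed ID;
# subs maps a subscription ID to its list of (feed_id, since) associations,
# newest last -- the history is what lets article marks stay resolvable
feeds: dict = {}
subs: dict = {}
_next_id = 1

def feed_id(url: str, scrape: bool) -> int:
    """Step 1: the scrape flag is part of the feed's uniqueness key."""
    global _next_id
    key = (url, scrape)
    if key not in feeds:
        feeds[key] = _next_id
        _next_id += 1
    return feeds[key]

def subscribe(sub: int, url: str, scrape: bool = False) -> None:
    subs[sub] = [(feed_id(url, scrape), time.time())]

def set_scraping(sub: int, url: str, scrape: bool) -> None:
    current = subs[sub][-1][0]
    sharers = sum(1 for a in subs.values() if a[-1][0] == current)
    if sharers == 1:
        # Step 2: sole subscriber, so simply flip the flag in place
        old = next(k for k, v in feeds.items() if v == current)
        feeds[(url, scrape)] = feeds.pop(old)
    else:
        # Steps 3-4: the feed is shared, so re-associate the subscription
        # with a sibling feed (created if needed), keeping the old
        # association and its timestamp so existing marks still resolve
        subs[sub].append((feed_id(url, scrape), time.time()))
```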
Options for how to represent and handle scraping preferences are, I believe, as follows:
#### Scraped content creates a separate feed when needed, as above
Pros:
- Content is only scraped when absolutely required
- Provides a good excuse to generalize subscription -> feed re-association
Cons:
- Feeds with scraped content may be fetched twice
- Does not provide scraped content for existing articles
#### Content is scraped if any subscription requests it; stored as separate full-content column
Pros:
- Content is only scraped when at least one subscription requires it
- Some users could get scraped content for existing articles
- Simpler to implement right away
- No duplicate fetching
Cons:
- Scraped content is unavailable for articles fetched before the switch, leaving their state potentially inconsistent
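This second option can be sketched as follows, assuming a hypothetical schema in which the article row simply gains a nullable scraped-content column (the table and column names here are illustrative, not the real ones):

```python
import sqlite3

# Hypothetical schema: the feed-supplied content and the scraped full
# text live in separate columns of the same article row
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE articles(
        id INTEGER PRIMARY KEY,
        feed INTEGER NOT NULL,
        content TEXT NOT NULL,       -- content as supplied by the feed
        content_scraped TEXT         -- full text; NULL when not scraped
    )
""")

def store_article(feed, content, scrape_fn, any_subscriber_scrapes):
    # The expensive scrape runs only when at least one subscription
    # to this feed has requested full content
    scraped = scrape_fn() if any_subscriber_scrapes else None
    db.execute(
        "INSERT INTO articles(feed, content, content_scraped) VALUES(?,?,?)",
        (feed, content, scraped),
    )
```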
#### Content is always scraped; setting only changes which column to return
Pros:
- Scraped content is always available, for all articles
- Consistent user experience
- Simple to implement
Cons:
- Significant extra work when fetching, for something which may never be used
- Extra storage use, for something which may never be used
Content scraping has now been exposed for Miniflux using the second strategy above as of 86897af0b3e085f3e3e7dd7895a487e34aa898ab. If scraping was manually enabled previously it will remain enabled. Subscriptions which have scraping enabled will see scraped content and will also be able to search scraped content; those who do not will not.
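The per-subscription behaviour described here amounts to choosing which column a subscription sees at query time. A minimal sketch, with hypothetical table and column names (the fallback to feed-supplied content covers articles stored before scraping was enabled):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE articles(id INTEGER PRIMARY KEY, content TEXT, content_scraped TEXT)"
)
db.execute(
    "INSERT INTO articles VALUES "
    "(1, 'summary', 'the full scraped text'), (2, 'older summary', NULL)"
)

def visible_content(scraping_enabled: bool) -> str:
    # Subscriptions with scraping enabled see (and search) the scraped
    # text where it exists, falling back to the feed-supplied content
    # for articles stored before scraping was switched on
    return "coalesce(content_scraped, content)" if scraping_enabled else "content"

def search(scraping_enabled: bool, term: str) -> list:
    col = visible_content(scraping_enabled)
    return [r[0] for r in db.execute(
        f"SELECT id FROM articles WHERE {col} LIKE ?", (f"%{term}%",)
    )]
```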