PicoFeed content scraper integration #60

Closed
opened 7 years ago by jking · 3 comments
jking commented 7 years ago
Owner

While not part of the API, extant versions of NextCloud News do support full-text scraping of articles in the Web interface. We should support this in the backend as well, especially as it will be an exposed feature in the v2 API.

One wrinkle: feeds are not owned, so whose setting is authoritative when the feed is shared between users? Contention is probably unlikely, but it is possible.

Poster
Owner

Ideally, changing whether a feed is full-content for one user should not affect other users. Since feeds are deduplicated, there's probably only one way to handle this correctly (roughly sketched after the list):

  1. Add scraping to a feed's uniqueness constraint
  2. If a feed has only one subscription, simply change the boolean
  3. Otherwise create a new feed ID with the same metadata and a different scrape setting (if one doesn't already exist)
  4. Subscriptions would probably need multiple associations with different feed IDs at different times, both to preserve the link between subscriptions and article marks, and to keep existing articles in a feed from flooding a subscription when the setting is changed
  5. The `Database::articleMark()` method would need to take these multiple feed IDs and times into consideration to mark the correct articles
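Roughly, steps 2 through 4 might fit together as below. This is only a sketch: the table and column names (including the `arsse_subscription_feeds` association table with its `started`/`ended` timestamps) are hypothetical, and transaction handling is omitted.

```php
// All table and column names here are illustrative, not the actual schema.
function subscriptionScrapeSet(PDO $db, int $sub, bool $scrape): void {
    // fetch the subscription's current feed and that feed's subscriber count
    $st = $db->prepare(
        "SELECT f.id, f.url, (SELECT count(*) FROM arsse_subscriptions WHERE feed = f.id) AS subs
           FROM arsse_feeds AS f
           JOIN arsse_subscriptions AS s ON s.feed = f.id
          WHERE s.id = ?"
    );
    $st->execute([$sub]);
    $feed = $st->fetch(PDO::FETCH_ASSOC);
    if ($feed['subs'] == 1) {
        // step 2: sole subscriber, so simply flip the flag in place
        $db->prepare("UPDATE arsse_feeds SET scrape = ? WHERE id = ?")
           ->execute([(int) $scrape, $feed['id']]);
        return;
    }
    // step 3: find or create a twin feed differing only in the scrape flag
    $st = $db->prepare("SELECT id FROM arsse_feeds WHERE url = ? AND scrape = ?");
    $st->execute([$feed['url'], (int) $scrape]);
    $new = $st->fetchColumn();
    if (!$new) {
        $db->prepare("INSERT INTO arsse_feeds(url, scrape) VALUES(?, ?)")
           ->execute([$feed['url'], (int) $scrape]);
        $new = $db->lastInsertId();
    }
    // step 4: close the current association and open a new one, so that
    // marks made before the switch still resolve to the old feed ID
    $db->prepare("UPDATE arsse_subscription_feeds SET ended = CURRENT_TIMESTAMP WHERE subscription = ? AND ended IS NULL")
       ->execute([$sub]);
    $db->prepare("INSERT INTO arsse_subscription_feeds(subscription, feed, started) VALUES(?, ?, CURRENT_TIMESTAMP)")
       ->execute([$sub, $new]);
}
```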
Poster
Owner

Options for how to represent and handle scraping preferences are, I believe, as follows:

Scraped content creates a separate feed when needed, as above

Pros:

  • Content is only scraped when absolutely required
  • Provides a good excuse to generalize subscription -> feed re-association

Cons:

  • Feeds with scraped content may be fetched twice
  • Does not provide scraped content for existing articles

Content is scraped if any subscription requests it; stored as a separate full-content column (sketched after these lists)

Pros:

  • Content is only scraped when at least one subscription requires it
  • Some users could get scraped content for existing articles
  • Simpler to implement right away
  • No duplicate fetching

Cons:

  • Potentially inconsistent about the state of articles from before the switch

Content is always scraped; setting only changes which column to return

Pros:

  • Scraped content is always available, for all articles
  • Consistent user experience
  • Simple to implement

Cons:

  • Significant extra work when fetching, for something which may never be used
  • Extra storage use, for something which may never be used
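For concreteness, a minimal sketch of what the second option might look like at fetch time. The table and column names are again hypothetical ($db is assumed to be a PDO handle, $feedID and $newArticles come from the fetch pass), and the Scraper calls reflect my reading of PicoFeed's documented API:

```php
use PicoFeed\Config\Config;
use PicoFeed\Scraper\Scraper;

// Scrape only when at least one subscription to the feed has requested it.
// Table and column names are illustrative, not the actual schema.
$st = $db->prepare("SELECT count(*) FROM arsse_subscriptions WHERE feed = ? AND scrape = 1");
$st->execute([$feedID]);
$wanted = (bool) $st->fetchColumn();
$ins = $db->prepare("INSERT INTO arsse_articles(feed, url, content, content_scraped) VALUES(?, ?, ?, ?)");
foreach ($newArticles as $article) {
    $scraped = null;
    if ($wanted) {
        $scraper = new Scraper(new Config());
        $scraper->setUrl($article->url);
        $scraper->execute();
        if ($scraper->hasRelevantContent()) {
            // keep the scraped copy alongside the original, never instead of it
            $scraped = $scraper->getFilteredContent();
        }
    }
    $ins->execute([$feedID, $article->url, $article->content, $scraped]);
}
```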
Poster
Owner

Content scraping has now been exposed for Miniflux using the second strategy above, as of 86897af0b3. If scraping was manually enabled previously, it will remain enabled. Subscriptions which have scraping enabled will see scraped content and will also be able to search it; those that do not will not.

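For reference, the retrieval side of this strategy mostly reduces to choosing a column per subscription. A minimal sketch of the query shape, with hypothetical table and column names and a PDO handle assumed:

```php
// Each subscription sees scraped content only if it asked for it, falling
// back to the original content when no scraped copy exists. Names are
// illustrative, not the actual schema.
$st = $db->prepare(
    "SELECT a.id,
            CASE WHEN s.scrape = 1 THEN coalesce(a.content_scraped, a.content)
                 ELSE a.content
            END AS content
       FROM arsse_articles AS a
       JOIN arsse_subscriptions AS s ON s.feed = a.feed
      WHERE s.id = ?"
);
$st->execute([$subID]);
$articles = $st->fetchAll(PDO::FETCH_ASSOC);
```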
jking closed this issue 3 years ago
jking modified the milestone from Future to 0.9.0 3 years ago