How It Works: Queue

Overview

The queue is more complex than it may seem at first glance. Beneath the unassuming list is a machine that integrates data from many sources and tries its best to make archiving artwork easy and low-effort. No machine is perfect, however, and understanding its mechanisms and design will make troubleshooting easier.

To begin, the queue is a first-in, first-out list. The images and links you add to it are processed in the order you added them, 10 at a time, and each item within a batch is handled one by one.
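
As a rough sketch of that ordering (the in-memory list and the process_item placeholder below are stand-ins for illustration, not how Werehouse actually stores or processes its queue):

    from collections import deque

    BATCH_SIZE = 10  # the queue is worked through ten items at a time


    def process_next_batch(queue: deque, process_item) -> None:
        """Take up to BATCH_SIZE items off the front of the queue, oldest first,
        and handle them one by one."""
        for _ in range(min(BATCH_SIZE, len(queue))):
            process_item(queue.popleft())  # first in, first out


    # Items added earlier are always processed before items added later.
    queue = deque(["https://example.com/view/1", "https://example.com/view/2"])
    process_next_batch(queue, print)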

Supported Sites

Werehouse is able to archive images from all of these websites:

Future development includes plans for the following websites and protocols:

If there’s a site you’d like supported that isn't on either of these lists, please open an issue on GitHub to request that it be added!

Unsupported Sites

Werehouse is not able to archive images from these websites:

Procedure

Werehouse follows the same procedure each time it tries to process a queue item (a simplified sketch of the whole pipeline appears after this list):

  1. Find Sources
    If the queue item is an image, Werehouse first tries to find sources by asking FuzzySearch and Fluffle.xyz. If neither of them knows where the image comes from, Werehouse gives up. On the other paw, if the queue item is a link to a webpage, Werehouse assumes that link is the source and proceeds onward. This step accepts a queue item and produces a list of links.
  2. Scrape Sources
    Werehouse tries to download information about each of the potential sources it found. This includes obvious things, like the link to the image(s), but also many kinds of metadata, such as maturity level, dimensions, tags, the artist’s profile, and more. If a link refers to a post with multiple images (such as on Twitter or Weasyl), the scraped data includes information about all of those images. This step accepts a list of links and produces a list of sub-lists, one per link, each containing the scraped data for every image behind that link.
  3. Fetch Images
    Werehouse downloads the full-resolution images from every source (if they weren’t already downloaded). This step accepts a list of sub-lists of scraped image data and produces the same list, but with full-size images included.
  4. Duplicate Check
    For each image, Werehouse computes a content hash and checks whether something similar has already been archived. It also looks through all of the source links currently in the archive to see whether the link has been saved before. If it finds a similar hash or a duplicate link, it stops and asks for help. This step accepts a list of sub-lists of scraped image data and, if there were no duplicates, passes it along unchanged.
  5. Add to Archive
    The full-size images and all of the other information are saved to your archive. This step accepts a list of sub-lists of scraped image data and produces nothing, because it is the final step.
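
To make the hand-offs between those steps concrete, here is a simplified sketch of the whole pipeline. It is illustrative only: the names (ScrapedImage, find_sources, process_queue_item, and so on) are invented for this example rather than taken from Werehouse's actual code, the scrapers and archive are replaced with in-memory stand-ins, and the duplicate check is reduced to an exact hash comparison instead of a real similarity search.

    import hashlib
    from dataclasses import dataclass, field


    @dataclass
    class ScrapedImage:
        """One image scraped from one source link (the output of step 2).
        The fields are illustrative, not the real Werehouse schema."""
        source_link: str
        image_url: str
        maturity: str = "unknown"
        tags: list[str] = field(default_factory=list)
        artist: str = ""
        data: bytes = b""  # full-size file contents, filled in by step 3


    # Stand-ins for the real archive and scrapers so the sketch runs on its own.
    archived_hashes: set[str] = set()
    archived_links: set[str] = set()


    class DuplicateFound(Exception):
        """Raised when step 4 needs a human decision."""


    def find_sources(item: str) -> list[str]:
        # Step 1: queue item in, list of links out. A pasted link is taken to be
        # its own source; an uploaded image would be reverse-searched instead.
        return [item] if item.startswith("http") else []


    def scrape_source(link: str) -> list[ScrapedImage]:
        # Step 2: one link in, one sub-list out; a multi-image post would
        # yield several entries here instead of just one.
        return [ScrapedImage(source_link=link, image_url=link + "/full.png")]


    def fetch_images(groups: list[list[ScrapedImage]]) -> list[list[ScrapedImage]]:
        # Step 3: download the full-resolution file for every entry that lacks one.
        for group in groups:
            for img in group:
                if not img.data:
                    img.data = b"downloaded bytes for " + img.image_url.encode()
        return groups


    def duplicate_check(groups: list[list[ScrapedImage]]) -> list[list[ScrapedImage]]:
        # Step 4: compare a content hash and the source link against the archive.
        # The real check looks for similar images; this sketch only catches exact matches.
        for group in groups:
            for img in group:
                digest = hashlib.sha256(img.data).hexdigest()
                if digest in archived_hashes or img.source_link in archived_links:
                    raise DuplicateFound(img.source_link)
        return groups


    def add_to_archive(groups: list[list[ScrapedImage]]) -> None:
        # Step 5: save everything; nothing is produced because this is the final step.
        for group in groups:
            for img in group:
                archived_hashes.add(hashlib.sha256(img.data).hexdigest())
                archived_links.add(img.source_link)


    def process_queue_item(item: str) -> None:
        links = find_sources(item)                        # 1. queue item -> list of links
        if not links:
            return                                        #    no sources found: give up
        groups = [scrape_source(link) for link in links]  # 2. links -> sub-lists of scraped data
        groups = fetch_images(groups)                     # 3. attach full-size images
        groups = duplicate_check(groups)                  # 4. stop on anything already archived
        add_to_archive(groups)                            # 5. final step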

States

A queue item can be in one of these states: