
Fixing Work Imports from Other OTW-based Archives

Hello, hello! This is going to be one of my technical tutorial posts, but it is going to be even more niche than usual. By that I mean this will only be useful to... fewer than half a dozen people. But these people are my friends! Such as Melo, who runs Superlove, and Agnes, who runs Sunset, two of the OTWA-based writing archive websites.

So, let's start at the beginning. I run fanfiction.lol, a writing archive and community designed for both transformative works and original works alike! My site uses the OTWArchive codebase, which is the codebase powering Archive of Our Own. You can read my original announcement post for more information.

As of right now, there are two ways to add your own work to the site:

  1. New Work: Where you add the work manually.
    • It's important to note that you should definitely be writing and saving your work elsewhere.
    • Light HTML is accepted.
  2. Import Work: Where you link to an already-existing work from another archive.
    • This is an important feature, since fanfiction.lol is designed to be an alternative or mirror for people's work on AO3.

Unfortunately, importing doesn't work at all by default. This has been a problem for all other archives. And so, this is what I needed to fix!

The First Problem

Before anything else, attempting to import a work from Archive of Our Own returns the error:

"We couldn't successfully import that work, sorry: URL is for a work on the Archive. Please bookmark it directly instead."

The upstream OTWA codebase was written by and for Archive of Our Own. It has two built-in protections against importing from AO3 itself:

  1. PERMITTED_HOSTS: A list of hostnames considered "the Archive". Any import URL whose hostname is in this list is rejected before the download even starts.
  2. No AO3-specific parser: This one baffled me. Even if the block is removed, the generic fallback parser (parse_story_from_unknown) grabs the entire <body> of the fetched page, which on an AO3 work page is the full AO3 HTML including navigation, header, footer, etc. and not just the story text.

Both problems need to be fixed.

Part 1: Remove AO3 from PERMITTED_HOSTS

File: config/config.yml

Search for the PERMITTED_HOSTS key (around line 750). You will find a list that includes AO3's production IP addresses and every AO3 domain variant:

PERMITTED_HOSTS: [
  # Production
  "104.153.64.122",
  "208.85.241.152",
  "208.85.241.157",
  "ao3.org",
  "archiveofourown.com",
  "archiveofourown.net",
  "archiveofourown.org",
  "download.archiveofourown.org",
  "insecure.archiveofourown.org",
  "secure.archiveofourown.org",
  "www.ao3.org",
  "www.archiveofourown.com",
  "www.archiveofourown.net",
  "www.archiveofourown.org",
  # fanfiction.lol  <-- this will be your fork's domain
  "fanfiction.lol",
  "www.fanfiction.lol",
  # Staging
  "insecure-test.archiveofourown.org",
  "test.archiveofourown.org",
  "testdownload.archiveofourown.org"
]

Replace the entire block so it contains only your own archive's hostnames:

PERMITTED_HOSTS: [
  # Your archive's own hostnames — imports from these are blocked (bookmark instead)
  "yourdomain.example",
  "www.yourdomain.example",
  "status.yourdomain.example"
]

PERMITTED_HOSTS does two jobs. URLs whose hostname is in this list are blocked from being imported (users should bookmark works that already live on the archive, not import them), and URLs whose hostname is in this list are accepted in Abuse Reports.

So, keep your own domain(s) in the list. Remove everything AO3-related.
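
To make the import block concrete, here is a minimal sketch of a hostname check like the one described above. It is not the upstream OTWA code: the method name own_archive_url? and the two-entry list are assumptions for illustration only.

```ruby
require "uri"

# Hypothetical stand-in for the PERMITTED_HOSTS list in config/config.yml.
PERMITTED_HOSTS = %w[yourdomain.example www.yourdomain.example].freeze

# Returns true when the URL points at the archive itself, in which case
# an import should be refused and the user told to bookmark instead.
def own_archive_url?(url, permitted_hosts = PERMITTED_HOSTS)
  host = URI.parse(url).host
  !host.nil? && permitted_hosts.include?(host.downcase)
rescue URI::InvalidURIError
  # An unparseable URL is not one of our own hosts.
  false
end

puts own_archive_url?("https://yourdomain.example/works/1")      # true
puts own_archive_url?("https://archiveofourown.org/works/12345") # false
```

With the AO3 hostnames removed from the list, an AO3 URL falls through this kind of check and the download can start.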

Bonus: Dead key in config/local.yml

You may find a lowercase permitted_hosts key in your config/local.yml. Because of how the OTWA config loader merges the two files (app_config.merge!(...) in config/application.rb), the lowercase key is a different key from the uppercase PERMITTED_HOSTS and is silently ignored: ArchiveConfig.PERMITTED_HOSTS only ever reads the uppercase key. You can safely remove the lowercase permitted_hosts block from local.yml to avoid confusion.
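
The case-sensitivity is easy to see with plain hashes. The sketch below only imitates the merge step using Ruby's standard YAML library; the two YAML snippets are made-up, trimmed-down stand-ins for config.yml and local.yml, and the real loader may do more than this.

```ruby
require "yaml"

# Hypothetical, trimmed-down versions of the two config files.
config_yml = <<~YAML
  PERMITTED_HOSTS:
    - "yourdomain.example"
YAML

local_yml = <<~YAML
  permitted_hosts:
    - "archiveofourown.org"
YAML

app_config = YAML.safe_load(config_yml)
app_config.merge!(YAML.safe_load(local_yml))

# Hash keys are case-sensitive, so both keys coexist after the merge and
# the lowercase one never overrides the uppercase one.
puts app_config.keys.inspect
puts app_config["PERMITTED_HOSTS"].inspect
```

The uppercase list comes out of the merge unchanged, which is why the lowercase block has no effect.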

Part 2: Add an AO3-specific Story Parser

Even after removing AO3 from PERMITTED_HOSTS, the import will succeed but the chapter content will be the entire AO3 page HTML: navigation, header, login form, footer, and all. This is because the fallback parser doesn't know AO3's HTML structure.

This is the trickier part. You need to add a custom parser that knows how to extract only the story content from an AO3 work page.

Thankfully, I made this for you already!

File: app/models/story_parser.rb

Make the following three changes:

Change 1: Add the SOURCE_AO3 regex constant

Find the block of SOURCE_* constants (around line 57):

SOURCE_LJ = '((live|dead|insane)journal\.com)|journalfen(\.net|\.com)|dreamwidth\.org'.freeze
SOURCE_DW = 'dreamwidth\.org'.freeze
SOURCE_FFNET = '(^|[^A-Za-z0-9-])fanfiction\.net'.freeze
SOURCE_DEVIANTART = 'deviantart\.com'.freeze

Add a new line before SOURCE_LJ:

SOURCE_AO3 = '(archiveofourown\.org|ao3\.org)'.freeze
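
One thing worth knowing about this pattern: it is unanchored, so it matches anywhere in the URL. The sketch below compares it with a stricter variant that borrows the boundary guard from SOURCE_FFNET. The strict constant, the helper matches_source? and the notao3.org URL are all assumptions for illustration, and the case-insensitive matching is assumed to mirror how the importer builds its regex.

```ruby
SOURCE_AO3 = '(archiveofourown\.org|ao3\.org)'.freeze

# A stricter variant (an assumption, not upstream code) that requires a
# non-hostname character, or the start of the string, before the domain.
SOURCE_AO3_STRICT = '(^|[^A-Za-z0-9-])(archiveofourown\.org|ao3\.org)'.freeze

# Hypothetical helper: compiles the string pattern and tests a URL.
def matches_source?(pattern, url)
  Regexp.new(pattern, Regexp::IGNORECASE).match?(url)
end

urls = [
  "https://archiveofourown.org/works/12345",
  "https://www.ao3.org/works/12345",
  "https://notao3.org/works/12345"
]

urls.each do |url|
  loose = matches_source?(SOURCE_AO3, url)
  strict = matches_source?(SOURCE_AO3_STRICT, url)
  puts "#{url}: loose=#{loose} strict=#{strict}"
end
```

The loose pattern also accepts the lookalike host, while the strict one only accepts the two real domains and their subdomains. Either works for the import itself; the strict form just avoids routing unrelated sites to the AO3 parser.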

Change 2: Add ao3 to KNOWN_STORY_PARSERS

Find:

KNOWN_STORY_PARSERS = %w[deviantart dw lj].freeze

You'll see that the importer works for deviantART, Dreamwidth, and LiveJournal by default.

Add ao3 to it:

KNOWN_STORY_PARSERS = %w[ao3 deviantart dw lj].freeze

The list is checked in order and the search stops at the first match, so ao3 should come first (it is the most specific match for an AO3 URL).

Change 3: Add the parse_story_from_ao3 method

This section required me to inspect the HTML source of a work page on AO3 and write a small regular expression.

Find the parse_story_from_deviantart method and add the following method after it (before shift_chapter_attributes):

def parse_story_from_ao3(story, detect_tags = true)
  work_params = { chapter_attributes: {} }

  # Title: use the work heading, not the browser tab title
  title_node = @doc.at_css('h2.title.heading')
  work_params[:title] = if title_node
    title_node.inner_text.strip
  else
    # Fall back to the <title> tag with the site suffix stripped
    @doc.at_css('title')&.inner_text&.sub(/\s*\[Archive of Our Own\]\s*$/i, '')&.strip.to_s
  end

  # Summary
  summary_node = @doc.at_css('.summary.module blockquote.userstuff')
  work_params[:summary] = clean_storytext(summary_node.inner_html) if summary_node

  # Author beginning notes (inside .preface.group, before chapter content)
  preface = @doc.at_css('.preface.group')
  if preface
    notes_node = preface.at_css('.notes.module blockquote.userstuff')
    work_params[:notes] = clean_storytext(notes_node.inner_html) if notes_node
  end

  # Story text: extract only from #chapters .userstuff, not the whole page body
  chapters_div = @doc.at_css('#chapters')
  if chapters_div
    userstuff = chapters_div.at_css('.userstuff')
    storytext = userstuff ? userstuff.inner_html : chapters_div.inner_html
  else
    # Last resort: the whole page body, or the raw input passed to the parser
    storytext = @doc.at_css('body')&.inner_html || story
  end
  work_params[:chapter_attributes][:content] = clean_storytext(storytext)

  if detect_tags
    meta_group = @doc.at_css('dl.work.meta.group')
    if meta_group
      rating = meta_group.css('dd.rating.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:rating_string] = convert_rating_string(rating.first) if rating.any?

      warnings = meta_group.css('dd.warning.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:archive_warning_string] = warnings.join(', ') if warnings.any?

      fandoms = meta_group.css('dd.fandom.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:fandom_string] = clean_tags(fandoms.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if fandoms.any?

      relationships = meta_group.css('dd.relationship.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:relationship_string] = clean_tags(relationships.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if relationships.any?

      characters = meta_group.css('dd.character.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:character_string] = clean_tags(characters.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if characters.any?

      freeforms = meta_group.css('dd.freeform.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:freeform_string] = clean_tags(freeforms.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if freeforms.any?

      published = meta_group.at_css('dd.published')
      work_params[:revised_at] = convert_revised_at(published.inner_text.strip) if published
    end
  end

  post_process_meta(work_params)
end

This is what the parser extracts from an AO3 work page:

Field            | AO3 CSS selector
-----------------|--------------------------------------------------
Title            | h2.title.heading
Summary          | .summary.module blockquote.userstuff
Author notes     | .preface.group .notes.module blockquote.userstuff
Story text       | #chapters .userstuff
Rating           | dd.rating.tags li a.tag
Archive warnings | dd.warning.tags li a.tag
Fandom(s)        | dd.fandom.tags li a.tag
Relationship(s)  | dd.relationship.tags li a.tag
Character(s)     | dd.character.tags li a.tag
Additional tags  | dd.freeform.tags li a.tag
Published date   | dd.published
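
The title has one extra wrinkle: when h2.title.heading is missing, the parser falls back to the page's <title> tag and strips the site suffix with a regular expression. Here is that substitution on its own, as a small sketch. The helper name strip_site_suffix and the sample title string are assumptions; the sample only guesses at a "Title - Author - Fandom" shape.

```ruby
# Same substitution as the <title> fallback in parse_story_from_ao3.
SITE_SUFFIX = /\s*\[Archive of Our Own\]\s*$/i

# Hypothetical helper: removes the trailing site name from a page title.
def strip_site_suffix(page_title)
  page_title.to_s.sub(SITE_SUFFIX, "").strip
end

# Made-up <title> text for illustration.
puts strip_site_suffix("My Story - some_author - Some Fandom [Archive of Our Own]")
# My Story - some_author - Some Fandom
```

Note that the fallback only removes the site name. Anything else in the tab title (such as an author or fandom) stays in the imported title, so it is worth checking the title on the preview page.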

Part 3: Deploy and Restart

I am using Docker to run my project, so this is how I restart the Rails application for these changes to take effect:

docker restart <your-web-container-name>

For standard setups, it would look more like this:

touch tmp/restart.txt
# or
bin/rails restart

How the Parser Dispatch Works (for further customization)

When a URL is submitted for import, story_parser.rb calls get_source_if_known which iterates through KNOWN_STORY_PARSERS and tests each entry's SOURCE_* regex against the URL. The first match wins, and the corresponding parse_story_from_<source> method is called.
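
Here is a simplified, self-contained sketch of that dispatch. The class MiniStoryParser, the method source_for and the one-line parser bodies are hypothetical stand-ins; only the lookup pattern (walk the list, build a regex from SOURCE_<NAME>, first match wins, call parse_story_from_<source>) mirrors the description above.

```ruby
# Simplified stand-in for the dispatch in story_parser.rb.
class MiniStoryParser
  SOURCE_AO3 = '(archiveofourown\.org|ao3\.org)'.freeze
  SOURCE_DW = 'dreamwidth\.org'.freeze
  KNOWN_STORY_PARSERS = %w[ao3 dw].freeze

  # Returns the first source name whose SOURCE_* pattern matches the URL,
  # or nil when no known source matches.
  def source_for(url)
    KNOWN_STORY_PARSERS.find do |source|
      pattern = self.class.const_get("SOURCE_#{source.upcase}")
      Regexp.new(pattern, Regexp::IGNORECASE).match?(url)
    end
  end

  # Calls parse_story_from_<source>, or the generic fallback if nothing matched.
  def parse(url)
    source = source_for(url)
    source ? send("parse_story_from_#{source}", url) : parse_story_from_unknown(url)
  end

  private

  def parse_story_from_ao3(url)
    "ao3 parser handled #{url}"
  end

  def parse_story_from_dw(url)
    "dw parser handled #{url}"
  end

  def parse_story_from_unknown(url)
    "fallback parser handled #{url}"
  end
end

parser = MiniStoryParser.new
puts parser.parse("https://archiveofourown.org/works/12345")
puts parser.parse("https://example.com/some-story")
```

Because the method name is derived from the list entry, every name in KNOWN_STORY_PARSERS needs both a matching SOURCE_* constant and a matching parse_story_from_* method.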

This means you can add parsers for any archive, not just AO3. For example, to add a parser for yourdomain.example:

  1. Add SOURCE_YOURARCHIVE = 'yourdomain\.example'.freeze
  2. Add 'yourarchive' to KNOWN_STORY_PARSERS
  3. Write def parse_story_from_yourarchive(_story, detect_tags = true) following the same pattern

Since all OTWA-based forks share the same HTML structure, the same parse_story_from_ao3 method will also work for importing from other OTWA forks (Superlove, Sunset, SquidgeWorld, etc.)! Just add their domains to SOURCE_AO3, or create a separate source constant that shares the same parser method (the dispatch looks for a method named after the source, so the new name has to point at parse_story_from_ao3, for example through an alias).

For example, to also support imports from superlove.sayitditto.net and sunset.femslash.club:

SOURCE_AO3 = '(archiveofourown\.org|ao3\.org|superlove\.sayitditto\.net|sunset\.femslash\.club)'.freeze
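
If the list of forks keeps growing, hand-escaping every dot gets error-prone. As an optional sketch, the same pattern string can be built from a plain list of hostnames with Regexp.escape. The OTWA_HOSTS constant is a made-up name, not something in the codebase.

```ruby
# Hosts taken from the example above; extend the list for other forks.
OTWA_HOSTS = %w[
  archiveofourown.org
  ao3.org
  superlove.sayitditto.net
  sunset.femslash.club
].freeze

# Regexp.escape backslash-escapes the dots so they only match literal dots.
SOURCE_AO3 = "(#{OTWA_HOSTS.map { |host| Regexp.escape(host) }.join('|')})".freeze

puts SOURCE_AO3
```

This produces the same string as the hand-written constant, so nothing else in the parser has to change.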

Since all these archives use identical OTWA HTML structure, the same parser handles all of them.

Testing, Limitations, and Conclusion

  1. Go to Post > Import Work on your archive
  2. Paste an AO3 work URL (e.g. https://archiveofourown.org/works/12345)
  3. Select a language and press Import
  4. You should reach a preview page showing only the story text, with tags auto-populated from AO3

As of right now, there are limitations with this parser. To start, restricted works (login-required on AO3) cannot be imported. Next, multi-chapter works imported from the main /works/ID URL will only import the first chapter's content (each chapter needs to be imported separately via its /works/ID/chapters/CHAPTER_ID URL, or you can implement chaptered parsing).
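
If you want to warn users about the first-chapter-only behavior, one option is to look at the URL shape before importing. This is only a sketch: the helper describe_work_url and the constant WORK_URL_PATTERN are hypothetical and not part of the OTWA codebase, and the sketch only inspects the URL. It cannot tell whether a work actually has more than one chapter.

```ruby
# Matches /works/ID and, optionally, /chapters/CHAPTER_ID after it.
WORK_URL_PATTERN = %r{/works/(?<work_id>\d+)(?:/chapters/(?<chapter_id>\d+))?}

# Hypothetical helper: returns the work ID and (if present) the chapter ID,
# or nil when the URL does not look like a work URL at all.
def describe_work_url(url)
  match = WORK_URL_PATTERN.match(url)
  return nil unless match

  { work_id: match[:work_id], chapter_id: match[:chapter_id] }
end

p describe_work_url("https://archiveofourown.org/works/12345")
p describe_work_url("https://archiveofourown.org/works/12345/chapters/678")
```

A nil chapter_id means the user pasted the main work URL, which is the case where only the first chapter's content would come through.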

Also, the import does not carry over the original author name, as the work is posted under the importing user's account. So please only import your own work or work you have explicit permission to upload!

Anyways, that's all there is! It's really fun to write out these technical guides even when they only serve a small number of people. If you're from the future and have created a new OTWA-fork project, I hope this guide helps you as well!
