'Zig-zag Passenger and Freight Train' by an unknown artist. Original from Library of Congress.| Rawpixel
Fixing Work Imports from Other OTW-based Archives
Hello, hello! This is going to be one of my technical tutorial posts, but it is going to be even more niche than usual. By that I mean this will only be useful to... fewer than half a dozen people. But these people are my friends! Such as Melo running Superlove, and Agnes running Sunset, two of the OTWA-based writing archive websites.
So, let's start at the beginning. I run fanfiction.lol, a writing archive and community designed for both transformative works and original works alike! My site uses the OTWArchive codebase, which is the codebase powering Archive of Our Own. You can read my original announcement post for more information.
As of right now, there are two ways to add your own work to the site:
- New Work: Where you add the work manually by hand.
- It's important to note that you should definitely be writing and saving your work elsewhere.
- Light HTML is accepted.
- Import Work: Where you link to an already-existing work from another archive.
- This is an important feature, since fanfiction.lol is designed to be an alternative or mirror for people's work on AO3.
Unfortunately, importing from AO3 doesn't work at all by default. This has been a problem for every other OTWA-based archive too. And so, this is what I needed to fix!
The First Problem
Before anything else, attempting to import a work from Archive of Our Own returns the error:
"We couldn't successfully import that work, sorry: URL is for a work on the Archive. Please bookmark it directly instead."
The upstream OTWA codebase was written by and for Archive of Our Own, so two things stand in the way of importing from AO3 itself:
- PERMITTED_HOSTS: A list of hostnames considered "the Archive". Any import URL whose hostname is in this list is rejected before the download even starts.
- No AO3-specific parser: This one baffled me. Even if the block is removed, the generic fallback parser (parse_story_from_unknown) grabs the entire <body> of the fetched page, which on an AO3 work page is the full AO3 HTML (navigation, header, footer, and so on) rather than just the story text.
Both problems need to be fixed.
Part 1: Remove AO3 from PERMITTED_HOSTS
File: config/config.yml
Search for the PERMITTED_HOSTS key (around line 750). You will find a list that includes AO3's production IP addresses and every AO3 domain variant:
PERMITTED_HOSTS: [
  # Production
  "104.153.64.122",
  "208.85.241.152",
  "208.85.241.157",
  "ao3.org",
  "archiveofourown.com",
  "archiveofourown.net",
  "archiveofourown.org",
  "download.archiveofourown.org",
  "insecure.archiveofourown.org",
  "secure.archiveofourown.org",
  "www.ao3.org",
  "www.archiveofourown.com",
  "www.archiveofourown.net",
  "www.archiveofourown.org",
  # fanfiction.lol <-- this will be your fork's domain
  "fanfiction.lol",
  "www.fanfiction.lol",
  # Staging
  "insecure-test.archiveofourown.org",
  "test.archiveofourown.org",
  "testdownload.archiveofourown.org"
]
Replace the entire block so it contains only your own archive's hostnames:
PERMITTED_HOSTS: [
  # Your archive's own hostnames: imports from these are blocked (bookmark instead)
  "yourdomain.example",
  "www.yourdomain.example",
  "status.yourdomain.example"
]
The list does two jobs: URLs whose hostname appears in it are blocked from being imported (users should bookmark those works, not import them), and URLs whose hostname appears in it are allowed in Abuse Reports.
So, keep your own domain(s) in the list and remove everything AO3-related.
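To make the effect of the list concrete, here is a minimal sketch of a hostname check of this kind. It is not the OTWA implementation: the archive_host? helper and the inline PERMITTED_HOSTS array are made up for illustration, standing in for whatever the codebase does with ArchiveConfig.PERMITTED_HOSTS.

```ruby
require "uri"

# Hypothetical stand-in for ArchiveConfig.PERMITTED_HOSTS after the edit above.
PERMITTED_HOSTS = %w[yourdomain.example www.yourdomain.example status.yourdomain.example].freeze

# Hypothetical helper: true when the URL's hostname is one of the archive's own.
def archive_host?(url)
  host = URI.parse(url).host
  !host.nil? && PERMITTED_HOSTS.include?(host.downcase)
rescue URI::InvalidURIError
  false
end

puts archive_host?("https://yourdomain.example/works/1")   # own archive: import would be refused
puts archive_host?("https://archiveofourown.org/works/1")  # AO3: no longer treated as "the Archive"
```

With the AO3 entries removed, only your own hostnames take the "bookmark it instead" path.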
Bonus: Dead key in config/local.yml
You may find a permitted_hosts key (lowercase) in your config/local.yml. Because of how the OTWA config loader works (app_config.merge!(...) in config/application.rb), the lowercase key is not the same key as PERMITTED_HOSTS (uppercase), so it is silently ignored. The ArchiveConfig.PERMITTED_HOSTS method only reads the uppercase key from config.yml. You can safely remove the lowercase permitted_hosts block from local.yml to avoid confusion.
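A small sketch of why the lowercase key has no effect, assuming the loader boils down to a plain Hash#merge! of two parsed YAML files. The two inline documents are miniature stand-ins for config.yml and local.yml, not copies of the real files.

```ruby
require "yaml"

# Assumed miniature versions of the two files; the real ones hold many more keys.
config_yml = YAML.safe_load(<<~YAML)
  PERMITTED_HOSTS: ["yourdomain.example"]
YAML
local_yml = YAML.safe_load(<<~YAML)
  permitted_hosts: ["archiveofourown.org"]
YAML

# Hash keys are case-sensitive, so merge! adds a second key instead of overriding.
app_config = config_yml.dup
app_config.merge!(local_yml)

p app_config["PERMITTED_HOSTS"]
p app_config.keys
```

The uppercase value survives untouched, and the lowercase entry sits next to it under a name nothing reads.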
Part 2: Add an AO3-specific Story Parser
Once AO3 is removed from PERMITTED_HOSTS, the import will succeed, but the chapter content will be the entire AO3 page HTML: navigation, header, login form, footer, and all. This is because the fallback parser doesn't know AO3's HTML structure.
This is the trickier part. You need to add a custom parser that knows how to extract only the story content from an AO3 work page.
Thankfully, I made this for you already!
File: app/models/story_parser.rb
Make the following three changes:
Change 1: Add the SOURCE_AO3 regex constant
Find the block of SOURCE_* constants (around line 57):
SOURCE_LJ = '((live|dead|insane)journal\.com)|journalfen(\.net|\.com)|dreamwidth\.org'.freeze
SOURCE_DW = 'dreamwidth\.org'.freeze
SOURCE_FFNET = '(^|[^A-Za-z0-9-])fanfiction\.net'.freeze
SOURCE_DEVIANTART = 'deviantart\.com'.freeze
Add a new line before SOURCE_LJ:
SOURCE_AO3 = '(archiveofourown\.org|ao3\.org)'.freeze
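As a quick sanity check of the new constant, the sketch below compiles it and tests it against a few URLs. Compiling with Regexp::IGNORECASE is an assumption about how the source strings are used, and the URLs are examples only.

```ruby
SOURCE_AO3 = '(archiveofourown\.org|ao3\.org)'.freeze

# Assumption: the source string is compiled case-insensitively before matching.
pattern = Regexp.new(SOURCE_AO3, Regexp::IGNORECASE)

urls = [
  "https://archiveofourown.org/works/12345",
  "https://www.ao3.org/works/12345",
  "https://www.fanfiction.net/s/12345"
]

# One boolean per URL: does the AO3 pattern match it?
matches = urls.map { |url| pattern.match?(url) }
urls.zip(matches).each { |url, matched| puts "#{url} => #{matched}" }
```

Because the pattern is not anchored, it matches the hostname wherever it appears in the URL, including on subdomains such as www.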
Change 2: Add ao3 to KNOWN_STORY_PARSERS
Find:
KNOWN_STORY_PARSERS = %w[deviantart dw lj].freeze
You'll see that the importer works for deviantART, Dreamwidth, and LiveJournal by default.
Add ao3 to it:
KNOWN_STORY_PARSERS = %w[ao3 deviantart dw lj].freeze
The list is checked in order and stops at the first match, so ao3 should come first (it is the most specific match for an AO3 URL).
Change 3: Add the parse_story_from_ao3 method
This section required me to inspect the source code for a page of a work on AO3, and use a little regular expression.
Find the parse_story_from_deviantart method and add the following method after it (before shift_chapter_attributes):
def parse_story_from_ao3(_story, detect_tags = true)
  work_params = { chapter_attributes: {} }
  # Title: use the work heading, not the browser tab title
  title_node = @doc.at_css('h2.title.heading')
  work_params[:title] = if title_node
    title_node.inner_text.strip
  else
    @doc.at_css('title')&.inner_text&.sub(/\s*\[Archive of Our Own\]\s*$/i, '')&.strip.to_s
  end
  # Summary
  summary_node = @doc.at_css('.summary.module blockquote.userstuff')
  work_params[:summary] = clean_storytext(summary_node.inner_html) if summary_node
  # Author beginning notes (inside .preface.group, before chapter content)
  preface = @doc.at_css('.preface.group')
  if preface
    notes_node = preface.at_css('.notes.module blockquote.userstuff')
    work_params[:notes] = clean_storytext(notes_node.inner_html) if notes_node
  end
  # Story text: extract only from #chapters .userstuff, not the whole page body
  chapters_div = @doc.at_css('#chapters')
  if chapters_div
    userstuff = chapters_div.at_css('.userstuff')
    storytext = userstuff ? userstuff.inner_html : chapters_div.inner_html
  else
    storytext = @doc.at_css('body')&.inner_html || _story
  end
  work_params[:chapter_attributes][:content] = clean_storytext(storytext)
  if detect_tags
    meta_group = @doc.at_css('dl.work.meta.group')
    if meta_group
      rating = meta_group.css('dd.rating.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:rating_string] = convert_rating_string(rating.first) if rating.any?
      warnings = meta_group.css('dd.warning.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:archive_warning_string] = warnings.join(', ') if warnings.any?
      fandoms = meta_group.css('dd.fandom.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:fandom_string] = clean_tags(fandoms.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if fandoms.any?
      relationships = meta_group.css('dd.relationship.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:relationship_string] = clean_tags(relationships.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if relationships.any?
      characters = meta_group.css('dd.character.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:character_string] = clean_tags(characters.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if characters.any?
      freeforms = meta_group.css('dd.freeform.tags li a.tag').map { |a| a.inner_text.strip }
      work_params[:freeform_string] = clean_tags(freeforms.join(ArchiveConfig.DELIMITER_FOR_OUTPUT)) if freeforms.any?
      published = meta_group.at_css('dd.published')
      work_params[:revised_at] = convert_revised_at(published.inner_text.strip) if published
    end
  end
  post_process_meta(work_params)
end
This is what the parser extracts from an AO3 work page:
| Field | AO3 CSS selector |
|---|---|
| Title | h2.title.heading |
| Summary | .summary.module blockquote.userstuff |
| Author notes | .preface.group .notes.module blockquote.userstuff |
| Story text | #chapters .userstuff |
| Rating | dd.rating.tags li a.tag |
| Archive warnings | dd.warning.tags li a.tag |
| Fandom(s) | dd.fandom.tags li a.tag |
| Relationship(s) | dd.relationship.tags li a.tag |
| Character(s) | dd.character.tags li a.tag |
| Additional tags | dd.freeform.tags li a.tag |
| Published date | dd.published |
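The one piece of regular expression in the method is the title fallback, used only when h2.title.heading is missing. Here is a small standalone sketch of what it does. The sample tab title is made up, and its exact format on a live page is an assumption.

```ruby
# Made-up example of a browser tab title; the real format may differ.
tab_title = "An Example Work - some_author - Example Fandom [Archive of Our Own]"

# Same substitution as the fallback branch in parse_story_from_ao3:
# drop a trailing "[Archive of Our Own]" plus any whitespace around it.
cleaned = tab_title.sub(/\s*\[Archive of Our Own\]\s*$/i, "").strip
puts cleaned

# When there is no <title> node at all, the safe-navigation chain yields nil,
# and the final to_s turns that into an empty string.
no_title = nil
fallback = no_title&.sub(/\s*\[Archive of Our Own\]\s*$/i, "")&.strip.to_s
p fallback
```

An empty title is still something you would want to fix by hand on the preview page before posting.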
Part 3: Deploy and Restart
I am using Docker to run my project, so this is how I restart the Rails application for these changes to take effect:
docker restart <your-web-container-name>
For standard setups, it would look more like this:
touch tmp/restart.txt
# or
bin/rails restart
How the Parser Dispatch Works (for further customization)
When a URL is submitted for import, story_parser.rb calls get_source_if_known which iterates through KNOWN_STORY_PARSERS and tests each entry's SOURCE_* regex against the URL. The first match wins, and the corresponding parse_story_from_<source> method is called.
This means you can add parsers for any archive, not just AO3. For example, to add a parser for yourdomain.example:
- Add SOURCE_YOURARCHIVE = 'yourdomain\.example'.freeze
- Add 'yourarchive' to KNOWN_STORY_PARSERS
- Write def parse_story_from_yourarchive(_story, detect_tags = true) following the same pattern
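The dispatch and the three steps above can be sketched in a toy class. This is not the OTWA StoryParser: ToyStoryParser, its parse method, and the symbol return values are all made up for illustration, and building the constant and method names from the parser name is an assumption based on the naming pattern described above.

```ruby
# Toy stand-in for the dispatch described above; not the real StoryParser class.
class ToyStoryParser
  SOURCE_AO3 = '(archiveofourown\.org|ao3\.org)'.freeze
  SOURCE_YOURARCHIVE = 'yourdomain\.example'.freeze
  KNOWN_STORY_PARSERS = %w[ao3 yourarchive].freeze

  # Returns the first parser name whose SOURCE_* pattern matches the URL, or nil.
  def get_source_if_known(known_sources, location)
    known_sources.find do |source|
      pattern = Regexp.new(self.class.const_get("SOURCE_#{source.upcase}"), Regexp::IGNORECASE)
      location.match?(pattern)
    end
  end

  # Calls parse_story_from_<source>, or the generic fallback when nothing matches.
  def parse(location, story)
    source = get_source_if_known(KNOWN_STORY_PARSERS, location)
    source ? send("parse_story_from_#{source}", story) : parse_story_from_unknown(story)
  end

  # Placeholder parsers: each returns a symbol so the chosen path is visible.
  def parse_story_from_ao3(_story)
    :ao3
  end

  def parse_story_from_yourarchive(_story)
    :yourarchive
  end

  def parse_story_from_unknown(_story)
    :unknown
  end
end

parser = ToyStoryParser.new
puts parser.parse("https://archiveofourown.org/works/12345", "<html></html>")
puts parser.parse("https://yourdomain.example/works/1", "<html></html>")
puts parser.parse("https://somewhere-else.example/story", "<html></html>")
```

Anything that matches no SOURCE_* pattern falls through to the generic parser, which is exactly the behaviour that caused the whole-page imports in the first place.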
Since all OTWA-based forks share the same HTML structure, the same parse_story_from_ao3 method will also work for importing from other OTWA forks! (Superlove, Sunset, SquidgeWorld, etc.) Just add their domains to SOURCE_AO3 or create a separate source constant that shares the same parser method.
For example, to also support imports from superlove.sayitditto.net and sunset.femslash.club:
SOURCE_AO3 = '(archiveofourown\.org|ao3\.org|superlove\.sayitditto\.net|sunset\.femslash\.club)'.freeze
Since all these archives use identical OTWA HTML structure, the same parser handles all of them.
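If you would rather keep SOURCE_AO3 untouched, the "separate source constant" route could look roughly like the sketch below. SOURCE_OTWFORK, the otwfork parser name, and the ToyForkParser class are hypothetical names of mine, and the lookup by constant and method name is assumed from the dispatch described earlier.

```ruby
# Toy class showing two source constants sharing one parser method.
class ToyForkParser
  SOURCE_AO3 = '(archiveofourown\.org|ao3\.org)'.freeze
  # Hypothetical second constant holding the OTWA-based forks.
  SOURCE_OTWFORK = '(superlove\.sayitditto\.net|sunset\.femslash\.club)'.freeze
  KNOWN_STORY_PARSERS = %w[ao3 otwfork].freeze

  # Placeholder for the real AO3 parser; returns a symbol for demonstration.
  def parse_story_from_ao3(_story)
    :parsed_with_ao3_selectors
  end

  # The fork parser is the same method under a second name.
  alias_method :parse_story_from_otwfork, :parse_story_from_ao3

  # Finds the first matching source and calls its parser; nil when none match.
  def parse(location, story)
    source = KNOWN_STORY_PARSERS.find do |name|
      location.match?(Regexp.new(self.class.const_get("SOURCE_#{name.upcase}"), Regexp::IGNORECASE))
    end
    source && send("parse_story_from_#{source}", story)
  end
end

fork_parser = ToyForkParser.new
puts fork_parser.parse("https://superlove.sayitditto.net/works/1", "<html></html>")
puts fork_parser.parse("https://archiveofourown.org/works/1", "<html></html>")
```

The upside of the separate constant is that you can later give the forks their own method without touching the AO3 path.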
Testing, Limitations, and Conclusion
- Go to Post > Import Work on your archive
- Paste an AO3 work URL (e.g. https://archiveofourown.org/works/12345)
- Select a language and press Import
- You should reach a preview page showing only the story text, with tags auto-populated from AO3
As of right now, there are limitations with this parser. To start, restricted works (login-required on AO3) cannot be imported. Next, multi-chapter works imported from the main /works/ID URL will only import the first chapter's content (each chapter needs to be imported separately via its /works/ID/chapters/CHAPTER_ID URL, or you can implement chaptered parsing).
Also, the import does not carry over the original author name, as the work is posted under the importing user's account. So please only import your own work or work you have explicit permission to upload!
Anyways, that's all there is! It's really fun to write out these technical guides even when they only serve a small number of people. If you're from the future and have created a new OTWA-fork project, I hope this guide helps you as well!