Paperspan export (HTML) to Instapaper import (CSV) convertor

Oct 6, 2021 · Tags: Computer Software, Functional Programming, Haskell, YAML, Regular Expression, CSV, HTML, Instapaper, Paperspan, Internet

A Haskell program that converts the Paperspan HTML export format to the Instapaper CSV import format and automatically assigns links to folders, driven by a configuration file. The HXT library is used to parse the Paperspan HTML file, and the CSV result is written to standard output.

Usage: see the Makefile.

Paperspan format

<html>
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>PaperSpan Export</title>
 </head>
 <body>
  <h1>Unread</h1>
  <ul>
   <h2>Read Later</h2>
   <ul>
     <li><a href="https://thisisalink" time_added="1630506259000">This is a <i>description</i>.</a></li>
     ...
   </ul>
  </ul>
...
  <h1>Read</h1>
  <ul>
   <h2>Read Later</h2>
   <ul>
     <li> ...</li>
     ...
   </ul>
  </ul>
 </body>

Paperspan folders

Existing Paperspan folders are reused by the conversion program. Links in the Read Later folder are the exception: they are assigned to folders automatically, using regular expression rules from a configuration file. See the next section for details.

Automatic designation to folders

The folders.yaml configuration file contains the Instapaper target folder names (for the output file) and regular expressions (PCRE) that are matched against the URL or text of each link in the Paperspan export (the input). The selector rules in the configuration file (I have hundreds) are tried against the URL or text of the link being imported until one matches, and the link is then assigned to the folder associated with that rule. This is very useful when you have many unorganized links in Paperspan that you have not yet moved to a folder.
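The rule matching described above can be sketched in a few lines of plain Haskell. This is a minimal illustration, not the converter's actual code: the types `Rule`, `Link` and `Source` and the functions `matches`, `designate` and `targetFolder` are hypothetical names, and a case-insensitive substring test stands in for the PCRE matching so that the sketch needs only the base library.

```haskell
import Data.Char (toLower)
import Data.List (isInfixOf)
import Data.Maybe (listToMaybe)

-- Which part of a link a rule inspects (mirrors "conditionSource").
data Source = Url | Text deriving (Eq, Show)

-- One selector rule. Hypothetical type, not taken from the converter.
data Rule = Rule
  { ruleNeedle :: String  -- stand-in for "conditionRegExp"
  , ruleSource :: Source  -- "conditionSource"
  , ruleFolder :: String  -- "conditionFolderName"
  }

-- A link from the Paperspan export (hypothetical type).
data Link = Link { linkUrl :: String, linkText :: String }

-- Assumption: a case-insensitive substring test replaces the PCRE match.
matches :: Rule -> Link -> Bool
matches r l = map toLower (ruleNeedle r) `isInfixOf` map toLower field
  where
    field = case ruleSource r of
      Url  -> linkUrl l
      Text -> linkText l

-- The first matching rule wins; its folderName is resolved to a folderPath.
-- Nothing means the link stays undesignated.
designate :: [Rule] -> [(String, String)] -> Link -> Maybe String
designate rules folders l = do
  name <- listToMaybe [ruleFolder r | r <- rules, matches r l]
  lookup name folders

-- Existing Paperspan folders are reused; only "Read Later" goes through the rules.
targetFolder :: [Rule] -> [(String, String)] -> String -> Link -> Maybe String
targetFolder rules folders paperspanFolder l
  | paperspanFolder == "Read Later" = designate rules folders l
  | otherwise                       = Just paperspanFolder
```

Because the rules are tried in list order, more specific rules would have to come before more general ones in such a scheme. Returning `Maybe` keeps undesignated links visible instead of dropping them into a default folder.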

For example, the Paperspan link

<a href="https://news360.com/article/563394549"
   time_added="1630495255000">
  Stop prescribing hydroxychloroquine for Covid-19, warn researchers | Stop News – India TV
</a>

is matched with the following selector from folders.yaml:

- "conditionRegExp": "\\bcovid-19\\b"
  "conditionSource": "text"
  "conditionFolderName": "biologyHealth"

which results in designation to the Biology Health folder via its folderName (also in folders.yaml).

- "folderName": "biologyHealth"
  "folderPath": "Biology Health"

and the following CSV line is the result:

https://news360.com/article/563394549,
"Stop prescribing hydroxychloroquine for Covid-19, warn researchers | Stop News – India TV",
https://news360.com/article/563394549,
"Biology Health",
1630495255000
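How such a row could be produced is sketched below, using only the base library. The function names `csvField` and `csvRow` are hypothetical, the column order is simply read off the example above (URL, title, URL again, folder, timestamp), and quoting fields that contain a space is an assumption made to reproduce the quoted "Biology Health". The converter itself may well rely on a CSV library instead.

```haskell
import Data.List (intercalate)

-- Quote a field when it contains a comma, double quote, line break or space.
-- Quoting on spaces is an assumption made to match the example row.
csvField :: String -> String
csvField s
  | any (`elem` ",\"\r\n ") s = "\"" ++ concatMap escape s ++ "\""
  | otherwise                 = s
  where
    -- Embedded double quotes are doubled, as in RFC 4180.
    escape '"' = "\"\""
    escape c   = [c]

-- Build one row: URL, title, URL again, folder, timestamp.
csvRow :: String -> String -> String -> String -> String
csvRow url title folder timeAdded =
  intercalate "," (map csvField [url, title, url, folder, timeAdded])
```

Titles are the field most likely to need escaping, since headlines often contain commas and quotation marks.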

Codeberg

The source code for the convertor program is on Codeberg: photonsphere/paperspan2instapaper.

Disclaimer: this is a 'one shot' program (excuse my Haskell) that I've used only once, to import an export of my 27,689 Paperspan article links into Instapaper. Update: 2,140 links are still undesignated; I am refining the program further and adding more selector rules.

Earlier

Earlier, 2019-02-18-instapaper-export was used to convert an Instapaper HTML export into a Paperspan import.

(Hopping back and forth between these excellent read-later/archiving solutions.)