Proof of Concept – Migrating Notes RichText to Markdown – Part II
As explained here, some preconditions have to be met for the conversion. Real-life usage may (and should) be more elaborate. But I wanted to provide a way that achieves the goal in a (hopefully) simple manner - so that everybody can easily re-create the results for themselves.
Technology decisions
You can get the content of a NotesRichText field in three different flavors:
- native NotesRichText (or DXL)
- MIME
- HTML
The content of a RichText field can, for example, be stored as HTML/MIME when you set a specific field property in the form’s design. I’m not doing this - as there was the precondition “Make no modifications to the existing design of the database”. One could suggest putting an HTTP plug-in or an agent on the Domino server that does the MIME conversion on the fly. But don’t forget - I don’t want to put anything on the IBM Domino server. And I want to be able to get the data from an R6 server if needed.
Accessing the field content natively with the NotesRichText classes (in LotusScript or Java) isn’t an option either. Otherwise I wouldn’t fulfill the precondition “The conversion should run on a machine that has no Notes/Domino technology installed”.
That leaves HTML as the source format to go with. HTML (and RichText in general) brings its own caveats which I’ll discuss in the last part of this series.
Having decided on HTML, it’s quite easy to select the communication channel for the conversion process. As the process should be able to run on a remote machine, it’ll be HTTP(S). There are also several technologies and libraries available that convert HTML to Markdown and that also handle HTTP(S) communication. I’m picking Java as it’s (still) my language of choice.
Nuts and bolts
The implementation is now pretty straightforward.
Two built-in functions in Domino’s HTTP stack give us everything we need for gathering the data.
http://fqhn/path/to/database.nsf/viewname/unid/fieldname?OpenField
This (really old) “trick” returns the content of the given field as HTML right in your browser. The Domino HTTP server does all the heavy lifting of converting the RichText to HTML. The view used can be any view in the database - there are no special preconditions to meet.
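As a minimal sketch of this call from Java, the following builds the field URL and fetches the HTML with the JDK's built-in HttpClient (Java 11 or later). The class, method and parameter names are made up for illustration. It also assumes that the view and field names are URL-safe and that the server allows the request without authentication.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of any Domino API.
final class OpenFieldUrl {

    // Builds <baseUrl>/<viewName>/<unid>/<fieldName>?OpenField.
    // Assumption: baseUrl is the database URL without a trailing slash
    // and none of the parts need URL encoding.
    static String build(String baseUrl, String viewName, String unid, String fieldName) {
        return baseUrl + "/" + viewName + "/" + unid + "/" + fieldName + "?OpenField";
    }

    // Fetches the HTML that Domino renders for the field.
    // Assumption: anonymous access; authentication is not handled here.
    static String fetchHtml(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status "
                    + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) {
        // Only prints the URL; no network access happens here.
        System.out.println(build("http://fqhn/path/to/database.nsf",
                "viewname", "0123456789ABCDEF0123456789ABCDEF", "Body"));
    }
}
```

Passing the HttpClient in as a parameter keeps one client instance reusable across all documents of a view.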
As we need the document UNIDs, we’re using a similarly old URL call to get all documents in the view.
http://fqhn/path/to/database.nsf/viewname?ReadAllEntries
This gives us some metadata of the documents. To get a format other than XML (I don’t like XML) and to get all documents - assuming there are at most 1,000 - we’re appending two parameters to the URL.
http://fqhn/path/to/database.nsf/viewname?ReadAllEntries&outputformat=JSON&count=1000
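To illustrate how the UNIDs could be pulled out of that response, here is a small sketch that builds the view URL and extracts the UNIDs with a regular expression. It assumes that each view entry in the JSON carries an "@unid" attribute holding a 32-character hex string; check this against the output of your own server. The class and method names are hypothetical. A real JSON parser would be more robust than the regex, which is used here only to stay within the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any Domino API.
final class ViewEntries {

    // URL command as written in this post. Domino's documentation lists the
    // view-reading command as ReadViewEntries, so adjust this if needed.
    static final String COMMAND = "ReadAllEntries";

    // Assumption: entries look like {"@unid": "<32 hex chars>", ...}.
    private static final Pattern UNID =
            Pattern.compile("\"@unid\"\\s*:\\s*\"([0-9A-Fa-f]{32})\"");

    // Builds <baseUrl>/<viewName>?<COMMAND>&outputformat=JSON&count=<count>.
    static String buildUrl(String baseUrl, String viewName, int count) {
        return baseUrl + "/" + viewName + "?" + COMMAND
                + "&outputformat=JSON&count=" + count;
    }

    // Collects all UNIDs in the order they appear in the JSON text.
    static List<String> extractUnids(String json) {
        List<String> unids = new ArrayList<>();
        Matcher matcher = UNID.matcher(json);
        while (matcher.find()) {
            unids.add(matcher.group(1));
        }
        return unids;
    }

    public static void main(String[] args) {
        // Only prints the URL; no network access happens here.
        System.out.println(buildUrl("http://fqhn/path/to/database.nsf", "viewname", 1000));
    }
}
```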
So we now have a way to get the RichText of all documents as HTML. The next logical step is to convert that HTML to Markdown. There are a few Java libraries available that do this job. After evaluating some of them, like jHTML2Md and MarkdownJ, I’m going with Remark. Remark is easy to use and has some nice features that help in the conversion process. One example is the convertFragment() method that parses only the body tag of the given HTML.
Remark then returns the Markdown text as a String value, which you can use to create Markdown files on your file system, store in a MongoDB or put elsewhere.
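For the file system option, a minimal sketch could look like the following. It takes a Markdown String that has already been converted (the Remark call is left out, as it needs the external library) and writes it as a UTF-8 file named after the document's UNID. The class name, the method name and the naming scheme are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for storing converted documents as files.
final class MarkdownFiles {

    // Writes the Markdown text to <targetDir>/<unid>.md and returns the path.
    // Assumption: one file per document, named after the document's UNID.
    static Path write(Path targetDir, String unid, String markdown) throws IOException {
        Files.createDirectories(targetDir);
        Path file = targetDir.resolve(unid + ".md");
        // An existing file with the same name is overwritten.
        return Files.writeString(file, markdown, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("markdown-export");
        Path file = write(dir, "0123456789ABCDEF0123456789ABCDEF", "# Hello\n");
        System.out.println("Wrote " + file);
    }
}
```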
The next (and final) post in this series shows you how to achieve this in code.