Extract plain body text from tex document

I needed to post just the body text as plain ASCII for a submission to a form, and I thought, well, that should be easy. In fact, I struggled a lot to get even decent results. So I decided to write this little how-to on getting the best plain-text version of the body text. I am only interested in the body text, which means that figures, footnotes, equations, headers, page numbers, … should not be included in the final result. The first section therefore explains some ways to get rid of text/equations/figures/… that do not belong to the body text, and the second section explains how to produce the plain text.

Getting rid of unwanted text

The goal of this section is to remove all passages with text/figures/tables/etc. that do not belong to the body text.

Pruning environments

First I want to get rid of the aforementioned footnotes, equations, etc. A lot of the stuff we want to get rid of is written in environments like \begin{} ... \end{}, and a useful package called comment already exists for exactly this purpose. It can be put in the preamble like this:
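A minimal sketch of how this might look, assuming the environments to drop are figure, table and equation (the list is my assumption here, so adjust it to your document). Note that the comment package expects the \begin{...} and \end{...} of an excluded environment to each sit on a line of their own:

```latex
\documentclass{article}
\usepackage{comment}

% Treat these environments as comments, so their content is dropped.
% The environment list is an assumption; adjust it to your document.
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}

\begin{document}
Body text that should survive.
\begin{equation}
E = mc^2
\end{equation}
More body text that should survive.
\end{document}
```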

The example removes the output defined in the given environments (figure, table, …) from the document.

Note: The references can no longer be resolved, and thus many ?? will appear in the final output. But we are just interested in the body text, right?

Pruning commands

Some commands (footnote, bibliography, …) produce extra text in the document that does not belong to the real body text. But again, if they are removed, referenced or cited entries will be lost and appear as ?? in the final output. The following example gives just a snapshot of commands one might want to prune. These depend very much on the project and the document class, and might need to be altered. The approach is to get rid of the commands by simply renewing their definition with an empty one like {}. If the original command takes a number # of parameters, this number needs to be given in the redefinition in [].
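As a rough sketch, assuming a standard class such as article and a BibTeX-style \bibliography command, the redefinitions could look like this. The selection of commands is only illustrative, and the file name references is made up:

```latex
\documentclass{article}

% One mandatory parameter each, hence [1]; the argument is swallowed.
% Caveat: a call with an optional argument, like \footnote[3]{...},
% is not covered by this simple redefinition.
\renewcommand{\footnote}[1]{}
\renewcommand{\bibliography}[1]{}

% No parameters, so no [] is needed.
\renewcommand{\tableofcontents}{}

\begin{document}
Body text\footnote{This footnote text is dropped.} continues here.
% "references" is a hypothetical .bib file name; nothing is printed.
\bibliography{references}
\end{document}
```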

 

Remove header, footer, page numbers, …

The pagestyle command can be applied as follows to remove header and footer. This command needs to be set in the preamble:
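A minimal sketch using the standard empty page style. One thing worth checking in your own document: in the standard classes, \maketitle switches the page style for its own page, so that page may need a \thispagestyle{empty} as well:

```latex
\documentclass{article}

% No header, no footer, no page number on ordinary pages.
\pagestyle{empty}

\title{Some title}
\author{Some author}

\begin{document}
\maketitle
% \maketitle sets the "plain" style (page number) for its own page
% in the standard classes, so override it here.
\thispagestyle{empty}

Body text.
\end{document}
```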

Produce plain text output from tex

This section gives examples of how to convert a tex document to plain text.

Flattening the project structure

I have experienced many problems with hierarchical project structures, where the main tex document is split into separate files that are included with commands like input, include or import. Thus, I’ve adapted a Python script called master2single.py to produce a completely flattened tex file, where all includes, etc. are resolved. It can be executed as follows:

Note 1: Since some compilers complain about some weird comments, it is safer to remove all comments from the flattened output via the -c option.

Note 2: The import command is for me the most convenient version because it seems to have no limitations regarding the project’s folder depth.
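For illustration, a small sketch of the import package, assuming that this is the import command meant here and assuming a hypothetical layout with a chapters/ folder that has its own sections/ subfolder. All file and folder names are made up:

```latex
% main.tex
\documentclass{article}
\usepackage{import}

\begin{document}
% \import{<path>}{<file>}: the path must end with a slash.
\import{chapters/}{introduction}
\end{document}

% chapters/introduction.tex would then pull in nested files relative
% to its own folder, for example:
%   \subimport{sections/}{background}
```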

Solution 1: pandoc

pandoc is a great tool when it comes to markup language parsing and other text operations. However, the way I used it, pandoc does not really produce plain text from the tex document, but a Markdown version of it. This is very nice if you want to post the whole document on a web page, but it disregards the exclusion of environments via the comment package. Thus, equations and tables, but not figures, are included in the output:

Solution 2: latex2rtf (not really)

For very simple documents, latex2rtf, which comes with the standard latex installation on most OS flavours, might be a solution. Unfortunately, it cannot handle most of the simple latex commands and thus exits with an error.

Solution 3: Convert PDF to plain text (best output)

It sounds counter-intuitive, but first producing a PDF from the tex file and then converting the PDF to plain text seems to produce the most satisfying results. Possible programs are ebook-convert from Calibre or pdftotext. I’ve only used pdftotext, which produces a very satisfying output:

Note 1: Of course, the file does not need to be flattened with this solution.

Note 2: The -layout argument preserves the sections’ layout and thus produces even better results.

Other solutions

There might be a bunch of other solutions out there, but some of them are outdated or hard to handle, or I haven’t found them yet. Here is an incomplete list of other tools and explanations which might be helpful:

 
