I needed to post just the body text of a LaTeX document as plain ASCII for a submission to a form, and I thought that should be easy. In fact, I struggled a lot to get even decent results, so I decided to write this little how-to on getting the best possible plain-text version of the body text. I am only interested in the body text, which means that figures, footnotes, equations, headers, page numbers, … should not be included in the final result. The first section therefore explains some ways to get rid of text, equations, figures, … that do not belong to the body text, and the second section explains how to produce the plain text.
Getting rid of unwanted text
The goal of this section is to remove all passages (text, figures, tables, etc.) that do not belong to the body text.
Pruning environments
First I want to get rid of the aforementioned footnotes, equations, etc. A lot of the material we want to remove is written in environments of the form \begin{...} ... \end{...}, and a useful package called comment already exists for excluding them. It can be put in the preamble like this:
    % ...
    \usepackage{comment}

    % Remove figure environment
    \excludecomment{figure}
    \let\endfigure\relax

    % Remove table environment
    \excludecomment{table}
    \let\endtable\relax

    % Remove equation environment
    \excludecomment{equation}
    \let\endequation\relax

    % Remove align environment
    \excludecomment{align}
    \let\endalign\relax

    % Remove gather environment
    \excludecomment{gather}
    \let\endgather\relax
    % ...

    \begin{document}

    % ...
The example removes the output defined in the given environments (figure, table, …) from the document.
Note: References to the removed environments can no longer be resolved, so many ?? will appear in the final output. But we are just interested in the body text, right?
Pruning commands
Some commands (footnote, bibliography, …) produce extra text in the document that does not belong to the actual body text. Again, once they are removed, referenced or cited entries are lost and appear as ?? in the final output. The following example is just a snapshot of commands one might want to prune. They depend heavily on the project and the document class, so the list may need to be adapted. The approach is to get rid of a command by redefining it with an empty body, {}. If the original command takes parameters, their number needs to be given in [] in the redefinition.
    % ...
    \renewcommand{\footnote}[1]{}
    \renewcommand{\bibliography}[1]{}
    \renewcommand{\linenumbers}{}
    \renewcommand{\maketitle}{}
    % ...

    \begin{document}

    % ...
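Because the list of environments and commands to prune changes from project to project, it can be handy to generate these preamble lines instead of typing them by hand. The following is only a minimal sketch: the function name pruning_preamble and its two parameters are made up for this illustration, and it only covers commands with a fixed number of mandatory arguments.

```python
def pruning_preamble(environments, commands):
    """Build LaTeX preamble lines that blank out environments and commands.

    environments: names of environments to exclude via the comment package.
    commands: mapping of command name (without backslash) to its number of
              mandatory arguments (hypothetical input format for this sketch).
    """
    lines = ["\\usepackage{comment}"]
    for env in environments:
        # Same two lines per environment as in the preamble example above.
        lines.append("\\excludecomment{" + env + "}")
        lines.append("\\let\\end" + env + "\\relax")
    for name, nargs in commands.items():
        # The argument count goes in [] and is omitted for zero arguments.
        arg_spec = "[" + str(nargs) + "]" if nargs else ""
        lines.append("\\renewcommand{\\" + name + "}" + arg_spec + "{}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(pruning_preamble(
        ["figure", "table", "equation", "align", "gather"],
        {"footnote": 1, "bibliography": 1, "linenumbers": 0, "maketitle": 0},
    ))
```

The printed lines can be pasted into the preamble before \begin{document}. The argument counts still have to be checked against the document class in use.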
Remove header, footer, page numbers, …
The pagestyle command can be used as follows to remove the header and footer. It needs to be set in the preamble:
    % ...
    \pagestyle{empty}
    % ...

    \begin{document}

    % ...
Produce plain text output from tex
This section gives examples of how to convert a tex document to plain text.
Flattening the project structure
I have experienced many problems with hierarchical project structures, where the main tex document is split into separate files that are pulled in with commands like input, include or import. Thus, I’ve adapted a Python script called master2single.py to produce a completely flattened tex file in which all includes etc. are resolved. It can be executed as follows:
    master2single.py main.tex main_flattend.tex -c
Note 1: Since some compilers complain about weird comments, it is safer to remove all comments from the flattened output via the -c option.
Note 2: The import command is for me the most convenient version because it seems to have no limitations regarding the project’s folder depth.
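To make the idea of flattening more concrete, here is a minimal sketch of what such a step does. It is not the master2single.py script; the names flatten and resolve are made up for this illustration. It assumes that include paths are given relative to the main file's folder, it only handles \input{...} and \include{...} (not \import), and it ignores verbatim text and circular includes. Comments are reduced to a bare % rather than deleted, so that line endings behave as before.

```python
import re
from pathlib import Path
from typing import Optional

# \input{file} or \include{file}; \import{dir}{file} is NOT handled in this sketch.
INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")
# From the first % on a line that is not escaped as \% to the end of the line.
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def resolve(root: Path, name: str) -> Path:
    """Map an include argument to a file below root, adding .tex if missing."""
    target = root / name.strip()
    if target.suffix != ".tex":
        target = target.with_name(target.name + ".tex")
    return target


def flatten(path: Path, root: Optional[Path] = None,
            strip_comments: bool = True) -> str:
    """Return the text of path with included files inlined recursively."""
    root = path.parent if root is None else root
    lines = []
    for line in path.read_text(encoding="utf-8").splitlines():
        # Split the line into its code part and its comment part.
        match = COMMENT_RE.search(line)
        code = line[:match.start()] if match else line
        comment = match.group(0) if match else ""
        if comment and strip_comments:
            comment = "%"  # keep the % so the line break is still swallowed
        # Only the code part is searched, so commented-out includes are skipped.
        code = INCLUDE_RE.sub(
            lambda m: flatten(resolve(root, m.group(1)), root, strip_comments),
            code,
        )
        lines.append(code + comment)
    return "\n".join(lines)
```

Something like Path("main_flattend.tex").write_text(flatten(Path("main.tex")), encoding="utf-8") would then write the flattened file.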
Solution 1: pandoc
pandoc is a great tool when it comes to markup language parsing and other text operations. However, pandoc does not really produce plain text from the tex document here, but a Markdown version of it. This is very nice if you want to post the whole document on a web page, but it ignores the exclusion of environments via the comment package. Thus, equations and tables, but not figures, are included in the output:
    pandoc -o output.txt main_flattend.tex
Solution 2: latex2rtf (not really)
For very simple documents, latex2rtf, which comes with the standard latex installation on most OS flavours, might be a solution. Unfortunately, it cannot handle most of the simple latex commands and thus results in an error.
    latex2rtf main_flattend.tex
Solution 3: Convert PDF to plain text (best output)
It sounds counter-intuitive, but first producing a PDF from the tex file and then converting the PDF to plain text seems to produce the most satisfying results. Possible tools are ebook-convert from Calibre or pdftotext. I’ve only used pdftotext, which produces a very satisfying output:
    pdftotext -layout main.pdf main.txt
Note 1: Of course, the file does not need to be flattened for this solution.
Note 2: The -layout argument preserves the original layout of the text and thus produces even better results.
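Since the goal was plain ASCII, the text file may still need a little cleanup: pdftotext writes UTF-8 by default, so typographic quotes, dashes, ligatures and accented letters can survive, and pages are separated by form feeds. The following is a minimal sketch of such a post-processing step. The names REPLACEMENTS, to_ascii and join_hyphenated are made up for this illustration, and the replacement table is certainly incomplete.

```python
import re
import unicodedata

# Characters that Unicode normalisation alone does not turn into ASCII.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # typographic single quotes
    "\u201c": '"', "\u201d": '"',   # typographic double quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
    "\f": "\n",                     # form feed between pages
}


def to_ascii(text: str) -> str:
    """Reduce text to plain ASCII, dropping whatever cannot be mapped."""
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    # NFKD splits ligatures into letters and accented letters into
    # base letter plus combining mark.
    text = unicodedata.normalize("NFKD", text)
    # Everything still outside ASCII (e.g. the combining marks) is dropped.
    return text.encode("ascii", "ignore").decode("ascii")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line end."""
    # Caveat: this also joins genuine compounds broken at their hyphen.
    return re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)
```

It could be applied to the content of main.txt, for example with to_ascii(join_hyphenated(text)). Characters without an ASCII counterpart are silently lost, so the result is worth a quick proofread.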
Other solutions
There are probably a bunch of other solutions out there, but some of them are outdated or hard to handle, and others I haven’t found yet. Here is an incomplete list of other tools and explanations which might be helpful:
- wikibooks: Convert to plain text
- detex
- catdvi
- tex -> HTML -> plain text via pdfreflow