Extract plain body text from a tex document

I needed to post just the body text of a document as plain ASCII into a submission form, and I thought that should be easy. In fact, I struggled a lot to get even decent results, so I decided to write this little how-to on getting the best plain-text version of the body text. I am only interested in the body text, which means that figures, footnotes, equations, headers, page numbers, … should not be included in the final result. The first section therefore explains some ways to get rid of text/equations/figures/… that do not belong to the body text, and the second section explains how to produce the plain text.

Getting rid of unwanted text

The goal of this section is to remove all passages with text/figures/tables/etc. that do not belong to the body text.

Pruning environments

First I want to get rid of the aforementioned footnotes, equations, etc. Since a lot of the stuff we want to get rid of is written in environments like \begin{} ... \end{}, the comment package comes in handy. It can be put in the preamble like this:
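A minimal sketch of such a preamble fragment, written as a shell heredoc so it can be generated next to the main file and pulled in with \input in the preamble. The file name prune_env.tex and the list of environments are assumptions, not taken from the original setup:

```shell
# Write a preamble fragment that drops whole environments via the comment package.
# File name and environment list are assumptions; adapt them to your document.
cat > prune_env.tex <<'EOF'
\usepackage{comment}
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}
\excludecomment{align}
EOF
```

Keep in mind that the comment package expects \begin{...} and \end{...} of an excluded environment to stand on lines of their own.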

The example removes the output of the given environments (figure, table, …) from the document.

Note: The references can no longer be resolved, and thus many ?? will appear in the final output. But we are just interested in the body text, right?

Pruning commands

Some commands (footnote, bibliography, …) produce extra text in the document that does not belong to the real body text. Again, if they are removed, referenced or cited entries are lost and appear as ?? in the final output. The following example gives just a snapshot of commands one may want to prune. These depend heavily on the project and the document class, and might need to be adapted. The approach is to get rid of the commands by simply renewing their definition with an empty one like {}. If the original command takes a number # of parameters, this number needs to be given in the redefinition in [].
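As an illustration of the redefinition approach, here is a hedged sketch that writes a few empty redefinitions into a fragment for the preamble. The file name prune_cmd.tex and the choice of commands are assumptions; which commands to silence depends on the project:

```shell
# Write a preamble fragment that redefines some commands to produce nothing.
# File name and command selection are assumptions; adapt them to your project.
cat > prune_cmd.tex <<'EOF'
% \footnote has one optional and one mandatory argument: [2] with an empty default.
\renewcommand{\footnote}[2][]{}
% \bibliography takes one argument (the list of .bib files).
\renewcommand{\bibliography}[1]{}
% Commands without arguments need no parameter count.
\renewcommand{\tableofcontents}{}
EOF
```

The fragment has to be loaded after the packages that define these commands, otherwise \renewcommand has nothing to renew.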

 

Remove header, footer, page numbers, …

The pagestyle command can be used as follows to remove the header and footer. It needs to be set in the preamble:
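A minimal sketch, again generated as a small fragment for the preamble (the file name prune_page.tex is an assumption):

```shell
# Write a preamble fragment that switches off header, footer and page numbers.
cat > prune_page.tex <<'EOF'
\pagestyle{empty}
EOF
```

Pages that set their own style, for example the title page or chapter openings, which use \thispagestyle{plain} in the standard classes, may still carry a page number.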

Produce plain text output from tex

This section gives examples of how to convert a tex document to plain text.

Flattening the project structure

I have experienced many problems with hierarchical project structures, where the main tex document is split into separate files that are included with commands like input, include or import. Thus, I’ve adapted a Python script called master2single.py to produce a completely flattened tex file in which all includes, etc. are resolved. It can be executed as follows:

Note 1: Since some compilers complain about some weird comments, it is safer to remove all comments from the flattened output via the -c option.

Note 2: The import command is for me the most convenient version because it seems to have no limitations regarding the project’s folder depth.
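The exact invocation of master2single.py is not reproduced here. As a stand-in, and explicitly not the script described above, the following sketch uses latexpand, which ships with TeX Live, resolves \input and \include, and strips comments by default. Whether it copes with \import in your project is something to check; the function and file names are placeholders:

```shell
# Flatten a multi-file project with latexpand (a stand-in, not master2single.py).
flatten_tex() {
    # $1: main tex file, $2: flattened output file
    latexpand "$1" > "$2"
}

# Usage (file names are placeholders):
# flatten_tex main.tex flat.tex
```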

Solution 1: pandoc

pandoc is a great tool when it comes to markup language parsing and other text operations. However, pandoc does not really produce plain text from the tex document, but a markdown version of it. This is very nice if you want to post the whole document on a web page, but it ignores the exclusion of environments via the comment package. Thus, equations and tables, but not figures, are included in the output:
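A call that produces such a markdown file could look like the following sketch. The wrapper function and the file names are placeholders; the pandoc options themselves are standard:

```shell
# Convert a (flattened) tex file to markdown with pandoc.
tex_to_markdown() {
    # $1: input tex file, $2: output markdown file
    pandoc --from=latex --to=markdown --output="$2" "$1"
}

# Usage (file names are placeholders):
# tex_to_markdown flat.tex flat.md
```

pandoc also offers a plain writer (--to=plain), which might be worth a try, but the excluded environments show up there as well.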

Solution 2: latex2rtf (not really)

For very simple documents, latex2rtf, which ships with the standard latex installation on most OS flavours, might be a solution. Unfortunately, it cannot handle most of even the simple latex commands and thus ends with an error.

Solution 3: Convert PDF to plain text (best output)

It sounds counter-intuitive, but first producing a PDF from the tex file and then converting the PDF to plain text seems to produce the most satisfying results. Possible tools are ebook-convert from Calibre or pdftotext. I’ve only used pdftotext, which produces a very satisfying output:

Note 1: Of course, the file does not need to be flattened for this solution.

Note 2: The -layout argument preserves the sections’ layout and thus produces even better results.
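The two steps can be sketched as follows. The function name and main.tex are placeholders, and the sketch assumes it is run from the directory of the main file, so that the PDF ends up next to it:

```shell
# Build the PDF and convert it to plain text while keeping the layout.
tex_to_text() {
    # $1: main tex file, e.g. main.tex
    base="${1%.tex}"
    pdflatex -interaction=nonstopmode "$1" &&
    pdftotext -layout "$base.pdf" "$base.txt"
}

# Usage (file name is a placeholder):
# tex_to_text main.tex
```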

Other solutions

There might be a bunch of other solutions out there, but some of them are outdated or hard to handle, or I haven’t found them yet. Here is an incomplete list of other tools and explanations which might be helpful:

 

Run bitbake in Virtual Box with Shared Folder

I just had to update the build host for the embedded operating system of the AMiRo Autonomous Mini Robot to Ubuntu 16.04. Since the old-fashioned Yocto 1.7 only bakes on Ubuntu 14.04, I had to virtualize the machine. But since the downloads and builds take up a huge amount of space, my approach was to keep the virtualized build distribution (aka guest) as slim as possible and make use of the vast amount of super fast SSD_PCIE space of the Virtual Box host. There are two possibilities:
APPROACH 1: Use a Virtual Box Shared Folder so that the guest accesses a folder on the SSD_PCIE drive as a network drive
APPROACH 2: Physically mount the SSD_PCIE disk inside the guest distribution

To spare you the long read: APPROACH 1 does not work because hard links between host and guest are unsupported, and it is also horribly slow. APPROACH 2 is the way to go and runs super fast. For didactic reasons, both approaches are listed.

Tested on the following configuration

  • Host: Ubuntu Xenial 16.04, VBox 5.2.2 + GuestAdditions
  • Guest: Ubuntu Trusty 14.04

APPROACH 1: Shared Folder

Used acronym: VB_SF – VirtualBox Shared Folder

Shared folder settings

  • Location on host (Choose any location you like): /mnt/ssd_pcie/yocto_build_root
  • Name: yocto_build_root
  • Shared folder settings: Read only, [ ] Automount, [x] Permanent
  • Add necessary users to groups
    • /etc/group on host: vboxusers:x:<gid>:<host_user>
    • /etc/group on guest: vboxsf:x:<gid>:<guest_user>

Issue 1: Symlinks are disabled for security reasons in VB_SF

Courtesy: https://ahtik.com/fixing-your-virtualbox-shared-folder-symlink-error/

Enable Symlinks in Virtual Box by entering the following commands on your host machine:

Example:
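A sketch of the call, wrapped in a small helper. The VM name "yocto-guest" is hypothetical; the share name is the one from the settings above. The setting is typically only picked up after the VM has been restarted:

```shell
# Allow the guest to create symlinks inside a given shared folder.
enable_sf_symlinks() {
    # $1: name of the VM, $2: name of the shared folder
    VBoxManage setextradata "$1" \
        "VBoxInternal2/SharedFoldersEnableSymlinksCreate/$2" 1
}

# Usage (VM name is hypothetical):
# enable_sf_symlinks "yocto-guest" yocto_build_root
```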

Issue 2: Correct permission on VB_SF to fix issues like “tar wants to set utime”

Courtesy: http://redwarrior.org/blog/?p=61

For some reason, uid and gid are not correctly set in Virtual Box guests, so it needs to be done manually. Add to /etc/rc.local on the guest system (mount yocto_build_root to /media/yocto_build_root):
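The lines for /etc/rc.local could look like the following sketch. It writes them to a snippet file first, so they can be reviewed and then pasted above the final exit 0. The uid and gid values are assumptions; look up the real ones with id <guest_user> on the guest:

```shell
# uid/gid of the guest user who runs bitbake (assumed values, check with `id`).
GUEST_UID=1000
GUEST_GID=1000

# Generate the lines to paste into /etc/rc.local on the guest.
cat > rc.local.snippet <<EOF
mkdir -p /media/yocto_build_root
mount -t vboxsf -o uid=${GUEST_UID},gid=${GUEST_GID} yocto_build_root /media/yocto_build_root
EOF
```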

Issue 3: Hard links are not allowed in shared folders

Allow only symlinks by adding a script (an alias won’t work) that fakes the native ln command. Execute the following line on the guest system:
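The original one-liner is not reproduced here. A sketch of the same idea, under the assumption that the build finds ln via PATH, is a wrapper script that always passes -s to the real ln. The directory fake-bin is a hypothetical location:

```shell
# Directory for the wrapper; a hypothetical location, any directory works.
FAKE_BIN="${FAKE_BIN:-$PWD/fake-bin}"
# Resolve the native ln before PATH is changed (run this block only once per shell).
REAL_LN="$(command -v ln)"

mkdir -p "$FAKE_BIN"
cat > "$FAKE_BIN/ln" <<EOF
#!/bin/sh
# Always create symbolic links, whatever the caller asked for.
exec "$REAL_LN" -s "\$@"
EOF
chmod +x "$FAKE_BIN/ln"

# Put the wrapper first in PATH for this shell and its child processes.
PATH="$FAKE_BIN:$PATH"
export PATH
```

Note that a symlink with a relative target is resolved relative to the link's location, so links created across directories may behave differently from the hard links they replace.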

Issue 4: Insufficient inodes on VB_SF (can be inspected with ‘df -i’)

If you execute bitbake on the guest, it will complain about insufficient inodes. So you have to decrease the inode thresholds in the sanity check of the bitbake configuration in ./build/conf/local.conf from 100k/1k to 999:

Issue 5 UNSOLVED: If a tar archive contains hard links, they cannot be extracted

I don’t know how to fix this, so APPROACH 2 is the way to go!
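The change can be sketched with sed on a sample file. The sample assumes that local.conf still contains the stock BB_DISKMON_DIRS block, where the fourth value of each entry is the inode threshold; the file names local.conf.sample and local.conf.patched are placeholders for trying it out before touching the real file:

```shell
# Stand-in for ./build/conf/local.conf with the stock disk monitor block (assumed layout).
cat > local.conf.sample <<'EOF'
BB_DISKMON_DIRS = "\
    STOPTASKS,${TMPDIR},1G,100K \
    STOPTASKS,${DL_DIR},1G,100K \
    STOPTASKS,${SSTATE_DIR},1G,100K \
    ABORT,${TMPDIR},100M,1K \
    ABORT,${DL_DIR},100M,1K \
    ABORT,${SSTATE_DIR},100M,1K"
EOF

# Replace the inode threshold (4th field) of every entry with 999.
sed -E 's/((STOPTASKS|ABORT),[^,]+,[^,]+),(100K|1K)/\1,999/' \
    local.conf.sample > local.conf.patched
```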

APPROACH 2: Physically mounted disk

Courtesy: https://superuser.com/questions/495025/use-physical-harddisk-in-virtual-box

  • Add user to ‘disk’ group on host: usermod -a -G disk <host_user>
  • Create a vmdk file of the hard drive (not partition): VBoxManage internalcommands createrawvmdk -filename /path/to/file.vmdk -rawdisk /dev/<disk>
  • Add vmdk file to mountable discs
  • Set mode: File -> Virtual Media Manager -> Choose disc -> Type: WriteThrough

Prepare guest for bitbake

Courtesy: http://www.yoctoproject.org/docs/1.7.3/ref-manual/ref-manual.html
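The original package list is not reproduced here. The sketch below uses the essential packages for Ubuntu as I recall them from the linked manual, so treat the list as an assumption and verify it against the manual before relying on it:

```shell
# Essential host packages for Yocto 1.7 on Ubuntu (assumed list, verify with the manual).
YOCTO_PKGS="gawk wget git-core diffstat unzip texinfo gcc-multilib build-essential chrpath socat"

install_yocto_deps() {
    # Word splitting of $YOCTO_PKGS is intentional: one argument per package.
    sudo apt-get install -y $YOCTO_PKGS
}

# Usage on the guest:
# install_yocto_deps
```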

 

Back from Brazil

We struggled and fought and reached 4th place at the RoboCup@Home competition. More posts are coming again from now on.

Running ipython notebook on a remote (outdated) ubuntu system

Note: Jump to the end of this post to install a basic ipython-notebook environment.

ipython notebook is one of the most impressive things I’ve seen in the last few years. You can reach outstanding results which are documented and calculated on the same page. So no more paper war and mixed solutions for my problems anymore ;).

bad IPs

After nmap-ing one of my servers, I found some ugly IPs trying to brute-force my SSH accounts.
After writing a script for my SSH daemon which logs bad IPs and adds them to hosts.deny, I thought about all the other daemons on my server.

Not willing to write fancy scripts for all the others, I installed the nice tool fail2ban, which adds bad IPs to a local banlist.
It works like a charm, and best of all: there is a nice frontend called bad IPs.
A nice visualization for every server owner 😉

Redemption of the UnityMedia Technicolor tc7200

One of the biggest mistakes I have ever made was switching to UnityMedia (UM). Dazzled by the great offers, I could not imagine about a month ago what problems were coming my way. Besides DS-Lite, which is just about bearable, UM ships an absolutely faulty, non-functional modem/router that is questionable in terms of security:

Using vncviewer without typing in the password

Hi,

it is more convenient to use vncviewer without typing in the password every time.
To do so, just create your own password file in the following way:

After you have typed in the password twice in plain text, just use the created file to connect to the server:
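A sketch of this step, assuming the vncpasswd tool that comes with the common VNC packages. The helper name and the file path are placeholders:

```shell
# Create a VNC password file; vncpasswd prompts for the password twice.
create_vnc_passfile() {
    # $1: path of the password file to create
    vncpasswd "$1"
}

# Usage (path is a placeholder):
# create_vnc_passfile "$HOME/.vnc/passwd_myserver"
```

The file only holds an obfuscated password, so it should be readable by your user alone.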
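Connecting with the password file can be sketched like this. The server name, display number and file path are placeholders:

```shell
# Connect to a VNC server using a previously created password file.
connect_vnc() {
    # $1: password file, $2: server and display, e.g. myserver:1
    vncviewer -passwd "$1" "$2"
}

# Usage (values are placeholders):
# connect_vnc "$HOME/.vnc/passwd_myserver" myserver:1
```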

Greetz!