Tear down the walls of secretiveness!

# Extract plain body text from tex document

I needed to post just the body text as plain ASCII in a submission form, and I thought that should be easy. In fact, I struggled a lot to get even decent results, so I decided to write this little how-to on getting the best plain-text version of the body text. I am only interested in the body text, which means that figures, footnotes, equations, headers, page numbers, … should not be included in the final result. The first section therefore explains some ways to get rid of text/equations/figures/… that do not belong to the body text, and the second section explains how to produce the plain text.

## Getting rid of unwanted text

The goal of this section is to remove all passages with text/figures/tables/etc. that do not belong to the body text.

### Pruning environments

First I want to get rid of the aforementioned footnotes, equations, etc. Since a lot of the stuff we want to get rid of is written in environments like \begin{} ... \end{}, the comment package is useful here. It can be put in the preamble like this:
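A minimal preamble sketch of this approach, assuming the environments to drop are figure, table and equation (adjust the list to whatever your document actually uses):

```latex
\usepackage{comment}
% Treat each listed environment as a comment, so its content is skipped.
% The three names below are an assumed selection, not a complete list.
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}
```

As far as I know, the comment package expects the \begin{...} and \end{...} of an excluded environment to each sit on a line of their own.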

The example removes the output defined in the given environments (figure, table, …) from the document.

Note: References can no longer be resolved, so many ?? will appear in the final output. But we are just interested in the body text, right?

### Pruning commands

Some commands (footnote, bibliography, …) produce extra text in the document that does not belong to the real body text. But again, if they are removed, referenced or cited entries will be lost and appear as ?? in the final output. The following example gives just a snapshot of commands one might want to prune. These depend heavily on the project and the document class, and might need to be altered. The approach is to get rid of the commands by simply renewing their definition with an empty one like {}. If the original command takes a number # of parameters, this number needs to be given in the redefinition in [].
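A minimal sketch of such redefinitions, assuming footnotes, the bibliography and the table of contents are what should vanish. The choice of commands is only an illustration and will differ per project:

```latex
% Ideally placed after all \usepackage lines, so that no package
% loaded later overrides the empty definitions again.

% \footnote takes an optional number plus the text:
% 2 parameters, the first one optional with an empty default
\renewcommand{\footnote}[2][]{}

% \bibliography takes the list of .bib files: 1 parameter
\renewcommand{\bibliography}[1]{}

% \tableofcontents takes no parameters
\renewcommand{\tableofcontents}{}
```

The number in the square brackets has to match the parameters of the original command. Otherwise the leftover arguments end up as ordinary text in the output.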

### Remove header, footer, page numbers, …

The pagestyle command can be applied as follows to remove the header and footer. It needs to be set in the preamble:
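A minimal sketch, assuming a standard class such as article, where the title page additionally needs its own reset:

```latex
% Preamble: no header, footer or page number on regular pages
\pagestyle{empty}

% Document body: \maketitle switches its own page to the plain style
% (with a page number), so that page has to be reset separately
\maketitle
\thispagestyle{empty}
```

In the standard report and book classes, the first page of each chapter may need the same \thispagestyle{empty} treatment.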

## Produce plain text output from tex

This section gives examples of how to convert a tex document to plain text.

### Flattening the project structure

I have experienced many problems with hierarchical project structures, where the main tex document is split into separate files that are included with commands like input, include or import. Thus, I’ve adapted a Python script called master2single.py to produce a completely flattened tex file in which all includes, etc. are resolved. It can be executed as follows:

Note 1: Since some compilers complain about weird comments, it is safer to remove all comments from the flattened output via the -c option.

Note 2: For me, the import command is the most convenient variant because it seems to have no limitations regarding the project’s folder depth.

### Solution 1: pandoc

pandoc is a great tool when it comes to markup language parsing and other text operations. However, pandoc does not really produce plain text from the tex document, but a markdown version of it. This is very nice if you want to post the whole document on a web page, but it ignores the exclusion of environments via the comment package. Thus, equations and tables, but not figures, are included in the output:

### Solution 2: latex2rtf (not really)

For very simple documents, latex2rtf, which comes with the standard latex installation on most OS flavours, might be a solution. Unfortunately, it cannot handle most of the simple latex commands and thus results in an error.

### Solution 3: Convert PDF to plain text (best output)

It sounds counter-intuitive, but first producing a PDF from the tex file and then converting the PDF to plain text seems to produce the most satisfying results. Possible programs are ebook-convert from Calibre or pdftotext. I’ve only used pdftotext, which produces a very satisfying output:

Note 1: Of course, the file does not need to be flattened for this solution.

Note 2: The -layout argument preserves the sections’ layout and thus produces even better results.

### Other solutions

There are probably a bunch of other solutions out there, but some of them are outdated or hard to handle, and others I haven’t found yet. Here is an incomplete list of other tools and explanations that might be helpful:

# Back from Brazil

We struggled and fought and reached 4th place at the RoboCup@Home competition. More posts are coming again from now on.

# RoboCup@Home in Magdeburg

I joined my research group quite a few months ago with the ambitious goal of developing multi-robot architectures for heterogeneous systems. In the picture you can see me with our BeBot mini-robots at the RoboCup in Magdeburg last week. We brought our devices and the household robot BIRON together for the first time to show some demonstrations on the topic “Smart Home”.

# Running ipython notebook on a remote (outdated) ubuntu system

Note: Jump to the end of this post to install a basic ipython-notebook environment.

ipython notebook is one of the most impressive things I’ve seen in the last few years. You can reach outstanding results which are documented and calculated on the same page. So no more paper war and mixed solutions for my problems ;).

After nmap-ing one of my servers, I found some ugly IPs trying to brute-force my SSH accounts.
After writing a script for my SSH daemon which logs bad IPs and adds them to hosts.deny, I thought about all the other daemons on my server.

Not willing to write fancy scripts for all the others, I installed the nice tool fail2ban, which adds bad IPs to a local ban list.
It works like a charm, and best of all: there is a nice frontend called bad IPs.
A nice visualization for every server owner 😉

# Redemption of the UnityMedia Technicolor tc7200

One of the biggest mistakes I have ever made was switching to UnityMedia (UM). Dazzled by the great offers, about a month ago I could not yet imagine what problems were coming my way. Besides DS-Lite, which is just about bearable, UM ships a modem/router that is utterly buggy, non-functional and questionable in terms of security.

# it’s OWL – Summerschool

I recently had the pleasure of a wonderful and very demanding week taking part in the “it’s OWL Summerschool”. The main themes were “Industrie 4.0”, “Cyberphysical Systems” and “Self-X”. Even though these topics still lie far in the future (> 2020), the OWL region is an important trendsetter and driving force for these new technologies.
Press release

# Matlab: Built-in vs. MEX vs. “naive” loops

I was recently asked for a minimal example of a MEX function in MATLAB, since I had bragged about them so much. For those who don’t know MEX: these are “MATLAB EXecutables”, which can be written in C/C++ or Fortran. Once they are compiled with the MEX compiler, they can be called straightforwardly from a Matlab script. Under certain circumstances, this leads to an enormous speed-up compared to an equivalent implementation in MATLAB.

As a minimal example, I chose the column-sum function, which sums up all row entries of a column. It is already implemented in Matlab and can be used as follows:

# LaTeX symbol classifier

I am writing my master’s thesis right now, and after a few non-everyday equations, I was tired of searching Google for the right LaTeX command to generate the proper math symbol.
But then I remembered a tool called “LaTeX symbol classifier”, whose link I had stored deep down in my bookmarks. With it, you just need to draw the symbol with your mouse, and you will be presented with the most probable LaTeX commands.

->LaTeX symbol classifier<-

This is a tool which shows IT at its best ;-). I’m wondering if they use HMMs for recognition like the Hanzi recognisers do.