The Missing Textutils

These are mostly bash or perl scripts that should be incorporated into all real OS distributions. Feel free to use them, to rewrite them, or to actually incorporate them into your particular favorite OS. (The license is below.)

The documentation of the tools is rather minimalistic, but this is on purpose. First, employ your imagination to fully exploit the power of the tools and of pipelines built from them. Second, read the code if in doubt about what they really do. (And try to write code that is self-explanatory and tools that are self-sufficient.)

antidocx (Id: antidocx,v 1.2 2010-11-02 16:07:41 bojar Exp )

'antidocx' extracts raw text from docx (MS Office 2007) documents. Rather a hack than a full solution.

autocat (Id: autocat,v 1.3 2014-01-28 18:29:23 bojar Exp )

'autocat' is a unified replacement for 'cat', 'zcat' and 'bzcat'. It can also cat files given in a filelist or a glob (wildcard expression), which is useful if the glob would expand to an excessive number of arguments in your shell.
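
A minimal usage sketch (file names are made up):

  # cat plain, gzipped and bzipped files in one go
  autocat notes.txt log.gz archive.bz2 > all.txt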

avg (Id: avg,v 1.5 2011-04-27 07:03:04 bojar Exp )

'avg' prints the input and adds one more line: the average and standard deviation of the numbers in each column.
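
A minimal sketch (made-up numbers; the exact layout of the appended summary line may differ):

  printf '1\t10\n2\t20\n3\t30\n' | avg
  # echoes the three lines, then appends one line with the average
  # (here 2 and 20) and standard deviation of each column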

bisect (Id: bisect,v 1.1 2012-05-17 09:48:26 bojar Exp )

Given a command and an input file, finds the subsection of the file where the command fails. Strategy:
  1. check if the whole file works => nothing to search for
  2. bisect and find the shortest 'head' of the file that has the problem
  3. (bisect and find the largest 'skip' of the given head that has the problem)

bisect --help

bisect COMMAND FILE

blockize (Id: blockize,v 1.4 2011-02-02 00:21:31 bojar Exp )

Given a tab-delimited file, 'blockize' returns blocks: every column on a separate line; the former lines are now delimited by blank lines.

See also 'deblockize' and consider deblockize | grep | blockize.

blockize --help

blockize <stdin >stdout
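
A tiny illustration of the transformation (assuming tab-delimited input):

  printf 'a\tb\nc\td\n' | blockize
  # a
  # b
  #
  # c
  # d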

blockwise (Id: blockwise,v 1.6 2006/09/08 16:45:03 bojar Exp )

Given a file of blocks (chunks of text delimited by a blank line or a specific delimiter), 'blockwise' runs an arbitrary command on each of the blocks separately.

'blockwise' can waste a lot of time if there are a lot of blocks in the input and/or the command takes long to execute!

blockwise --help

blockwise COMMAND <stdin >stdout
  --dots         ... show progress dots every loaded line
  --delim=NEW_DELIMITER
  --auto-prefix  ... use the first column of the first line in each block
                     as a prefix for the output of the block
  --deprefix ... the command will not see the first column
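
For example, to sort the lines within each blank-line-delimited block separately (a minimal sketch):

  blockwise 'sort' < blocks.txt > sorted_blocks.txt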

circumvent_blank_lines (Id: circumvent_blank_lines,v 1.1 2012-03-26 08:31:38 bojar Exp )

'circumvent_blank_lines' removes blank lines from stdin, saving their line numbers to the given auxiliary file. Later, use --reinsert=auxfile to reinsert the blank lines at appropriate places.

circumvent_blank_lines --help

fileparse(): need a valid pathname at /home/obo/tools/vimtext/circumvent_blank_lines line 133.
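
A hedged round-trip sketch (the positional auxfile argument is only inferred from the error message above, and 'blank_hating_tool' is a placeholder for any command that cannot cope with blank lines; verify against the source):

  circumvent_blank_lines aux.lines < input.txt \
    | blank_hating_tool \
    | circumvent_blank_lines --reinsert=aux.lines > output.txt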

colcheck (Id: colcheck,v 1.3 2005/10/10 07:33:48 bojar Exp )

'colcheck' checks whether all the given files have the same number of tab-delimited columns.

coldatasetscnt (Id: coldatasetscnt,v 1.3 2008/10/21 06:05:14 bojar Exp )

'coldatasetscnt' is like running "cut -fX | sort -u | wc -l" for every column X in the input.

coleval (Id: coleval,v 1.1 2012-05-22 17:42:32 bojar Exp )

'coleval' evals contents of the specified columns using Perl eval. Useful for numbers like 2/3.

coleval --help

Bad item --help at /home/obo/tools/vimtext/coleval line 51.

colgrep (Id: colgrep,v 1.6 2006/08/18 15:09:26 bojar Exp )

'colgrep' is just like grep (but using Perl regular expressions, not grep ones), but it checks only specified tab-delimited columns of the file.

colgrep --help

usage: colgrep 1,2,3 --re="re_to_lookup"
Options:
  --inverse   ... print lines where the RE was not found
  --skip=N    ... skip first N lines
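
For instance, to keep only the lines whose second column starts with 'en' (a minimal sketch):

  colgrep 2 --re='^en' < data.tsv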

colidx (Id: colidx,v 1.1 2005/11/04 10:27:14 bojar Exp )

'colidx' expects a string in ARG1. Then it reads the first line of its input/following arguments and scans the fields for the label. The column index containing the label is printed.

colidx --help

colidx label <stdin >stdout
Returns the column index of the column labelled with label.
Options:
  --trim   ... remove spaces 
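
A sketch combining colidx with cut to select a column by its header label ('score' is a made-up label):

  cut -f"$(colidx score < data.tsv)" data.tsv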

coltest (Id: coltest,v 1.4 2009-06-16 06:21:27 bojar Exp )

'coltest' is a generalization of colgrep. Arbitrary perl code is evaluated at each line. The code can refer to any of the columns using $1, $2, etc.

coltest --help

Unknown option: help

colwise (Id: colwise,v 1.4 2005/11/11 08:04:19 bojar Exp )

'colwise' splits the input vertically, separating the indicated column. An arbitrary command is then executed on the cut-out column. The output of the command is pasted back in the (vertical) middle of the file.

The output of the command is assumed to have the same number of lines as the input.

colwise --help

colwise COLUMN COMMAND <stdin >stdout
  --delim=NEW_DELIMITER
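
For example, to uppercase just the second column while leaving the rest of each line untouched (a minimal sketch):

  colwise 2 'tr a-z A-Z' < data.tsv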

contexts (Id: contexts,v 1.2 2010-11-02 16:07:41 bojar Exp )

'contexts' searches the stdin for the given RE and lists all contexts of it sorted by frequency.

contexts --help

Usage: contexts RE
  searches the stdin for the given RE and counts all contexts of it by frequency
Options:
--before=X   ... use before context of X chars (default:5)
--after=X    ... use after context of X chars (default:5)
--cont=X     ... use both before and after context of X chars
--contre=RE  ... use both before and after context matching the RE
--befre=RE   ... use before context matching the RE
--aftre=RE   ... use after context matching the RE
                 (the RE-contexts override the length contexts)
                 (the both-contexts override the single contexts)

deblockize (Id: deblockize,v 1.4 2010-11-02 17:06:04 bojar Exp )

Given a file of blocks (delimited by blank-lines), 'deblockize' returns a tab-delimited file. Each output line corresponds to one input block.

See also 'blockize' and consider deblockize | grep | blockize.

deblockize --help

deblockize <stdin >stdout
  --delim=''  ... a line matching this delimits blocks in input
  --field-delimiter='\t'  ... must not appear in input
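
The deblockize | grep | blockize idiom mentioned above, spelled out (a minimal sketch; 'Praha' is a made-up pattern):

  # keep only the blocks that contain 'Praha' anywhere
  deblockize < blocks.txt | grep Praha | blockize > matching.txt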

greplines (Id: greplines,v 1.11 2012-11-22 18:05:44 bojar Exp )

'greplines' selects the lines identified by their position (order) in file.

greplines --help

Unknown option: help

headline (Id: headline,v 1.8 2010-07-25 20:14:44 bojar Exp )

'headline "something"' is like '(echo "something"; cat)' but with some extra options. Useful in Makefiles, as it expands \t and \n.

headline --help

Unknown option: help

hoursum (Id: hoursum,v 1.5 2013-02-12 12:58:02 bojar Exp )

'hoursum' expects every input line to be in the format "any-prefix-followed-by-a-colon: hour.min-hour.min hour.min-hour.min"

All the given time intervals are summed and the total number of hours and minutes spent on the project is appended.
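
A small worked example (the exact wording of the appended total may differ):

  echo 'project-x: 9.00-12.30 13.00-17.00' | hoursum
  # the intervals are 3h30m and 4h00m, so a total of 7 hours
  # 30 minutes is appended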

html2tab (Id: html2tab,v 1.3 2009-11-16 21:07:52 bojar Exp )

'html2tab' dumps contents of HTML tables into nice tab-delimited plaintext. Requires HTML::TableContentParser.

immer (Id: immer,v 1.2 2005/10/10 07:33:49 bojar Exp )

Given key-value pairs, 'immer' prints out all the pairs where, for the given key, the value was always ('immer', in German) the same.

Output: key\tvalue\tcount

immer --help

immer  <stdin >stdout
Options:
  --srccol=X   ... use the given col. instead of the default col 1
  --desccol=X  ... use the given col. instead of the default col 2
  --trim       ... ignore whitespace at start/end
  --minobs=X   ... do not print, if a pair occurred less than X times

indent_brackets.pl (Id: indent_brackets.pl,v 1.1 2010-10-25 12:31:59 bojar Exp )

'indent_brackets.pl' processes stdin to stdout, indenting according to parentheses (). Useful e.g. for Collins/PennTreeBank-style trees or the Moses 'python lattice format'.

insert_every (Id: insert_every,v 1.1 2006/09/15 15:26:00 bojar Exp )

'insert_every' processes stdin to stdout, writing ARG1 after every n lines.

insert_every --help

insert_every text_to_insert <stdin >stdout
Options:
  --n=X   ... insert after n lines
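
For instance, to add a separator after every 2 lines (a minimal sketch):

  seq 10 | insert_every '----' --n=2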

insert_lines (Id: insert_lines,v 1.4 2010-11-02 17:04:03 bojar Exp )

'insert_lines' is a complement to 'grep -n'. Given a file of lines extracted using 'grep -n', it places the lines back to the rest at appropriate places.

If you are curious why this is useful, consider editing or processing the grepped lines separately.

insert_lines --help

Usage: insert_lines lines_to_implant < input > output
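
A sketch of the grep -n round trip (assumes the implanted file keeps the 'lineno:' prefixes produced by grep -n; file names are made up):

  grep -n '^TODO' notes.txt | sed 's/TODO/DONE/' > edited.lines
  grep -v '^TODO' notes.txt | insert_lines edited.lines > notes.new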

lcat (Id: lcat,v 1.8 2014-01-17 13:39:16 bojar Exp )

'lcat' is like 'cat' but the concatenated files are prefixed with the filename. The name stands for 'labelled cat'.

linemax (Id: linemax,v 1.3 2006/01/05 14:57:57 bojar Exp )

'linemax' processes all lines from stdin and calculates an aggregate function on all the columns.

list2tab (Id: list2tab,v 1.5 2009-05-28 23:15:49 bojar Exp )

'list2tab' builds a 2D table out of key-key-value triples.

list2tab --help

Sample usage:
  ./list2tab.pl 1,2 5,6 3,4 [default_value]  < datafile > tablefile
The output table will have lines labelled with values seen in columns 1 and 2,
columns labelled with values from columns 5,6 and the values in the interior of the table will come from columns 3,4.

Sample input:
GIN	Praha    	5
IOL	Praha    	20
GIN	Brno     	10
IOL	Nova Paka	2

Output produced by: "list2tab 2 1 3 none"

         	GIN 	IOL
Brno     	10  	none
Nova Paka	none	2
Praha    	5   	20

loopingpaste (Id: loopingpaste,v 1.3 2014-01-17 13:39:16 bojar Exp )

'loopingpaste' is similar to paste but shorter files are repeated so the final output is as long as the longest file supplied.

loopingpaste --help

loopingpaste file1 file2 file3 ...
Options:
  --breaks  ... add a blank line whenever any of the input files is restarted

makeprefix (Id: makeprefix,v 1.4 2006/12/21 07:15:40 bojar Exp )

'makeprefix' will scan for a given regular expression and use it as a prefix for all the following lines until another match is found. Use () to mark a subpart of the RE.

Use --head for the default RE for parsing 'head *' output.

makeprefix --help

usage: makeprefix <regular_expression>
Options:
  --head ... Ignore the arg1 and use '^==> (.*) <==$', useful for parsing
             the output of the 'head' command
  --ignore-blank    ... ignore blank lines in input
  --delim=DELIMITER ... use this as the separator between the prefix line and
                        the data lines. Default: TAB
  --keep ... don't delete the line where the prefix was found
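
For example, to turn 'head' output into lines prefixed with the file they came from (a minimal sketch):

  head -3 *.log | makeprefix --head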

map (Id: map,v 1.17 2013-06-17 19:03:46 bojar Exp )

'map' processes stdin to stdout, changing specified columns or expressions according to a mapping given in a file or on the command line.

map --help

map mapping_file <stdin >stdout
Options:
  --srccol=X   ... source of the mapping, a column number of the mapping file
               ... default is the first column
  --tgtcol=X   ... destination of the mapping, a column number of the m. file
               ... default is the second column (deprecated name: --destcol=X)
  --mapcols=X,Y... columns of stdin to be altered, default: all
  --pattern=RE ... map all occurrences of the given pattern
               ... default is to map exactly the given col
  --trim       ... strip whitespace from data before mapping
  --default=S  ... use this value, if the mapping doesn't define anything
  --quiet      ... no warnings to stderr
  --restrict   ... suppress the whole input line, if something was not mapped!
  --map=PERLARRAY ... instead of mapping_file, one can specify the mapping
                   as perl array, such as: '"green"=>1,"red"=>2'
  --utf8       ... set binmode of all streams to utf-8
Limitations: Pattern can never contain <TAB>. Mapping file is all read.
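
A minimal sketch using the inline --map form from the options above (column number and values are made up):

  # replace 'green'/'red' in column 3 with 1/2, 'unknown' otherwise
  map --map='"green"=>1,"red"=>2' --mapcols=3 --default=unknown < data.tsv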

nfold (Id: nfold,v 1.8 2006/01/28 17:08:34 bojar Exp )

'nfold' performs n-fold cross launch of the supplied script on the given data.

All input lines are loaded, shuffled and n-times split into test and training data lines. (The size of the test data is 1/n-th, all the rest is used as training.) The command is launched n times with %test replaced by a temporary file containing the test data and %train replaced with the temporary training datafile.

nfold --help

usage: nfold 'command %train %test'
  --n=10      ... number of folds to perform
  --pivot=i   ... use the i-th column of input to split dataset instead of 
                  random splitting to N folds
  --limit=N   ... use only N (random) lines of input for the cross-validation
  --maxfolds=i... prepare N folds but evaluate using only the first i of them
  --testsize=N... set the number of chunks so that each contains N elements
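
A hedged sketch of 10-fold cross-validation with a made-up experiment script:

  nfold './run_experiment.sh %train %test' --n=10 < dataset.txt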

ngrams (Id: ngrams,v 1.3 2010-11-02 16:07:41 bojar Exp )

'ngrams' reads tokenized plain text and returns n-grams of the given order.

numerize_blocks (Id: numerize_blocks,v 1.5 2013-06-17 19:04:41 bojar Exp )

'numerize_blocks' adds a tab-delimited prefix to each block (blocks are delimited by a blank line) such that each block gets a distinct number.

Useful in combination with 'blockwise --auto-prefix' or 'grp'.

numerize_cols (Id: numerize_cols,v 1.5 2012-01-23 08:36:16 bojar Exp )

'numerize_cols' replaces each column with a number, trying to preserve the original number of characters.

This tool is useful for all those who are bad at counting columns. Use 'head -3 input | numerize_cols 1' to quickly learn the numbers of the columns (starting from 1).

numsort (Id: numsort,v 1.16 2013-09-21 12:46:14 bojar Exp )

'numsort' sorts stdin properly based on numeric values (float supported, unlike in vanilla 'sort'), alphabetic values or even frequency.

numsort --help

numsort sorting-request <stdin >stdout
  --utf8  ... the input is UTF-8
  --delim=NEW_DELIMITER
  --skip=<number_of_lines_to_copy_without_sorting>
  --order=sorting-request  ... useful if sorting request starts with -
Example of sorting requests:
  1        ... sort numerically ascending by the value of column 1
  s1       ... sort numerically and interpret [kK][mM][gG][tT] as 1000
  g1       ... sort numerically and interpret [kK][mM][gG][tT] as 1024
  a-2      ... sort alphabetically descending by the value of column 2
  d2       ... like '1' but skip the value up to the first (minus followed by a) digit
  f-1,a1   ... sort by descending frequency of value of column 1 and then
               alphabetically by the same column. For instance if used on
               a phone book, all Smiths come at the beginning but will come
               after Simeons in the unlikely case that your phone book contained
               the same number of Simeons as Smiths.
It is possible to use 'n' instead of '-'.
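
For instance, to sort numerically descending by the third column (a minimal sketch; --order avoids the leading '-' being taken for an option):

  numsort --order=-3 < data.tsv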

parfiles (Id: parfiles,v 1.3 2006/09/08 10:15:59 bojar Exp )

'parfiles' builds a table of file pathnames aligned by a common substring in the filename. The first argument is a regexp scanning for the identifier. All the following arguments are understood as 'globs', i.e. wildcard expressions, each denoting a set of files. The wildcards are expanded, all pathnames are scanned for the identifier and all the files are aligned.

The output contains all the identifiers in the first column and the following columns are devoted to the files selected by the respective arguments.

parfiles --help

parfiles regexp glob-or-filelist-1 glob-or-fileslist-2 ...
Options:
--matching ... ignore files that do not have all the corresponding files

parshuffle (Id: parshuffle,v 1.2 2009-04-02 11:58:45 bojar Exp )

'parshuffle' shuffles lines in all input files simultaneously. All the input files thus have to have the same number of lines.

parshuffle --help

Unknown option: help
usage! at /home/obo/tools/vimtext/parshuffle line 18.

pastefiles (Id: pastefiles,v 1.2 2006/10/17 14:08:54 bojar Exp )

'pastefiles' reads files specified as commandline arguments and then replaces all "#include LABEL" lines in stdin with the file content.

pastefiles --help

pastefiles --label=filename --label2=filename < input > output
Options:
  --help ... this message
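
A minimal sketch ('body' is a made-up label; template.txt contains a line '#include body'):

  pastefiles --body=chapter1.txt < template.txt > assembled.txt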

pastetags (Id: pastetags,v 1.3 2009/01/02 12:23:13 bojar Exp )

'pastetags' is the complement of picktags. It replaces the contents of a tag by the input. Please note that the script is rather picky and works well only on texts originally generated by picktags. Avoid touching tab and newline characters!

You'd probably find LT XML tools much more versatile: http://www.ltg.ed.ac.uk/software/xml/index.html

pastetags --help

usage: pastetags matrixfile.xml "MMt.*?" ... < values-to-paste

outputs tab-separated file of *first* values of matching tags.
Tags assumed *non*pair.
Beware using greedy *! It would eat up also the end of the tag.

picklog (Id: picklog,v 1.6 2009-05-28 23:15:49 bojar Exp )

'picklog' reads stdin to extract useful snippets of information into a nice table.

Usage: picklog cmd1 cmd2 ... < input

Allowed commands:
  RE               ... same as 'find RE'
  find RE          ... scan till the line where RE is found
  pick RE(what)    ... pick something from the current line or the first next
                       matching line
  next             ... advance to the next line of input
  nl               ... add a newline to the output
  let:VARNAME RE(what)   ... scan for RE and store it in internal variable
                             VARNAME
  watch:VARNAME RE(what) ... simultaneously search for RE and store it in
                             internal variable VARNAME, whenever found
  count:VARNAME RE       ... like watch but store just the number of lines that
                             matched RE so far
  print:VARNAME          ... print internal variable VARNAME

A 'nl' is always added at the end of the commands.

The commands are looped, as long as there are some input lines to read.

The variables are useful for finding the last something before something else or for swapping columns in the output.
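
A hedged sketch for a made-up log format (each record has a 'Starting' line, followed somewhere by a BLEU score to pick):

  picklog 'find Starting' 'pick BLEU = ([0-9.]+)' < experiment.log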

picklog --help


pickre (Id: pickre,v 1.7 2011-11-24 16:55:21 bojar Exp )

'pickre' is extremely useful to collect specific information from every line. The $1 output of a given regexp is prepended (tab-delimited) at the beginning of every line.

pickre --help

Unknown option: help

picktags (Id: picktags,v 1.5 2008/12/12 10:39:46 bojar Exp )

'picktags' is a lazy hack for lazy programmers. It extracts values from specified SGML tags, without actually checking anything about the SGML.

You'd probably find LT XML tools much more versatile: http://www.ltg.ed.ac.uk/software/xml/index.html

picktags --help

usage: picktags "MMt.*?" ...

outputs tab-separated file of *first* values of matching tags.
Tags assumed *non*pair.
Beware using greedy *! It would eat up also the end of the tag.

prefix (Id: prefix,v 1.5 2006/08/04 22:15:00 bojar Exp )

'prefix' prepends every line with ARG1.

prefix --help

Unknown option: help

quantize (Id: quantize,v 1.5 2013-07-25 23:19:18 bojar Exp )

'quantize' is a first step towards a histogram. It replaces every value in the specified column with the label of the "box" where the value fits.

quantize --help

usage: quantize colindex boxesdesc < infile > outfile
Options:  
--skip=N    ... dump first N lines without quantizing
--min=X     ... make the label of the first box start at X
--max=X     ... make the label of the last box to end at X
--seq=A,STEP,B  ... set boxes delimiters to start at A and stop at B, stepping
                    by STEP. Eg.: --seq=1,3,20 is equivalent to specifying
                    boxesdesc as: 1,4,7,10,13,16,19,20
--seqprec=P ... the automatic boxes should be labelled with the specified
                precision, e.g.:
                --seq=0,0.1,1 --seqprec=1  produces 0.1,0.2,...0.9,1.0
--discrete  ... the values are discrete, so the boxes should be labelled
                1 - 10, 11 - 20, 21 - 30 and not 1 - 10, 10 - 20, 20 - 30
--leftist   ... the values are discrete, so the boxes should be labelled
                0 - 9, 10 - 19, 20 - 29 and not 1 - 10, 10 - 20, 20 - 30
--method=even|histogram  ... automatically guess box boundaries
--boxes=N   ... number of boxes to use when automatically guessing
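
For instance, to replace the values in column 2 with the label of their box, using the box delimiters 0, 10, 20 and 50 (a minimal sketch; the boxesdesc format follows the --seq explanation above):

  quantize 2 0,10,20,50 < scores.txt > quantized.txt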

recut (Id: recut,v 1.6 2013-06-17 19:04:41 bojar Exp )

'recut' is a simple 'cut', but unlike 'cut' it allows repetitions and reordering of columns.

recut --help

Bad item --help at /home/obo/tools/vimtext/recut line 56.

remove_blank_lines (Id: remove_blank_lines,v 1.2 2005/10/10 07:33:50 bojar Exp )

'remove_blank_lines' does just what you expect and nothing more. Easier to write than "grep -v '^$'".

remove_singleton_lines (Id: remove_singleton_lines,v 1.1 2013-04-10 07:56:52 bojar Exp )

Given a column number and input *sorted by that column*, 'remove_singleton_lines' removes all "blocks" of lines that appear just once.

revfields (Id: revfields,v 1.2 2005/10/10 07:33:50 bojar Exp )

'revfields' reverses the order of fields on every line.

To sort files by their extensions (suffixes), you might use "ls -1 | revfields --delim=. | sort | revfields --delim=."

To cut the last column of a file, use "revfields | cut -f1".

round (Id: round,v 1.6 2010-10-31 19:14:53 bojar Exp )

'round' rounds all the numbers it can find to a specified precision (given as ARG1).

If prec < 10, rounds to 'prec' decimal places; if prec >= 10, rounds to whole "precs" (such as whole tens, thousands...).

sample_nth (Id: sample_nth,v 1.2 2007/05/18 08:25:29 bojar Exp )

'sample_nth' reads stdin (loads the whole input!) and produces FILECOUNT files, each with N lines, taking evenly selected lines from the input. The remaining lines are printed to stdout. Use '%i' in outname as the placeholder for filecount.

sample_nth --help

sample_nth N OUTFILENAME < input > remaining_lines
...reads stdin (loads the whole input!) and produces FILECOUNT files each with N lines
taking evenly selected lines from the input. The remaining lines are printed to
stdout.
Use '%i' in outname as the placeholder for filecount

Options:
  --files=FILECOUNT ... default: 1

see (Id: see,v 1.2 2005/10/10 07:33:50 bojar Exp )

'see' is nearly equivalent to "sort | uniq -c". Nearly, because some 'uniq's tend to use space instead of tab as the delimiter.
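
Combined with numsort, this gives a quick frequency table (a minimal sketch assuming one item per line):

  tr ' ' '\n' < text.txt | see | numsort --order=-1   # most frequent first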

seqcheck (Id: seqcheck,v 1.3 2006/04/11 12:54:04 bojar Exp )

'seqcheck' checks if the stdin contains an uninterrupted sequence of increasing integers.

setcompare (Id: setcompare,v 1.2 2013-10-28 22:52:51 bojar Exp )

Given any number of files as args, 'setcompare' analyzes the intersections of lines.

shuffle (Id: shuffle,v 1.2 2005/10/10 07:33:50 bojar Exp )

'shuffle' is the missing complement to 'sort'. Correctly shuffles the input lines.

skip (Id: skip,v 1.4 2007/12/18 03:06:17 bojar Exp )

'skip' is the missing complement to 'head'. Moreover, it can also redirect the skipped lines to a file.

skip --help

skip <number_of_lines>
Skips the specified number of lines and 'cat's the rest.
Options:
  --save=filename  ... save the skipped lines here

skipbetween (Id: skipbetween,v 1.9 2011-04-29 13:32:43 bojar Exp )

'skipbetween' removes specified section(s) of stdin. The sections are identified by a beginning regular expression and ending regular expression. As an option, a specified file can be inserted at the place of the removed section.

Alternatively, skipbetween can be used to select only the marked sections.

skipbetween --help

Usage: skipbetween from_RE to_RE
...will skip all lines from stdin that are between lines matching
   from_RE and to_RE (inclusive)
...can skip several such blocks
--inverse   print only the lines between.
--until     stop (and exit) right after the first found string
--insert=filename ... replace the skipped section with the contents of the file
                      (not really compatible with --inverse)
--exclude-markers ... useful with --inverse
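
For example, to extract just the sections between BEGIN and END markers, dropping the markers themselves (a minimal sketch):

  skipbetween '^BEGIN' '^END' --inverse --exclude-markers < doc.txt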

skiplinecmd (Id: skiplinecmd,v 1.4 2005/10/10 07:33:50 bojar Exp )

'skiplinecmd' is used to launch a specified command in a pipeline but circumvent it for the first --skip=X lines. Useful for all the commands that operate on a pipe but do not support --skip=X themselves.

skipseen (Id: skipseen,v 1.2 2008/01/25 12:50:54 bojar Exp )

'skipseen' is like uniq but the lines do not need to immediately follow each other to be deleted. In other words: only the first occurrence of any line is printed.

solve_first (Id: solve_first,v 1.11 2007/03/23 06:08:36 bojar Exp )

Given a regexp, all the matching lines are put in front of the file and all the rest is put below. It's like running 'grep' and 'grep -v'. Optionally a blank line is inserted to separate the blocks.

This tool is extremely handy when manually editing any text-based database.

solve_first --help

usage: solve_first <regular_expression>
options:
  --sort ... the top lines are sorted by $1 of the reg. expression
  --delim ... add a blank line between the two parts
  --col=i ... check the reg. expression only against the given column
  --inverse ... put above if it does not match
  --blockwise ... match whole blocks of input instead of lines
  --just-matching ... do not print nonmatching blocks/lines
  --insens ... case insensitive
  --skip=<number_of_lines_to_blindly_copy>
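
For instance, to float all TODO lines to the top of a notes file, separated from the rest by a blank line (a minimal sketch):

  solve_first 'TODO' --delim < notes.txt > sorted_notes.txt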

sparse_to_c4.5 (Id: sparse_to_c4.5,v 1.4 2007/09/25 04:45:43 bojar Exp )

'sparse_to_c4.5' converts a sparse matrix representation (from stdin) to input suitable for c4.5. The output is stored in the files ARG1.data and ARG1.names.

sparse_to_c4.5 --help

usage: sparse_to_c4.5 baseoutputfilename
Converts sparse matrix input into data suitable for c4.5
Input line sample:
  answer     group1/var1:value1    group2/var3:valueB
Options:
--test=filename  ... build unseen test dataset
--coldelim=str   ... the delimiter between columns/items on each line
--outdelim=str   ... the delimiter to use in output.data file
--valdelim=str   ... the delimiter between varname and value
--groupdelim=str   ... the delimiter between groupname and varname
--use=group1 --use=group2 ... list of group names of attributes to be used
--usemore=group1,group2  ... same as --use=group1 --use=group2
--nondirected   ... the first column is no special 'answer' attribute
--defvalue=str  ... the default value if there is no valdelim found in an item
--blankvalue=str  ... the value if the column is not mentioned in a line
--ignore-duplicit-values  ... keep silent if more (equal) values are assigned
                              to a column

split_at_colchange (Id: split_at_colchange,v 1.2 2006/06/16 08:36:05 bojar Exp )

Given a column number, 'split_at_colchange' adds an empty line whenever the value in the column changes. Useful as a preprocessing before 'blockwise'.

split_even (Id: split_even,v 1.6 2008/10/29 14:58:35 bojar Exp )

'split_even' reads stdin saving the lines to N output files so that each of the files will contain about the same number of lines.

split_even --help

Unknown option: help

split_to_files (Id: split_to_files,v 1.1 2009-09-20 14:16:42 bojar Exp )

Given a column number, 'split_to_files' uses the value of the column to write each line into a separate file per value. Note that we accumulate open files until the end of the input stream, so we may easily run out of file descriptors.

split_train_test (Id: split_train_test,v 1.1 2006/06/16 08:36:05 bojar Exp )

'split_train_test' reads all input lines, shuffles them and then produces a training file and a test file.

split_train_test --help

Unknown option: help
split_train_test outtrainingfile outtestfile < lines
Options:
  --limit=N  ... shuffle all lines, but use only first N
  --parts=N  ... use 1/N of lines as the test data

suffix (Id: suffix,v 1.5 2009-04-02 11:58:45 bojar Exp )

'suffix' appends ARG1 at the end of every line.

suffix --help

Unknown option: help
usage: suffix <prefix>  < infile   > outfile at /home/obo/tools/vimtext/suffix line 17.

tab2list (Id: tab2list,v 1.4 2009-04-02 11:58:45 bojar Exp )

'tab2list' is a complement to list2tab. It reads a 2-dimensional table and produces a list in the form: row label - column label - value.

Think about using tab2list on two separate tables, concatenating the lists and then formatting them into a single table with list2tab (see the sketch below).

tab2list --help

usage: tab2list < infile > outfile
  Options:
    --keycols=X  ... How many cols from left are to be kept (default: 1)
    --no-headline ... The first line is content line already, there are no
                      column names.
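
The two-table merge suggested above, spelled out (a minimal sketch; both tables are plain tab-delimited with row and column labels):

  (tab2list < table1.txt; tab2list < table2.txt) \
    | list2tab 1 2 3 n/a > merged.txt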

tdf_to_c4.5 (Id: tdf_to_c4.5,v 1.6 2006/05/28 20:19:51 bojar Exp )

'tdf_to_c4.5' converts a tab-delimited file (from stdin) to input suitable for c4.5. The output is stored in the files ARG1.data and ARG1.names.

tdf_to_c4.5 --help

usage: tdf_to_c4.5 baseoutputfilename
Converts tabbed input into data suitable for c4.5
Options:
--test=filename  ... build unseen test dataset
--baseline  ... for the independent data set estimate the baseline (assign most
                frequent) and oracle (if only never seen data are wrong) rates

tdf_to_xml (Id: tdf_to_xml,v 1.7 2013-09-21 12:46:51 bojar Exp )

'tdf_to_xml' converts stdin to stdout, wrapping tab-delimited data with HTML/XML tags: <table><tr num="rowindex"><td num="colindex">...

tdf_to_xml --help

tdf_to_html < tab-delimited > table.xml
Options:
  --border=i  ... border width
  --noescape  ... do not escape <>&
  --format=html|docbook  ... tags are named table/tr/td or tbody/row/entry

tpr_tnr (Id: tpr_tnr,v 1.1 2006/04/09 22:40:16 bojar Exp )

'tpr_tnr' reads a column in the input, calculating the true positive rate and true negative rate at various cut-offs. The column is expected to contain 1 in case of a positive example and 0 in case of a negative example. In a perfectly ordered input, all positives come before all negatives. Use this script on imperfectly sorted files to examine the accuracy of the sorting compared to the golden positive-negative dichotomy.

tpr_tnr --help

usage: tpr_tnr <col_index>  < infile   > outfile at /home/obo/tools/vimtext/tpr_tnr line 26.

transpose (Id: transpose,v 1.2 2005/10/10 07:33:50 bojar Exp )

'transpose' swaps rows and columns in the given tab-delimited table.

tt (Id: tt,v 1.10 2006/12/04 04:44:01 bojar Exp )

'tt' pads a tab-delimited file with spaces so that it looks nice.

Please read the source code for more options.

tuples (Id: tuples,v 1.2 2005/10/10 07:33:50 bojar Exp )

'tuples' makes tuples: every n consecutive lines are joined on one line, delimited with a tab.

ua (Id: ua,v 1.5 2014-01-07 09:56:01 bojar Exp )

'ua' strips off all accents over all Latin letters. The name stands for 'unicode to ascii'.

unlcat (Id: unlcat,v 1.3 2010-10-27 15:03:14 bojar Exp )

'unlcat' reverses what lcat does. It reads a column of the file (the first by default) and appends each line to a file with the name specified in that column.

unlcat --help

unlcat < input 
Produces a separate file for all values seen in column 1.
Options:
  --col=N   ... use column N instead of 1
  --changename=suffix | prefix*suffix   ... modify the column value before
                                            creating the file

unsee (Id: unsee,v 1.3 2006/03/04 09:04:01 bojar Exp )

'unsee' reverses the output of see, i.e. it produces each line as many times as the number written in the first column.

unwrap (Id: unwrap,v 1.1 2012-10-07 23:12:02 bojar Exp )

'unwrap' concatenates all lines, inserting a space in between. Only blank lines are preserved.

unziplines (Id: unziplines,v 1.1 2007/08/02 07:05:16 bojar Exp )

'unziplines' reads stdin and produces FILECOUNT files with lines "unzipped". If you combine all output files in the original order using 'ziplines', you get the original input stream. Use '%i' in outname as the placeholder for filecount.

unziplines --help

unziplines N OUTFILENAME < input > remaining_lines
'unziplines' reads stdin and produces FILECOUNT files with lines 'unzipped'. If
you combine all output files in the original order using 'ziplines', you get
the original input stream.
Use '%i' in outname as the placeholder for filecount.

update_cols (Id: update_cols,v 1.5 2005/10/10 07:33:50 bojar Exp )

'update_cols' updates stdin and produces stdout according to the given update file. All files are expected to be tab-delimited.

update_cols --help

update_cols update_file <stdin >stdout
Options:
  --paste= ... comma delimited list of indices of cols in the pasted (update)
               file, with respect to the columns in the main file.
               Use 0 to ignore the column.
  --keys=  ... comma delimited set of col indices from pasted that serve as keys
               to decide whether the update is to be done on the current line
  --trim      ... strip whitespace from data before mapping
Limitations: Update file is read to the memory.
Example:
  --keys=1 --paste=0,3,4
  If the value of the column 1 in main input is equal to a value in the column 1
  of a line in the update file, then the line of the update file is used as
  follows: the first column is not used, the second column is used to replace
  the column 3 and the third column is used to replace the column 4 of the input
  line.

weakly_correlated (Id: weakly_correlated,v 1.2 2006/04/14 10:48:50 bojar Exp )

'weakly_correlated Xcol Ycol OutCol < input > output' replaces OutCol in the given data with a numeric value expressing the distance of the point x,y (from Xcol and Ycol) from the diagonal. The diagonal is the diagonal of the smallest rectangle containing all the data.

weakly_correlated --help

Unknown option: help
usage: weakly_correlated Xcol Ycol OutCol < input > output at /home/obo/tools/vimtext/weakly_correlated line 24.

width (Id: width,v 1.5 2009-08-17 11:08:29 bojar Exp )

'width' returns the maximum number of characters in one line of stdin/args. Tabs are counted as 1 char!

xml_to_tdf (Id: xml_to_tdf,v 1.3 2010-11-02 16:07:41 bojar Exp )

'xml_to_tdf' converts <table></table> elements to a plain tab-delimited text file. If several <table> elements are found, a blank line is put in between. Of course, neither nested tables nor cell spans are supported. Tabs in the input are replaced with spaces. All XML tags are deleted. Basic XML entities are expanded.

ziplines (Id: ziplines,v 1.10 2013-06-17 19:04:41 bojar Exp )

Given any number of files as args, 'ziplines' produces a file where first comes a line from the first file, then a line from the second, etc. It produces blank lines if any of the files is shorter, so the output is as long as the longest file.

zwc (Id: zwc,v 1.2 2012-01-23 08:36:16 bojar Exp )

'zwc' is a word counting utility (wc) for normal and compressed files. It can also count words in each column separately, group the counts by a column value (but always within each file), and give a quick estimate based on the beginning of the file.

License

The tools available on this page are distributed under any license at your option if the following conditions are met:

Please keep a reference to the original author (i.e. me, Ondrej Bojar) in any derivative work. If reasonable, make a web-page link here.

