These are mostly bash or perl scripts that should be incorporated into all real OS distributions. Feel free to use them, feel free to rewrite them, feel free to actually incorporate them to your particular favorite OS. (The license is below.)
The documentation of the tools is rather minimalistic but this is on purpose. First, employ your imagination on how to exploit the power of the tools and pipelines of them fully. Second, read the code if in doubt what they really do. (And try to write code that is self-explanatory and tools that are self-sufficient.)
'antidocx' extracts raw text from docx (MS Office 2007) documents. Rather a hack than a full solution.
'autocat' is a unified replacement for 'cat', 'zcat' and 'bzcat'. It can also cat files given in a filelist or a glob (wildcard expression), which is useful if the glob would expand to an excessive number of arguments in your shell.
'avg' prints the input and adds one more line: the average and standard deviation of the numbers in each column.
Given a tab delimited file, 'blockize' returns blocks - every column on a separate line, former lines are now delimited by a blank line.
See also 'deblockize' and consider deblockize | grep | blockize.
blockize --help
    blockize <stdin >stdout
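For the common case, the same reshaping can be sketched with plain awk (a rough stand-in, not the actual tool):

```shell
# Print each tab-separated field on its own line; blank line between records.
printf 'a\tb\nc\td\n' |
  awk -F'\t' '{ for (i = 1; i <= NF; i++) print $i; print "" }'
```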
Given a file of blocks (chunks of text delimited by a blank line or a specific delimiter), 'blockwise' runs an arbitrary command on each of the blocks separately.
'blockwise' can waste a lot of time if there are a lot of blocks in the input and/or the command takes long to execute!
blockwise --help
    blockwise COMMAND <stdin >stdout
      --dots ... show progress dots every loaded line
      --delim=NEW_DELIMITER
      --auto-prefix ... use the first column of the first line in each block
                        as a prefix for the output of the block
      --deprefix ... the command will not see the first column
'colcheck' checks whether all the given files have the same number of tab-delimited columns.
'coldatasetscnt' is like running "cut -fX | sort -u | wc -l" for every column X in the input.
'colgrep' is just like grep (but using Perl regular expressions, not grep ones), but it checks only specified tab-delimited columns of the file.
colgrep --help
    usage: colgrep 1,2,3 --re="re_to_lookup"
    Options:
      --inverse ... print lines where the RE was not found
      --skip=N ... skip first N lines
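The basic idea, restricting a match to one column, is what plain awk does with `$N ~ /RE/` (awk regular expressions, not the Perl ones colgrep uses):

```shell
# Keep only lines whose second tab-delimited column matches "foo".
printf 'x\tfoo\ny\tbar\n' | awk -F'\t' '$2 ~ /foo/'
```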
'colidx' expects a string in ARG1. Then it reads the first line of its input/following arguments and scans the fields for the label. The column index containing the label is printed.
colidx --help
    colidx label <stdin >stdout
    Returns the column index of the column labelled with label.
    Options:
      --trim ... remove spaces
'coltest' is a generalization of colgrep. Arbitrary perl code is evaluated at each line. The code can refer to any of the columns using $1, $2, etc.
coltest --help
    Unknown option: help
'colwise' splits the input vertically, separating out the indicated column. An arbitrary command is then executed on the cut-out column. The output of the command is pasted back in the (vertical) middle of the file.
The output of the command is assumed to have the same number of lines as the input.
colwise --help
    colwise COLUMN COMMAND <stdin >stdout
      --delim=NEW_DELIMITER
'contexts' searches the stdin for the given RE and lists all contexts of it sorted by frequency.
contexts --help
    Usage: contexts RE
    searches the stdin for the given RE and counts all contexts of it by frequency
    Options:
      --before=X ... use before context of X chars (default: 5)
      --after=X ... use after context of X chars (default: 5)
      --cont=X ... use both before and after context of X chars
      --contre=RE ... use both before and after context matching the RE
      --befre=RE ... use before context matching the RE
      --aftre=RE ... use after context matching the RE
    (the RE-contexts override the length contexts)
    (the both-contexts override the single contexts)
Given a file of blocks (delimited by blank-lines), 'deblockize' returns a tab-delimited file. Each output line corresponds to one input block.
See also 'blockize' and consider deblockize | grep | blockize.
deblockize --help
    deblockize <stdin >stdout
      --delim='' ... a line matching this delimits blocks in input
      --field-delimiter='\t' ... must not appear in input
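The inverse transformation can also be sketched with awk's paragraph mode (a rough stand-in for the real script):

```shell
# Join each blank-line-delimited block into one tab-delimited line.
# RS="" makes awk read whole blocks; $1=$1 forces a rebuild with OFS.
printf 'a\nb\n\nc\nd\n' |
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = "\t" } { $1 = $1; print }'
```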
'greplines' selects the lines identified by their position (order) in file.
greplines --help
    Unknown option: help
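Lacking the real usage message, the idea can be sketched with awk, selecting lines by their 1-based position:

```shell
# Print only lines 2 and 4 of the input.
printf 'one\ntwo\nthree\nfour\n' | awk 'NR == 2 || NR == 4'
```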
'headline "something"' is like '(echo "something"; cat)' but with some extra options. Useful in Makefiles, as it expands \t and \n.
headline --help
    Unknown option: help
'hoursum' expects every input line to be in the format "any-prefix-followed-by-a-colon: hour.min-hour.min hour.min-hour.min"
All the given time intervals are summed and the total number of hours and minutes spent on the project is appended.
'html2tab' dumps the contents of HTML tables into nice tab-delimited plaintext. Requires HTML::TableContentParser.
Given key-value pairs, 'immer' prints out all the pairs where for the given key the value was always ('immer', in German) the same.
immer --help
    immer <stdin >stdout
    Options:
      --srccol=X ... use the given col. instead of the default col 1
      --desccol=X ... use the given col. instead of the default col 2
      --trim ... ignore whitespace at start/end
      --minobs=X ... do not print, if a pair occurred less than X times
'indent_brackets.pl' processes stdin to stdout, indenting according to (). Useful e.g. for Collins/PennTreeBank-style trees or Moses 'python lattice format'.
'insert_every' processes stdin to stdout, writing ARG1 after every n lines.
insert_every --help
    insert_every text_to_insert <stdin >stdout
    Options:
      --n=X ... insert after n lines
'insert_lines' is a complement to 'grep -n'. Given a file of lines extracted using 'grep -n', it places the lines back to the rest at appropriate places.
If you are curious why this is useful, consider editing or processing the grepped lines separately.
insert_lines --help
    Usage: insert_lines lines_to_implant < input > output
'lcat' is like 'cat' but the concatenated files are prefixed with the filename. The name stands for 'labelled cat'.
'linemax' processes all lines from stdin and calculates an aggregate function on all the columns.
'list2tab' builds a 2D table out of key-key-value triples.
list2tab --help
    Sample usage: ./list2tab.pl 1,2 5,6 3,4 [default_value] < datafile > tablefile
    The output table will have lines labelled with values seen in columns 1 and 2,
    columns labelled with values from columns 5,6 and the values in the interior
    of the table will come from columns 3,4.
    Sample input:
        GIN  Praha      5
        IOL  Praha      20
        GIN  Brno       10
        IOL  Nova Paka  2
    Output produced by: "list2tab 2 1 3 none"
                   GIN   IOL
        Brno       10    none
        Nova Paka  none  2
        Praha      5     20
'loopingpaste' is similar to paste but shorter files are repeated so the final output is as long as the longest file supplied.
loopingpaste --help
    loopingpaste file1 file2 file3 ...
    Options:
      --breaks ... add a blank line whenever any of the input files is restarted
'makeprefix' will scan for a given regular expression and use it as a prefix for all the following lines until another match is found. Use () to mark a subpart of the RE.
Use --head for the RE default for parsing 'head *' output.
makeprefix --help
    usage: makeprefix <regular_expression>
    Options:
      --head ... ignore the arg1 and use '^==> (.*) <==$', useful for parsing
                 the output of the 'head' command
      --ignore-blank ... ignore blank lines in input
      --delim=DELIMITER ... use this as the separator between the prefix line
                            and the data lines. Default: TAB
      --keep ... don't delete the line where the prefix was found
'map' processes stdin to stdout, changing specified columns or expressions according to a mapping given in a file or on the command line.
map --help
    map mapping_file <stdin >stdout
    Options:
      --srccol=X ... source of the mapping, a column number of the mapping file
                     ... default is the first column
      --tgtcol=X ... destination of the mapping, a column number of the m. file
                     ... default is the second column (deprecated name: --destcol=X)
      --mapcols=X,Y ... columns of stdin to be altered, default: all
      --pattern=RE ... map all occurrences of the given pattern
                       ... default is to map exactly the given col
      --trim ... strip whitespace from data before mapping
      --default=S ... use this value, if the mapping doesn't define anything
      --quiet ... no warnings to stderr
      --restrict ... suppress the whole input line, if something was not mapped!
      --map=PERLARRAY ... instead of mapping_file, one can specify the mapping
                          as a perl array, such as: '"green"=>1,"red"=>2'
    Limitations:
      Pattern can never contain <TAB>.
      Mapping file is read whole into memory.
'nfold' performs n-fold cross launch of the supplied script on the given data.
All input lines are loaded, shuffled and n-times split into test and training data lines. (The size of the test data is 1/n-th, all the rest is used as training.) The command is launched n times with %test replaced by a temporary file containing the test data and %train replaced with the temporary training datafile.
nfold --help
    usage: nfold 'command %train %test'
      --n=10 ... number of folds to perform
      --pivot=i ... use the i-th column of input to split the dataset instead
                    of random splitting to N folds
      --limit=N ... use only N (random) lines of input for the cross-validation
      --maxfolds=i ... prepare N folds but evaluate only using the first i of them
      --testsize=N ... set the number of chunks so that each chunk contains N elements
'ngrams' reads tokenized plain text and returns n-grams of the given order.
'numerize_blocks' adds a tab-delimited prefix to each block (blocks are delimited by a blank line) such that each block gets a distinct number.
Useful in combination with 'blockwise --auto-prefix' or 'grp'
'numerize_cols' replaces each column with a number, trying to preserve original number of characters.
This tool is useful for all those who are bad at counting columns. Use 'head -3 input | numerize_cols 1' to learn the number of the columns (starting from 1) quickly.
'numsort' sorts stdin properly based on numeric values (float supported, unlike in vanilla 'sort'), alphabetic values or even frequency.
numsort --help
    numsort sorting-request <stdin >stdout
      --utf8 ... the input is UTF-8
      --delim=NEW_DELIMITER
      --skip=<number_of_lines_to_copy_without_sorting>
      --order=sorting-request ... useful if the sorting request starts with -
    Examples of sorting requests:
      1 ... sort numerically ascending by the value of column 1
      a-2 ... sort alphabetically descending by the value of column 2
      d2 ... like 1 but skip value up to the first (minus followed by a) digit
      f-1,a1 ... sort by descending frequency of the value of column 1 and then
                 alphabetically by the same column. For instance if used on a
                 phone book, all Smiths come at the beginning, but will come
                 after Simeons in the unlikely case that your phone book
                 contains the same number of Simeons as Smiths.
    It is possible to use 'n' instead of '-'.
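For the plain float case, GNU/BSD sort's general-numeric mode covers part of this ground (a sketch; numsort's frequency and mixed requests have no one-liner equivalent):

```shell
# -g compares general numeric values, so decimals and 1e2-style notation
# sort correctly, unlike the default lexicographic order.
printf '10.5\n2.3\n1e2\n' | sort -g
```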
'parfiles' builds a table of filepathnames aligned by a common substring in the filename. The first argument is a regexp scanning for the identifier. All the following arguments are understood as 'globs', i.e. wildcard expressions each denoting a set of files. The wildcards are expanded, all pathnames are scanned for the identifier and all the files are aligned.
The output contains all the identifiers in the first column and the following columns are devoted to the files selected by the respective arguments.
parfiles --help
    parfiles regexp glob-or-filelist-1 glob-or-filelist-2 ...
    Options:
      --matching ... ignore files that do not have all the corresponding files
'parshuffle' shuffles lines in all input files simultaneously. All the input files thus have to share the number of lines.
parshuffle --help
    Unknown option: help
    usage! at /home/obo/tools/vimtext/parshuffle line 18.
'pastefiles' reads files specified as commandline arguments and then replaces all "#include LABEL" lines in stdin with the file content.
pastefiles --help
    pastefiles --label=filename --label2=filename < input > output
    Options:
      --help ... this message
'pastetags' is the complement of picktags. It replaces the contents of a tag by the input. Please note that the script is rather picky and works well only on texts originally generated by picktags. Avoid touching tab and newline characters!
You'd probably find LT XML tools much more versatile: http://www.ltg.ed.ac.uk/software/xml/index.html
pastetags --help
    usage: pastetags matrixfile.xml "MMt.*?" ... < values-to-paste
    outputs tab-separated file of *first* values of matching tags.
    Tags assumed *non*pair.
    Beware using greedy *! It would eat up also the end of the tag.
'picklog' reads stdin to extract useful snippets of information into a nice table.
Usage: picklog cmd1 cmd2 ... < input
Allowed commands:
  RE ... same as find RE
  find RE ... scan till the line where RE is found
  pick RE(what) ... pick something from the current line or the first next matching line
  next ... advance to the next line of input
  nl ... add a newline to the output
  let:VARNAME RE(what) ... search for RE and store it in internal variable VARNAME
  watch:VARNAME RE(what) ... simultaneously search for RE and store it in
                             internal variable VARNAME, whenever found
  count:VARNAME RE ... like watch but store just the number of lines that matched RE so far
  print:VARNAME ... print internal variable VARNAME
A 'nl' is always added at the end of the commands.
The commands are looped, as long as there are some input lines to read.
The variables are useful for finding the last something before something else or for swapping columns in the output.
'pickre' is extremely useful to collect specific information from every line. The $1 output of a given regexp is prepended (tab-delimited) at the beginning of every line.
pickre --help
    Unknown option: help
    pickre --re=what_to_pick
    Searches for a token on every line and precedes the line with an extra
    column containing the token (if found).
    Options:
      --collect ... delimit with space all collected tokens
      --uniq ... ignore number of occurrences and order of found tokens
                 (implies collect)
      --pick ... print only lines where something was indeed found
      --delim=' ' ... join multiple tokens of output with this delimiter
      --cut ... don't append the original line (used to be called just-output)
      --col=N ... pick only from column N (numbered from 1)
'picktags' is a lazy hack for lazy programmers. It extracts values from specified SGML tags, without actually checking anything about the SGML.
You'd probably find LT XML tools much more versatile: http://www.ltg.ed.ac.uk/software/xml/index.html
picktags --help
    usage: picktags "MMt.*?" ...
    outputs tab-separated file of *first* values of matching tags.
    Tags assumed *non*pair.
    Beware using greedy *! It would eat up also the end of the tag.
'prefix' prepends every line with ARG1.
prefix --help
    Unknown option: help
'quantize' is a first step to histogram. It replaces every value in the specified column with the label of the "box" where the value fits.
quantize --help
    usage: quantize colindex boxesdesc < infile > outfile
    Options:
      --skip=N ... dump first N lines without quantizing
      --min=X ... make the label of the first box start at X
      --max=X ... make the label of the last box end at X
      --seq=A,STEP,B ... set box delimiters to start at A and stop at B,
                         stepping by STEP. E.g.: --seq=1,3,20 is equivalent to
                         specifying boxesdesc as: 1,4,7,10,13,16,19,20
      --seqprec=P ... the automatic boxes should be labelled with the specified
                      precision, e.g.: --seq=0,0.1,1 --seqprec=1 produces
                      0.1,0.2,...,0.9,1.0
      --discrete ... the values are discrete, so the boxes should be labelled
                     1 - 10, 11 - 20, 21 - 30 and not 1 - 10, 10 - 20, 20 - 30
      --method=even|histogram ... automatically guess box boundaries
      --boxes=N ... number of boxes to use when automatically guessing
'recut' is a simple 'cut' but unlike 'cut' allows for repetitions and reordering of columns.
recut --help
    Bad item --help at /home/obo/tools/vimtext/recut line 51.
'remove_blanks_lines' does just what you expect and nothing more. Easier to write than "grep -v '^$'".
'revfields' reverses the order of fields on every line.
To sort files by their extensions (suffixes), you might use "ls -1 | revfields --delim=. | sort | revfields --delim=."
To cut the last column of a file, use "revfields | cut -f1".
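The core reversal can be sketched with awk (a stand-in for the real script, which also takes --delim):

```shell
# Emit the tab-delimited fields of each line in reverse order.
printf 'a\tb\tc\n' |
  awk -F'\t' '{ out = $NF; for (i = NF - 1; i >= 1; i--) out = out FS $i; print out }'
```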
'round' rounds all the numbers it can find to a specified precision (given as ARG1).
If prec < 10, rounds to 'prec' decimal places; if prec >= 10, rounds to whole "precs" (such as whole tens, thousands...).
'sample_nth' reads stdin (loads it whole!) and produces FILECOUNT files, each with N lines, taking evenly selected lines from the input. The remaining lines are printed to stdout. Use '%i' in outname as the placeholder for filecount.
sample_nth --help
    sample_nth N OUTFILENAME < input > remaining_lines
    ...reads stdin (loads it whole!) and produces FILECOUNT files, each with N
    lines, taking evenly selected lines from the input. The remaining lines are
    printed to stdout. Use '%i' in outname as the placeholder for filecount.
    Options:
      --files=FILECOUNT ... default: 1
'see' is nearly equivalent to "sort | uniq -c". Nearly, because some 'uniq's tend to use space instead of tab as the delimiter.
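The delimiter nit can be patched by hand, which is presumably what 'see' saves you from (a sketch, not the actual script):

```shell
# sort | uniq -c, then rewrite "   COUNT line" as "COUNT<TAB>line".
printf 'b\na\nb\n' | sort | uniq -c |
  awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
```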
'seqcheck' checks if the stdin contains a non-interrupted sequence of rising integers.
'shuffle' is the missing complement to 'sort'. Correctly shuffles the input lines.
'skip' is the missing complement to 'head'. Moreover, it can also redirect the skipped lines to a file.
skip --help
    skip <number_of_lines>
    Skips the specified number of lines and 'cat's the rest.
    Options:
      --save=filename ... save the skipped lines here
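Without the --save option, 'skip N' behaves like 'tail -n +(N+1)':

```shell
# Skip the first 2 lines, i.e. start output at line 3.
printf '1\n2\n3\n4\n' | tail -n +3
```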
'skipbetween' removes specified section(s) of stdin. The sections are identified by a beginning regular expression and ending regular expression. As an option, a specified file can be inserted at the place of the removed section.
Alternatively, skipbetween can be used to select only the marked sections.
skipbetween --help
    Usage: skipbetween from_RE to_RE
    ...will skip all lines from stdin that are between lines matching from_RE
    and to_RE (included)
    ...can skip several such blocks
      --inverse ... print only the lines between
      --until ... stop (and exit) right after the first found string
      --insert=filename ... replace the skipped section with the contents of
                            the file (not really compatible with --inverse)
      --exclude-markers ... useful with --inverse
'skiplinecmd' is used to launch a specified command in a pipeline but circumvent it for the first --skip=X lines. Useful for all the commands that operate on a pipe but do not support --skip=X themselves.
'skipseen' is like uniq but the lines do not need to immediately follow each other to be deleted. In other words: only the first occurrence of any line is printed.
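The classic awk idiom does the same thing (a sketch of the behaviour, not the actual script):

```shell
# Print only the first occurrence of each line, preserving input order.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
```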
Given a regexp, 'solve_first' puts all the matching lines in front of the file and all the rest below. It's like running 'grep' and 'grep -v'. Optionally a blank line is inserted to separate the blocks.
This tool is extremely handy when manually editing any text-based database.
solve_first --help
    usage: solve_first <regular_expression>
    options:
      --sort ... the top lines are sorted by $1 of the reg. expression
      --delim ... add a blank line between the two parts
      --col=i ... check only the given column against the reg. expression
      --inverse ... put above if it does not match
      --blockwise ... match whole blocks of input instead of lines
      --just-matching ... do not print nonmatching blocks/lines
      --insens ... case insensitive
      --skip=<number_of_lines_to_blindly_copy>
'sparse_to_c4.5' converts a sparse matrix representation (from stdin) to input suitable for c4.5. The output is stored in the files ARG1.data and ARG1.names.
sparse_to_c4.5 --help
    usage: sparse_to_c4.5 baseoutputfilename
    Converts sparse matrix input into data suitable for c4.5
    Input line sample: answer group1/var1:value1 group2/var3:valueB
    Options:
      --test=filename ... build unseen test dataset
      --coldelim=str ... the delimiter between columns/items on each line
      --outdelim=str ... the delimiter to use in the output.data file
      --valdelim=str ... the delimiter between varname and value
      --groupdelim=str ... the delimiter between groupname and varname
      --use=group1 --use=group2 ... list of group names of attributes to be used
      --usemore=group1,group2 ... same as --use=group1 --use=group2
      --nondirected ... the first column is no special 'answer' attribute
      --defvalue=str ... the default value if there is no valdelim found in an item
      --blankvalue=str ... the value if the column is not mentioned in a line
      --ignore-duplicit-values ... keep silent if more (equal) values are
                                   assigned to a column
Given a column number, 'split_at_colchange' adds an empty line whenever the value in the column changes. Useful as a preprocessing before 'blockwise'.
'split_even' reads stdin saving the lines to N output files so that each of the files will contain about the same number of lines.
split_even --help
    Unknown option: help
Given a column number, 'split_to_files' uses the value of the column to create a separate file for all the lines. Note that we accumulate open files until the end of the input stream, so we may easily run out of file descriptors.
'split_train_test' reads all input lines, shuffles them and then produces a training file and a test file.
split_train_test --help
    Unknown option: help
    split_train_test outtrainingfile outtestfile < lines
    Options:
      --limit=N ... shuffle all lines, but use only first N
      --parts=N ... use 1/N of lines as the test data
'suffix' appends ARG1 at the end of every line.
suffix --help
    Unknown option: help
    usage: suffix <prefix> < infile > outfile at /home/obo/tools/vimtext/suffix line 17.
'tab2list' is a complement to list2tab. It reads a 2-dimensional table and produces a list in the form: row label - column label - value.
Think about using tab2list on two separate tables, concatenating the lists and then formatting it to a single table with list2tab.
tab2list --help
    usage: tab2list < infile > outfile
    Options:
      --keycols=X ... how many cols from the left are to be kept (default: 1)
      --no-headline ... the first line is a content line already; there are
                        no column names
'tdf_to_c4.5' converts a tab-delimited file (from stdin) to input suitable for c4.5. The output is stored in the files ARG1.data and ARG1.names.
tdf_to_c4.5 --help
    usage: tdf_to_c4.5 baseoutputfilename
    Converts tabbed input into data suitable for c4.5
    Options:
      --test=filename ... build unseen test dataset
      --baseline ... for the independent data set estimate the baseline
                     (assign most frequent) and oracle (if only never seen
                     data are wrong) rates
'tdf_to_xml' converts stdin to stdout, wrapping tab-delimited data with HTML/XML tags: <table><tr num="rowindex"><td num="colindex">...
tdf_to_xml --help
    tdf_to_html < tab-delimited > table.xml
    Options:
      --border=i ... border width
      --noescape ... do not escape <>&
      --format=html|docbook ... tags are named table/tr/td or tbody/row/entry
'tpr_tnr' reads a column in the input, calculating true positive rate and true negative rate at various cut-offs. The column is expected to contain 1 in case of a positive example and 0 in case of negative example. If the lines are in perfect order, all positives come before all negatives. Use this script on files sorted non-perfectly to examine the accuracy of the sorting compared to the golden positive-negative dichotomy.
tpr_tnr --help
    usage: tpr_tnr <col_index> < infile > outfile at /home/obo/tools/vimtext/tpr_tnr line 26.
'transpose' swaps rows and columns in the given tab-delimited table.
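A transposition of this kind can be sketched in awk, holding all cells in memory (a stand-in, not the actual script):

```shell
# Swap rows and columns of a tab-delimited table.
printf 'a\tb\nc\td\n' | awk -F'\t' '
  { for (i = 1; i <= NF; i++) cell[NR, i] = $i; if (NF > w) w = NF }
  END {
    for (i = 1; i <= w; i++) {
      row = cell[1, i]
      for (j = 2; j <= NR; j++) row = row "\t" cell[j, i]
      print row
    }
  }'
```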
'tt' pads a tab-delimited file with spaces so that it looks nice.
Please read the source code for more options.
'tuples' makes tuples: every n consecutive lines are joined on one line, delimited with a tab.
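For a fixed n, plain 'paste' reading stdin several times does the same (add more '-' arguments for larger tuples):

```shell
# Join every 2 consecutive lines with a tab.
printf '1\n2\n3\n4\n' | paste - -
```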
'ua' strips off all accents over all latin letters. The name stands for 'unicode to ascii'.
'unlcat' reverses what lcat does. It reads (the first) column of the file and appends all lines to a file with the same name as specified in the column.
unlcat --help
    unlcat < input
    Produces a separate file for all values seen in column 1.
    Options:
      --col=N ... use column N instead of 1
      --changename=suffix | prefix*suffix ... modify the column value before
                                              creating the file
'unsee' reverses the output of 'see', i.e. produces each line as many times as written in the first column.
'unziplines' reads stdin and produces FILECOUNT files with lines "unzipped". If you combine all output files in the original order using 'ziplines', you get the original input stream. Use '%i' in outname as the placeholder for filecount
unziplines --help
    unziplines N OUTFILENAME < input > remaining_lines
    'unziplines' reads stdin and produces FILECOUNT files with lines
    'unzipped'. If you combine all output files in the original order using
    'ziplines', you get the original input stream. Use '%i' in outname as the
    placeholder for filecount.
'update_cols' updates stdin and produces stdout according to the given update file. All files are expected to be tab-delimited.
update_cols --help
    update_cols update_file <stdin >stdout
    Options:
      --paste= ... comma-delimited list of indices of cols in the pasted
                   (update) file, with respect to the columns in the main file.
                   Use 0 to ignore the column.
      --keys= ... comma-delimited set of col indices from the pasted file that
                  serve as keys to decide whether the update is to be done on
                  the current line
      --trim ... strip whitespace from data before mapping
    Limitations:
      Update file is read into memory.
    Example: --keys=1 --paste=0,3,4
      If the value of column 1 in the main input is equal to a value in
      column 1 of a line in the update file, then that line of the update file
      is used as follows: the first column is not used, the second column
      replaces column 3 and the third column replaces column 4 of the input
      line.
'weakly_correlated Xcol Ycol OutCol < input > output' replaces OutCol in the given data with a numeric value expressing the distance of the point x,y (from Xcol and Ycol) from the diagonal. The diagonal is the diagonal of the smallest rectangle containing all the data.
weakly_correlated --help
    Unknown option: help
    usage: weakly_correlated Xcol Ycol OutCol < input > output at /home/obo/tools/vimtext/weakly_correlated line 24.
'width' returns the maximum number of characters in one line of stdin/args. Tabs are counted as 1 char!
'xml_to_tdf' converts <table></table> elements to a plain tab-delimited text file. If several <table> elements are found, a blank line is put in between. Of course, neither nested tables nor cell spans are supported. Tabs in input are replaced with space. All XML tags are deleted. Basic XML entities are expanded.
Given any number of files as args, 'ziplines' produces a file where first comes a line from the first file, then a line from the second, etc. It produces blank lines if any of the files is shorter, so the output is as long as the longest file.
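For the two-file case, 'paste -d' with a newline delimiter produces the same interleaving (a sketch; the real tool takes any number of files):

```shell
# Interleave two files line by line; a shorter file contributes blank lines.
printf 'a1\na2\n' > f1.txt
printf 'b1\nb2\n' > f2.txt
paste -d '\n' f1.txt f2.txt
```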
'zwc' is a word counting utility (wc) on (normal and) compressed files. Allows also counting words in each column separately. Allows also grouping using a column value (but always within each file). Allows also quick estimate based on the beginning of the file.
The tools available on this page are distributed under any license at your option if the following conditions are met:
Please keep a reference to the original author (i.e. me, Ondrej Bojar) in any derivative work. If reasonable, make a web-page link here.