Thursday, April 06, 2017

HLint 2.0 - with YAML configuration

Summary: I've just released HLint 2.0, which lets you configure the rules with a YAML file.

I've just released HLint 2.0 to Hackage. HLint has always been configurable by writing a specially crafted Haskell file (e.g. to ignore certain hints or add new ones). With HLint 2.0 we now support YAML configuration files, and moreover encourage them over the Haskell format.

New Features of 2.0

Perhaps the most useful feature of this version is that HLint will search the current directory upwards for a .hlint.yaml file containing configuration settings. If a project includes a .hlint.yaml in the root then you won't need to pass it on the command line, and there's a reasonable chance any editor integration will pick it up as well. This idea was shamelessly stolen from Stylish Haskell.

HLint configuration files have also moved from Haskell syntax with special interpretation to YAML syntax. The Haskell syntax had become quite overloaded and increasingly confusing. The YAML syntax gives a fresh start with a chance to properly specify configuration directly rather than encoding it as Haskell expressions. The YAML configuration format enables a few features in this release, and going forward should enable more. To create a template .hlint.yaml file run hlint --default > .hlint.yaml. HLint continues to work without a configuration file.

In addition to a bunch of little hints, there is now a hint to suggest giving modules explicit export lists. I tend to always favour export lists, using module Foo(module Foo) where in the rare case they seem genuinely too much work. If you object to the hint, simply add to .hlint.yaml:

- ignore: {name: Use module export list}

On the other extreme, if you always want to require a complete and explicit export list (banning the trick above) do:

- warn: {name: Use explicit module export list}

That enables the off-by-default hint requiring a complete and explicit export list. To see which off-by-default hints your program triggers, pass the command line argument --show.

The biggest new feature is that you can now mark certain flags/extensions/imports/functions as being restricted - maintaining either a whitelist or a blacklist, plus exceptions. As an example, HLint itself contains the restrictions:

- functions:
  - {name: unsafeInterleaveIO, within: Parallel}
  - {name: unsafePerformIO, within: [Util.exitMessageImpure, Test.Util.ref]}
  - {name: unsafeCoerce, within: []}

These unsafe functions can only be used in the modules/functions listed (or nowhere for unsafeCoerce), ensuring no code sneaks in unexpectedly. As another example:

- flags:
  - default: false
  - {name: [-fno-warn-incomplete-patterns, -fno-warn-overlapping-patterns]}

Here we have used default: false to say that any flags not explicitly allowed cannot be used. If someone accidentally checks in development code with {-# OPTIONS_GHC -w #-} then HLint will catch it. This restriction feature is particularly designed for code reviews.
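Restrictions aren't limited to functions and flags. As a rough sketch of the same idea applied to extensions and modules (the exact field names are worth double-checking against the HLint documentation), you could write:

- extensions:
  - default: false
  - {name: [DeriveDataTypeable, GeneralizedNewtypeDeriving]}

- modules:
  - {name: Control.Arrow, within: []}

With the above, only the two listed extensions may be enabled, and importing Control.Arrow is banned everywhere.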

What Might Have Broken

This release changes a lot of stuff. I am aware the following will no longer work:

  • The HLint2 API has been deleted - I don't think anyone was using it (the HLint and HLint3 ones remain unchanged). I'll continue to evolve the API in future releases.
  • The Haskell configuration file no longer has a way of importing other configuration files. At the same time I made it so the default configuration file and internal hints are always included, which hopefully offsets most of the impact of ignoring hint imports.
  • I may have broken some other things by accident, but that's why there is a bug tracker.

At the moment the Haskell and YAML configuration files are both supported. However, I'll be phasing out the Haskell configuration in the future, and encourage people to move to the YAML. The conversion should be pretty simple.

Monday, April 03, 2017

Code Review Reviewed

Summary: I used to be mildly against code review on all merges. Now I'm for it. Code review is a good idea for knowledge sharing, not spotting bugs.

For my open-source projects, I accept pull requests from external contributors which I review, but my contributions are probably not routinely looked at by anyone else. I suspect most non-huge open-source projects follow the same pattern. For larger projects code review seems more common, but far from ubiquitous.

Until recently, I was in favour of occasional code review, e.g. towards the end of a project taking a look at the code. The problem with that approach is that it is hard to find the right time - at the end there is little appetite for large-scale refactoring, and while the project is busy cranking out code there is no time and the code is still changing. While I was a fan of this idea in theory, it never worked in practice.

Some teams use code review on every merge to master, perhaps several times per person per day. That's something I used to disagree with, but I asked some smart people, and their explanations changed my opinion. The important realisation was thinking about what code review is for.

  • Checking the code works: NO. Code review is not about checking the code works and is bug free. I have tests, and leverage the type system, which is meant to convince me and other readers that my code is correct. For certain very sensitive bits a second set of eyes checking for bugs may be useful, but if all your code is that sensitive it will also be bug-ridden. However, code review may spot the absence or inadequacy of tests.
  • Checking simple rules and conventions: NO. Often projects have rules about which packages/extensions can be used, tab width and line length etc. While these checks can be carried out by a human, they are much better automated. Even when "code reviewing" students' work at York University (aka marking) it quickly became clear that I was wasting time on trivialities, so I wrote HLint to automate those checks.
  • High-level design: A BIT. Code review should think about the high-level design (e.g. should this be a web server or command line tool), but often the code is too detailed to properly elucidate these choices. High-level design should usually be done on a bug tracker or whiteboard before writing code.
  • Mid-level design: YES. Code review is often very good at challenging details around the mid-level design - e.g. does the choice of Text vs ByteString make sense, is there some abstraction that needs introducing. The author of some code is often heavily influenced by the journey they took to get there; by seeing the code without that history, a reviewer can often suggest a better approach.
  • Spreading knowledge: YES. Code review is great for spreading knowledge of the code to others in the team. Code review requires the code to be written in such a way that someone else can understand it, and ensures that there is someone else who is familiar with it. For larger projects that's invaluable when someone goes on holiday.
  • Team cohesion: YES. Code review requires different members of the team to share their work with each other, and to have an appreciation of what other people are working through and the challenges it presents. It's all too easy to declare a problem "easy" without understanding the details, and code review removes that temptation.
  • Training: YES. Code review is also good for teaching new techniques and approaches to the submitter. As an example, some problems are trivial to solve with a generics library, yet some programmers aren't familiar with any - code review helps with ongoing training.

In contrast, I think the disadvantages of code review all revolve around the time required:

  1. It takes other people's time to do the review, when they could be doing something else.
  2. It takes the submitter's time waiting for a review - often the next step is hard to do without the previous step being agreed. As a result, I think code review work should be prioritised.
  3. As a consequence of the two previous issues, code review changes the "unit of work", leading to more changes in a single bundle.

These issues are real, but I think the benefit of knowledge sharing alone outweighs the cost.

Friday, December 23, 2016

Fuzz testing Hexml with AFL

Summary: Hexml 0.1 could read past the end of the buffer for malformed documents. Fuzz testing detected that and I fixed it in Hexml 0.2.

I released Hexml, my fast DOM-based XML parser, and immediately Austin Seipp got suspicious. Here was a moderately large piece of C code, taking untrusted inputs, and poking around in the buffer with memcpy and memchr. He used American Fuzzy Lop (AFL) to fuzz test the Hexml C code, and came up with a number of issues, notably a buffer read overrun on the fragment:

<a b=:fallback

With a lot of help from Austin I setup AFL, fixed some issues with Hexml and with how AFL was being run, released Hexml 0.2 fixing these issues and incorporated AFL into my Travis CI builds.

If you want to actually follow all the steps on your computer, I recommend reading the original GitHub issue from Austin. Alternatively, checkout Hexml and run sh afl.sh.

Building and installing AFL

The first step was to build and install AFL from the tarball, including the LLVM pieces and libdislocator. The LLVM mode allows faster fuzzing, and libdislocator is a library that places every allocation right next to a page boundary - ensuring that a buffer read overrun results in a segfault that AFL can detect.
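As a rough sketch of those steps (the tarball name and install paths are assumptions - the afl.sh script in the Hexml repository has the exact commands I use):

# Sketch only: build AFL itself, then the LLVM mode and libdislocator
tar -xzf afl-latest.tgz && cd afl-*
make && sudo make install               # afl-fuzz, afl-gcc, the dictionaries, ...
make -C llvm_mode && sudo make install  # adds afl-clang-fast (needs clang/LLVM)
make -C libdislocator                   # builds libdislocator.so
sudo cp libdislocator/libdislocator.so /usr/local/lib/afl/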

An AFL test case

To run AFL you write a program that takes a filename as an argument and "processes" it. In my case that involves calling hexml_document_parse - the full version is online, but the salient bits are:

#include "hexml.c"
... other imports ...

int main(int argc, char** argv)
{
    __AFL_INIT();
    ... read file from argv[1] ...
    document *doc = hexml_document_parse(contents, length);
    hexml_document_free(doc);
    return 0;
}

Here I statically #include the hexml.c codebase and have a main function that calls __AFL_INIT (to make testing go faster), reads from the file, then parses/frees the document. If this code crashes, I want to know about it.

The original AFL driver code used __AFL_LOOP to speed things up further, but that results in a huge number of spurious failures, so I removed it.

Running AFL

To run AFL on my code requires compiling it with one AFL tool, then running it through another. The steps are:

AFL_HARDEN=1 afl-clang-fast -O2 -Icbits cbits/fuzz.c -o $PWD/hexml-fuzz
AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@

I compile with AFL_HARDEN to detect more bugs, producing hexml-fuzz. I run with libdislocator loaded so that my small buffer overrun turns into a fatal segfault. I give afl-fuzz a dictionary of common XML fragments and a few simple XML documents, then let it run over hexml-fuzz. The interactive UI shows bugs as they occur.

Fixing the bugs

Running AFL on Hexml 0.1 produced lots of bugs within a few seconds. Each bug produces an input file which I then ran through a debugger. While there were a few distinct bug locations, they all shared a common pattern. Hexml parses a NUL-terminated string, and in some cases I looked at a character that was potentially NUL and consumed it in the parsing. That might consume the final character, meaning that any further parsing was reading past the end of the string. I audited all such occurrences, fixed them, and reran AFL. Since then I have been unable to find an AFL bug despite lots of compute time.
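As a simplified sketch of the pattern (not the actual Hexml code), the buggy shape and its fix looked roughly like this:

// Simplified sketch of the bug class - not the real Hexml code.
// s points into a NUL-terminated buffer; *s may be the terminating NUL.
static void skip_close(const char **s)
{
    if (**s == '>' || **s == '\0')
        (*s)++;   // BUG: can step over the final NUL, so later parsing reads past the end
}

// Fixed: never consume a character that might be the terminator.
static void skip_close_fixed(const char **s)
{
    if (**s == '>')
        (*s)++;
}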

Running on CI

I run all my code on Travis CI to ensure I don't introduce bugs, and to make accepting pull requests easier (I don't even need to build the code most of the time). Fortunately, running on Travis isn't too hard:

AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so timeout 5m afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@ > /dev/null || true
cat afl-results/fuzzer_stats
grep "unique_crashes *: 0" afl-results/fuzzer_stats

I pipe the output of AFL to /dev/null since it's very long. I run for 5 minutes with timeout. After the timeout hits, I display the fuzzer_stats file and then grep for 0 crashes, failing if it isn't there.

Conclusions

Writing C code is hard, especially if it's performance orientated, and if it's not performance orientated you might want to consider a different language. Even if you don't want to use your code on untrusted input, sooner or later someone else will, and even tiny bugs can result in complete exploits. AFL does a remarkable job at detecting such issues and has made Hexml the better for it.

Monday, December 12, 2016

New XML Parser, Hexml

Summary: I've released a new Haskell library, Hexml, which is an incomplete-but-fast XML parser.

I've just released Hexml, a new C/Haskell library for DOM-style XML parsing that is fast, but incomplete. To unpack that a bit:

  • Hexml is an XML parser: you give it a string representing an XML document, it parses that string, and it returns either a parse error or a representation of that document. Once you have the document, you can get the child nodes/attributes, walk around the document, and extract the text.

  • Hexml is really a C library, designed to be easy to wrap in Haskell, with a Haskell wrapper on top. It should be easy to use Hexml directly from C if desired.

  • Hexml has been designed for speed. In the very limited benchmarks I've done it is typically just over 2x faster at parsing than Pugixml, where Pugixml is the gold standard for fast XML DOM parsers. In my uses it has turned XML parsing from a bottleneck to an irrelevance, so it works for me.

  • To gain that speed, Hexml cheats. Primarily it doesn't do entity expansion, so &amp; remains as &amp; in the output. It also doesn't handle CDATA sections (but that's because I'm lazy) and comment locations are not remembered. It also doesn't deal with most of the XML standard, ignoring the DOCTYPE stuff.

If you want a more robust version of Hexml then the Haskell pugixml binding on Hackage is a reasonable place to start, but be warned that it has memory issues that can cause segfaults. It also requires C++, which makes use through GHCi more challenging.

Speed techniques

To make Hexml fast I first read the chapter on fast parsing with Pugixml, and stole all those techniques. After that, I introduced a number of my own.

  • I only work on UTF8, which, for the bits of UTF8 I care about, is the same as ASCII - so I don't need to do any character decoding.

  • Since I don't do entity expansion, all strings are available in the source document, so everything simply provides offsets into the input string. In the Haskell API I use constant-time bytestring slices into the source string to present a nice API.

  • The memory model for a document is an array of attributes, an array of nodes, and a root node from the list of nodes. To make sure that scanning a document is fast, each node describes its attributes and direct child nodes in terms of a start and length within the attribute and node arrays (see the sketch after this list). For example, the root node might have attributes 1..5 in the attribute array, and direct children 4..19 in the node array. When scanning the child nodes there are no linked-list operations and everything is cache friendly.

  • To keep the memory compact for attributes, I just have an array and reallocate/copy as necessary. By always doubling the number of attributes on exhaustion I ensure an amortised worst case of one copy per attribute.

  • Keeping the memory compact for nodes is a bit more complex, as the direct child nodes are not necessarily allocated consecutively, because child nodes may themselves have child nodes. The solution is to have an array of nodes, with contiguous allocation of used child nodes starting at the beginning. To ensure the child nodes are contiguous I first put the nodes at the end of the array, then copy them once a child is complete -- in effect using the end of the array as a stack. By always doubling the number of nodes on exhaustion I ensure an amortised worst case of two copies per node.

  • When parsing the text in the body of a document, since I don't care about &, the only character that is of any interest is <. That allows me to process much of the document with the highly-optimised memchr.

  • I initially allocate a single buffer that contains the document, a small number of attributes and a small number of nodes, in a single call to malloc. If more attributes/nodes are required, a fresh buffer is allocated and the initially provided one is simply ignored. That ensures small documents don't pay for multiple malloc calls, at the cost of wasting the initial attribute/node allocation on larger documents (which are more memory heavy anyway - so it doesn't matter).

  • I'm pretty sure Hexml could be optimised further. Specifically, I have a recursive descent parser, and it should be a single function with goto. I also process some characters multiple times, mostly to ensure predictable abstraction barriers around the parsing functions, but that could be eliminated with a goto-based approach.
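To make the memory model above a bit more concrete, here is a minimal sketch of the kind of structures involved - the field and type names are illustrative, not the real Hexml definitions:

// Illustrative sketch only - not the actual Hexml structures.
typedef struct { int start, length; } span;  // a slice, as offsets into an array or string

typedef struct {
    span name, value;        // attribute name/value as slices of the source document
} attr;

typedef struct {
    span name, inner;        // node name and body, again slices of the source
    span attrs;              // start/length into the document-wide attribute array
    span children;           // start/length into the document-wide node array
} node;

typedef struct {
    const char *source;      // the input document, NUL terminated
    attr *attrs; int attr_count, attr_capacity;  // doubled on exhaustion
    node *nodes; int node_count, node_capacity;  // doubled on exhaustion
} document;

Scanning the children of a node is then just a loop from nodes[children.start] to nodes[children.start + children.length - 1], with no pointer chasing.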

Installing the Haskell Network library on Windows

Summary: This post describes how to install the Haskell network library on Windows, again.

I recently bought a new computer, and tried to install GHC 8.0.1 then upgrade the network library using Cabal. As I have come to expect, it didn't work. Using Git Bash, I got the error:

$ cabal install network-2.6.3.1
Resolving dependencies...
Configuring network-2.6.3.1...
Failed to install network-2.6.3.1
Build log ( C:\Users\Neil\AppData\Roaming\cabal\logs\network-2.6.3.1.log ):
Configuring network-2.6.3.1...
configure: WARNING: unrecognized options: --with-compiler
checking for gcc... C:\ghc\GHC-80~1.1┼║
checking whether the C compiler works... no
configure: error: in `C:/Neil':
configure: error: C compiler cannot create executables
See `config.log' for more details
cabal: Leaving directory '.'
cabal.exe: Error: some packages failed to install:
old-time-1.1.0.3 failed during the configure step. The exception was:
ExitFailure 77

Running -v3 shows the CC variable is being set to C:\ghc\GHC-80~1.1┼║, which looks like a buffer corruption or encoding issue. I tried my previous solution, but it didn't work. My new solution is:

$ cabal unpack network-2.6.3.1
$ cd network-2.6.3.1
$ cabal configure
... fails with a similar error to above ...
$ sh ./configure
$ cabal build
$ cabal copy
$ cabal register

I had to repeat the same pattern for the latest version of old-time, and it worked there too.

Another way that works is to use Stack.


Tuesday, December 06, 2016

Undefined Behaviour in C

Summary: I tripped over undefined behaviour in C. It's annoying.

I've recently been writing some C code to parse XML quickly. While working on that project, I inadvertently wrote some code which is undefined according to the C language standard. The code compiled and ran fine using Visual Studio, but under gcc (even at -O0) it corrupted memory, sometimes leading to a segfault, but usually just leading to a wrong answer. The code in question was (see full code at GitHub):

d->nodes.nodes[0].nodes = parse_content(d);

To give some context, d is a structure that contains various pieces of state - what the string to be parsed is, how much we have parsed, along with a pointer to the output nodes. The parse_content function parses the bit inside an XML tag, returning the indices in nodes which it used.

The complication comes from nodes not being a fixed size, but being dynamically resized if the number of nodes exceeds the capacity. For big documents that means parse_content will reallocate d->nodes.nodes.

According to the C spec, the compiler can evaluate the LHS and RHS of an assignment in any order. Since gcc computes the location of d->nodes.nodes[0] before calling parse_content it uses the address of the node before reallocation. After reallocation the address will have changed, and the assignment will be made to the wrong location.

I spotted the bug by inserting printf statements, and in doing so, I had to rewrite the code to:

str content = parse_content(d);
d->nodes.nodes[0].nodes = content;

That fixes the issue, since now the evaluation order is strictly defined. As a simplified example of the same issue:

#include <stdlib.h>

char* array;

char f() {
    array = malloc(42);
    return 'x';
}

void test() {
    array = malloc(0);
    array[0] = f();
}

Here the line array[0] = f() might assign to either the result of malloc(0) or malloc(42), at the compiler's discretion.
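The fix is the same as in the real code - introduce a temporary so the order of evaluation is pinned down. A minimal sketch:

void test_fixed() {
    array = malloc(0);
    char c = f();   // f() runs first, so the reallocation has already happened
    array[0] = c;   // array is re-read here, so the write goes to the malloc(42) buffer
}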

I manually checked if I had made any other such mistakes, and I couldn't find any. Naturally, I wanted to find a static checker that could detect such a mistake, so I tried a bunch of them. I wasn't very successful:

  • Visual Studio 2015 code analysis made me write assert after each malloc, but nothing further.
  • PVS Studio found nothing.
  • Clang undefined behaviour found nothing, and seemingly doesn't work on Windows.
  • GCC undefined behaviour found nothing, and seemingly doesn't work on Windows.
  • RV-Match hit a stack-overflow when running the program.

Wednesday, November 23, 2016

The Haskell.org Website Working Group (HWWG)

Haskell represents both a language and a user community - and moreover a fantastic community full of friends, fun, and deep technical debate. Unfortunately, in recent times the community has started to fracture, e.g. Cabal vs Stack, haskell.org vs haskell-lang.org. These divisions have risen above technical disagreements and at some points turned personal. The solution, shepherded by Simon Peyton Jones, and agreed to by both members of the haskell.org committee and the maintainers of haskell-lang.org, is to form the Haskell Website Working Group (HWWG). The charter of the group is at the bottom of this post.

The goal of the Haskell Website Working Group is to make sure the Haskell website caters to the needs of Haskell programmers, particularly beginners. In doing so we hope to either combine or differentiate haskell.org and haskell-lang.org, and give people clear recommendations of what "downloading Haskell" means. Concretely, we hope that either haskell-lang.org redirects to haskell.org, or that haskell-lang.org ends up being used for something very different from today.



The Haskell Website Working Group (HWWG)

Scope and goals

  • The HWWG is responsible for the design and content of the user-facing haskell.org web site, including tutorials, videos, news, resources, downloads, etc.

  • The HWWG is not responsible for:
    • The infrastructure of haskell.org
    • Toolchains, Hackage, compilers, etc
  • The HWWG focuses on serving users of Haskell, not suppliers of technology or libraries.
  • An explicit goal is to re-unite the haskell.org and haskell-lang.org web sites.

Expected mode of operation

  • HWWG is not responsible for actually doing everything! The web site is on GitHub. Anyone can make a pull request. The general expectation is that uncontroversial changes will be accepted and committed without much debate.
  • If there is disagreement about a proposed change, it's up to the HWWG to engage in (open) debate, and to seek input as widely as time allows, but finally to make a decision.

Membership

Initial membership comprises:

  • Neil Mitchell (chair)
  • Nicolas Wu
  • Andrew Cowie
  • Vincent Hanquez
  • Ryan Trinkle
  • Chris Done

It is expected the committee will change over time, but the mechanism has not yet been thought about.

Rules of engagement

  • Recognising that honestly-held judgements may differ, we will be scrupulously polite both in public and in private.
  • Recognising that Haskell has many users, and that different users have different needs and tastes, we want haskell.org to be inclusive rather than exclusive, providing a variety of alternative resources (toolchains, tutorials, books, etc) clearly signposted with their intended audiences.
  • Ultimately the haskell.org committee owns the haskell.org URL, but it delegates authority for the design and content of the web site to the HWWG. In extremis, if the haskell.org committee believes that the HWWG is mismanaging the web site, it can revoke that delegation.

Thursday, September 29, 2016

Full-time Haskell jobs in London, at Barclays

Summary: I'm hiring 9 Haskell programmers. Email neil.d.mitchell AT barclays.com to apply.

I work for Barclays, in London, working on a brand new Haskell project. We're looking for nine additional Haskell programmers to come and join the team.

What we offer

A permanent job, writing Haskell, using all the tools you know and love – GHC/Cabal/Stack etc. In my first two weeks in the role I've already written parsers with attoparsec, both Haskell and HTML generators, and am currently generating a binding to C with lots of Storable/Ptr stuff. Since it's a new project, you will have the chance to help shape it.

The project itself is to write a risk engine – something that lets the bank calculate the value of the trades it has made, and how things like changes in the market change their value. Risk engines are important to banks and include lots of varied tasks – talking to other systems, overall structuring, actual computation, streaming values, map/reduce.

We'll be following modern but lightweight development principles – including nightly builds, no check-ins to master which break the tests (enforced automatically) and good test coverage.

These positions have attractive salary levels.

What we require

We're looking for the best functional programmers out there, with a strong bias towards Haskell. We have a range of seniorities available to suit even the most experienced candidates. We don't have anything at the very junior end; instead we're looking for candidates that are already fluent and productive. That said, a number of very good Haskell programmers think of themselves as beginners even after many years, so if you're not sure, please get in touch.

We do not require any finance knowledge.

The role is in London, Canary Wharf, and physical presence in the office on a regular basis is required – permanent remote working is not an option.

How to apply

To apply, email neil.d.mitchell AT barclays.com with a copy of your CV. If you have any questions, email me.

The best way to assess technical ability is to look at code people have written. If you have any packages on Hackage or things on GitHub, please point me at the best projects. If your best code is not publicly available, please describe the Haskell projects you've been involved in.