Advent Of Code 2020 Day 4: In Which I Regex
For Day 4, the problem was about validating passport entries in a file. Possible data fields in the passport are:
- byr (Birth Year)
- iyr (Issue Year)
- eyr (Expiration Year)
- hgt (Height)
- hcl (Hair Color)
- ecl (Eye Color)
- pid (Passport ID)
- cid (Country ID)
cid
was an optional field, for the entire problem, as it turns out,
so I’m not sure why it was even there to begin with.
Part one involved making sure that all of the required fields were present in a passport. This part was very easy to do, since it boils down to checking for the presence of certain character sequences in a string (misspellings or other formatting errors were not part of the problem):
fields = set(["eyr", "byr", "hcl", "pid", "ecl", "hgt", "iyr"])
def has_all_fields(passport):
if all([x in passport for x in fields]):
return True
return False
The second part was a little more interesting. Various rules were laid out for the values associated with the different fields, and the second layer of validation involved checking the constraints. The first task was to retrieve the values from the fields. I chose to use regular expressions, because they’re fun:
byr_rgx = re.compile("(?<=byr:)([0-9]{4})")
iyr_rgx = re.compile("(?<=iyr:)([0-9]{4})")
eyr_rgx = re.compile("(?<=eyr:)([0-9]{4})")
hgt_rgx = re.compile("(?<=hgt:)([0-9]{2,3})(cm|in)")
hcl_rgx = re.compile("(?<=hcl:#)([a-f0-9]{6})")
ecl_rgx = re.compile("(?<=ecl:)(amb|blu|brn|gry|grn|hzl|oth)")
pid_rgx = re.compile("(?<=pid:)([0-9]{9})(?![0-9])")
cid_rgx = re.compile("(?<=cid:)([0-9]+)")
For each passport, the different regular expressions were used to extract the values, and then any additional constraints were applied. For example, with the expiration year validation, the regular expression would only match on values that container exactly four digits:
eyr_rgx = re.compile("(?<=eyr:)([0-9]{4})")
Let’s break down the expression being used here. (?<=eyr:)
is a
look-behind assertion
. For an expression preceded by a look-behind to
match, the look-behind must be matched as well, but it’s not part of
what “matched”, so to speak.
[0-9]
is a matching group. Anything in the brackets will match, for
a singe character. The 0-9
is shorthand for 0123456789
, or all of
the digits. To use a matching group like this for more than one
character, you would add a quantifier, which is done here, with the
curly braces: {x}
means “must match exactly x
times”. So, the
expression [0-9]{4}
translates to “exactly four digits”. The entire
expression would be expressed in English as “exactly four digits that
are preceded by eyr:
”.
No match, and the passport was rejected as invalid:
eyr = re.search(eyr_rgx, passport)
if eyr is None:
return False
Next, the value had to satisfy $$2020 \le eyr \le 2030$$:
if not 2020 <= int(eyr.group()) <= 2030:
return False
Most of the rules worked the same way, requiring a certain number of digits and then a range. Height, however, did not work this way, units were also included:
hgt_rgx = re.compile("(?<=hgt:)([0-9]{2,3})(cm|in)")
This regular expression can be read as “match at least two and at most three
digits, and either cm
or in
, all preceded by hgt:
”. The
parentheses here are called “capture groups”, and can be used to
separate parts of a match from each other. Note that parentheses
aren’t capture groups when they’re part of look-behind or look-ahead
assertions.
To validate the height:
hgt = re.search(hgt_rgx, passport)
if hgt is None:
return None
hgt, unit = int(hgt.group(1)), hgt.group(2)
if unit == "cm" and not (150 <= hgt <= 193):
return False
if unit == "in" and not (59 <= hgt <= 76):
return False
If you want to suffer have some fun with regular expressions, you
can try out Regex Crossword, a website
where you can do crossword puzzles that are all regular expressions!
If you work with regular expressions a lot, you should find a regex tester that you like, and use it to help you write expressions. I like Regex101, it has support for Python, which is where I’m usually using regular expressions.
One thing to keep in mind is that there’s no real specification about
what a regular expression engine has to provide. All of the
expressions I wrote, for example, won’t with the <regex>
functionality in the C++ standard library, because C++ doesn’t support
look-behind or look-ahead assertions.