Understanding Regular Expressions

I just watched a YouTube video called “5 must have skills to become a programmer” by TechLead. The ex-Google employee named the number one skill as regular expressions. He states they are useful for searching large text files for patterns with tools like grep (a tool I’m still unfamiliar with 😅).

He says “every company I’ve worked for has heavily depended on regular expressions”. I agree. There’s no escaping them. Every time you see input validation, you’ll likely find a regex behind it. I’ve seen code that pulls in files for tests based on regular expressions. Being able to match strings like filenames or their extensions is powerful.

I’ve been frightened of regular expressions since university. I thought I could avoid them. It’s no wonder TechLead is surprised, “students are not really knowledgeable about [them]”. Because they’re terrified! The complicated mess of symbols and characters makes you want to run away and cry.

A typical regular expression (Stack Overflow)

Software developers need to understand the benefits and syntax of regular expressions. As TechLead says, “I use them all the time” and “unlike [the author of this article], I’m like, a pro at it”.

What the hell is a RegEx?

A regular expression matches patterns of characters in a string. Characters are anything in a set like ASCII or Unicode. That means:

  • Letters (international) – abcdefg
  • Numbers – 1234567890
  • Symbols – !@#$%^&*
  • Special characters (spaces, tabs etc)
  • Yes, emojis too
    🍌 🍉 🍇 🍓 🍈 🍒
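
To show that emojis really are just characters to a regex, here is a small Python sketch (Python is my choice for the examples, and the sample text is made up). It puts emojis inside a character class the same way you would put digits in one.

```python
import re

# Made-up sample text mixing letters, digits and emojis
text = "Snack list: 🍌 🍉 🍇 🍌 and 3 apples"

# A character class can hold emojis just like letters or digits
fruit = re.findall(r"[🍌🍉🍇]", text)
digits = re.findall(r"[0-9]", text)

print(fruit)
print(digits)
```

Each emoji here is a single Unicode character, so the class matches them one at a time.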

As an example, consider the following string.

xxxxxxxxAxxxxxxxx

There’s clearly a capital ‘A’ character there. Imagine if we had a file with 10,000 lines of the string with variations that included different characters from A-Z and we wanted to find out how many contained the character ‘G’.

We’d use a regex.

Using a stream reader or similar, we can open the file, read each line one-by-one and check how many times the regular expression finds a match.

‘xxxxxxxx[A-Z]xxxxxxxx’

This uses the range [A-Z] to match any uppercase English letter from A to Z, with the exact number of ‘x’ characters on each side. To count only the lines containing ‘G’, replace the range with a literal G.
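
Here is a rough sketch of that line-by-line count in Python. The post doesn’t name a language, so Python is an assumption, and the function name count_matching_lines and the sample lines are made up for illustration.

```python
import re

SIDE = "x" * 8  # eight x characters on each side, as in the example string


def count_matching_lines(lines, pattern):
    """Count the lines that fit the pattern from start to end."""
    regex = re.compile(pattern)
    # fullmatch only succeeds when the whole line fits the pattern
    return sum(1 for line in lines if regex.fullmatch(line.rstrip("\n")))


# Hypothetical stand-in for the 10,000-line file
sample_lines = [
    SIDE + "A" + SIDE + "\n",
    SIDE + "G" + SIDE + "\n",
    SIDE + "g" + SIDE + "\n",  # lowercase, so [A-Z] does not match
    "xxxxGxxxx\n",             # wrong number of x characters
]

any_capital = count_matching_lines(sample_lines, SIDE + "[A-Z]" + SIDE)
only_g = count_matching_lines(sample_lines, SIDE + "G" + SIDE)

print(any_capital, only_g)
```

Passing an open file object instead of the list should work the same way, since iterating over a file yields one line at a time.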

There are heaps of regex flavors, but the syntax is fairly cross-platform in my experience. You can check https://www.regular-expressions.info/refflavors.html for compatibility.

When would you use them?

  • Validating an email address
  • Validating a phone number
  • Checking filename extensions are of types – eg, .jpg, .gif, .png
  • Checking filenames
  • Checking whether a filepath contains a word or combination of words
  • Matching url domain names
  • Searching for a pattern that contains only digits
  • Searching for a pattern that contains only letters of the English alphabet
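
A few of these checks can be sketched with Python’s re module. The patterns and function names below are illustrative assumptions and deliberately simple. Real email or phone validation needs far more care than a one-line regex, so those are left out.

```python
import re

# Simplified, illustrative patterns (assumptions, not production validators)
IMAGE_EXTENSION = re.compile(r"\.(jpg|gif|png)$", re.IGNORECASE)
ONLY_DIGITS = re.compile(r"[0-9]+")
ONLY_ENGLISH_LETTERS = re.compile(r"[A-Za-z]+")


def is_image_filename(filename):
    """True when the name ends in .jpg, .gif or .png (any letter case)."""
    return IMAGE_EXTENSION.search(filename) is not None


def is_only_digits(text):
    """True when the whole string is one or more digits 0-9."""
    return ONLY_DIGITS.fullmatch(text) is not None


def is_only_english_letters(text):
    """True when the whole string is one or more letters a-z or A-Z."""
    return ONLY_ENGLISH_LETTERS.fullmatch(text) is not None


def path_contains_word(path, word):
    """True when the word appears in the path as a whole word."""
    # re.escape stops characters like '.' in the word acting as regex syntax
    return re.search(r"\b" + re.escape(word) + r"\b", path) is not None


print(is_image_filename("holiday.PNG"))
print(is_only_digits("2024-01-31"))
print(is_only_english_letters("Regex"))
print(path_contains_word("src/tests/data.txt", "tests"))
```

The digit and letter checks use fullmatch so that the entire string has to fit the pattern, not just part of it.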

Using Regex with Grep

You often hear a developer suggest “use grep” to find something. I never knew what it was because I’m a Windows user and ignore any suggestions to use *nix-based tools. I hear it so often that I’m intrigued to know what it is.

grep searches for PATTERNS in each FILE.

That’s taken from the Linux manual page. It’s a regex file search tool. That’s it. Git Bash for Windows gives you access to the tool. I booted it up and ran some basic expressions on some notes:

Using grep in Git Bash for Windows

I did it. This isn’t spectacular usage but I’m now familiar with the tool and won’t shy away when asked to ‘grep’ a file.
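
To make “a regex file search tool” concrete, here is a minimal Python sketch of the core idea: keep the lines that contain a match. It is only an approximation for illustration, not how grep itself is implemented, and Python’s regex syntax differs in places from grep’s default syntax. The name grep_lines and the sample notes are made up.

```python
import re


def grep_lines(pattern, lines):
    """Return (line_number, line) pairs for every line containing a match."""
    regex = re.compile(pattern)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        # search finds the pattern anywhere in the line
        if regex.search(line)
    ]


# Hypothetical notes, standing in for a real file
notes = [
    "Buy milk\n",
    "Learn regex basics\n",
    "Call the dentist\n",
    "Try grep with a regular expression\n",
]

matches = grep_lines(r"reg(ex|ular)", notes)
for number, line in matches:
    print(f"{number}:{line}")
```

Unlike the earlier whole-line check, this uses search, so a line counts as soon as the pattern appears anywhere inside it.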

Windows has an equivalent tool called findstr. I tried it in PowerShell and it’s roughly grep’s equivalent, but it’s less intuitive and threw up different results. I’m not using it properly, but you’d expect it to behave like grep. Windows just has to be different. Don’t mix this up with find, its older brother, which has no regex support. For some reason you need to add a space after the search string? Weird.

Using findstr in PowerShell
Using find in PowerShell

Conclusion

Regex is a great tool and certainly does not suck. It comes with a learning curve, but if you can master it you will have a skill that will help you for the rest of your life.

Check these out

https://regexone.com/
https://www.regular-expressions.info