
Java RegExp Tutorial

Arteco - Information Technologies

Ramón Arnau

Manager at Arteco Consulting SL

Regular expressions (regex) are patterns that describe the shape of text strings, making it easy to detect and extract information within text.

Regular expressions are a great ally when it comes to searching and replacing textual content, greatly simplifying the search and substitution of variable terms.

What are regular expressions

Regular expressions, or simply regex, are a great ally when it comes to searching and replacing textual content. They could be considered the unsung heroes of text editing: they offer enormous power for matching text that varies, yet few programmers use them because the expressions look complex to define. In practice they are much easier to use than they seem, especially compared with the time required to write a lexer or a grammar, as could be done with ANTLR, for example. Moreover, a programmer who becomes familiar with them can take advantage of a whole range of operating system commands that search and replace content with them. The syntax of regular expressions is largely shared across tools, so the same knowledge applies in many different contexts.

Java API for regular expressions

The following lines describe how to write and execute regular expressions using the Java API to perform processes on data inputs, validation rules, or specific parsers that are often found in the programmer's everyday tasks.

The Java API for regular expressions is included by default in the JDK and therefore in the JRE without the need to add any additional dependencies. Its starting point is the construction of a Pattern object from the java.util.regex package. Pattern has a static method compile("...") which returns an instance, given a String containing the regular expression. This object should be reused between uses of the same regular expression to avoid incurring unnecessary compilation time repeatedly.
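
The advice about reusing the compiled object can be sketched as follows. This is a minimal illustration, not the only way to do it: the class name GreetingCounter and the method countGreetings are hypothetical names chosen for the example. The pattern is compiled once into a constant, and each call only creates a lightweight Matcher.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: the class name and the pattern are illustrative only.
final class GreetingCounter {
    // Compiled once when the class is loaded, then shared by every call.
    private static final Pattern GREETING = Pattern.compile("hola");

    // Counts how many lines contain the greeting.
    static int countGreetings(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            // Creating a Matcher is cheap; compile(...) is the step we avoid repeating.
            if (GREETING.matcher(line).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hola paloma", "adios", "la caracola dice hola");
        System.out.println(countGreetings(lines)); // prints 2
    }
}
```

A Pattern instance is immutable and safe to share between threads, which is why keeping it in a static final field is a common choice; Matcher instances, by contrast, should not be shared.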

Once compiled, the next step is to apply the expression to a text input through the object's matcher("...") method. This method returns a Matcher object that holds the results produced by applying the regular expression (regex) to the input. A Matcher is bound to the input it was created for, so a different input needs a new Matcher (or a call to reset(CharSequence) on the existing one). The first and most important result that Matcher provides is whether the regex was found in the input, which the boolean find() method reports. If the response is true, the programmer can obtain the part of the input that matched each of the groups defined in the regular expression through the group(int) method. Group 0 is the whole match, the first parenthesized group is 1, the second is 2, and so on.
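
As a small sketch of how find() and group(int) work together, the example below walks over every match in an input and reads group 0 (the whole match) along with groups 1 and 2. The class PairExtractor, the key=value pattern, and the sample input are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the key=value format is invented for the sketch.
final class PairExtractor {
    // Group 1 captures the key, group 2 captures the numeric value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    static List<String> extractPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher mat = PAIR.matcher(input);
        // Each call to find() advances to the next match, if there is one.
        while (mat.find()) {
            // group(0) is the whole match; group(1) and group(2) are the parenthesized parts.
            pairs.add(mat.group(0) + " | " + mat.group(1) + " | " + mat.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : extractPairs("width=10, height=20")) {
            System.out.println(pair);
        }
        // prints:
        // width=10 | width | 10
        // height=20 | height | 20
    }
}
```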

The following example shows how the pieces of the Java API fit together; how to define more complex regexes is described later. Imagine a simple case where you want to check whether an input contains the word "hola" (Spanish for "hello"), just as you would with String.contains().

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pat = Pattern.compile("hola");
// the input means "the conch says hello to the dove"
String input = "la caracola dice hola a la paloma";
Matcher mat = pat.matcher(input);
if (mat.find()) {
    System.out.println("regex found");
} else {
    System.err.println("regex NOT found");
}

The previous regular expression hola allows searching for the word within the content stored in the variable input. This expression is very simple as it does not provide any variability, just a literal.

However, suppose you want to write another expression that allows you to get the names of the subjects from the input to know who is greeting whom. In this case, you need the expression to accept variable inputs and also extract who the actors are involved in the greeting. The following snippet writes a regex that precisely does that, adding two variable parts to the expression and defining two groups to obtain the subjects.

String regex = "la (\\w+) dice hola a la (\\w+)";
Pattern pat = Pattern.compile(regex);
String input = "la caracola dice hola a la paloma";
Matcher mat = pat.matcher(input);
if (mat.find()) {
    System.out.println("regex found");
    System.out.println("Active subject: " + mat.group(1));
    System.out.println("Passive subject: " + mat.group(2));
} else {
    System.err.println("regex NOT found");
}

This new expression is longer and defines two variable parts, or groups, identified with parentheses containing the symbols \\w+. Java will try to match the input with the expression and will see that the term caracola matches the definition of group 1, while paloma matches the second group. Hence, these can be obtained using mat.group(1), or 2 for the second.

To understand this, we need to define what the symbols within the parentheses mean. The double backslash indicates that what you want to write in the expression is a single backslash. It must be written twice in a Java string literal because the backslash is Java's own escape character, like the one used in \n to indicate the end of a line. Here, however, what you really want to pass to the regex is a literal backslash \.

Next comes the letter w, which by itself has no special value, whereas \w has a specific meaning: any alphanumeric character or underscore, that is, any character that can form a word (w for word). It does not include spaces, tabs, or exclamation marks. But it stands for only 1 character, hence the last symbol, +, which indicates that what precedes it in the regex can appear 1 or more times. So, in Java, the combination \\w+ matches any word (without spaces) no matter how long. The concatenation of these symbols allows finding terms of different lengths like caracola and paloma.
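
To make the escaping and the \w+ combination concrete, here is a short sketch. The class WordFinder and its words method are hypothetical names for this example. It shows that the Java literal "\\w+" contains only three characters once compiled by javac, and that the pattern picks out words of any length while skipping spaces and punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class for illustration.
final class WordFinder {
    // In source code the backslash is doubled; the regex engine receives \w+
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns every run of word characters found in the input.
    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher mat = WORD.matcher(input);
        while (mat.find()) {
            found.add(mat.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The string holds three characters: backslash, w, plus.
        System.out.println("\\w+".length()); // prints 3
        System.out.println(words("la caracola dice hola, paloma!"));
        // prints [la, caracola, dice, hola, paloma]
    }
}
```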

RegExp Entities

Therefore, the programmer can already get an idea of how to go about writing regular expressions. Basically, it is about knowing the symbols that can be used and what each of them means. Next, the most important entities of regex are described:

Symbol            Meaning
( )               Parentheses define groups, with two purposes: to extract the part of the input that matched, and to apply a repetition operator to the whole group.
* + ?             The first (*) indicates that the preceding symbol can appear 0 or more times. The + indicates that the preceding symbol must appear at least once, with no maximum limit. The ? indicates that it can occur 0 or 1 time.
{n,m} {n,} {n}    The first means between a minimum of n and a maximum of m occurrences. The second means a minimum of n occurrences. The last means exactly n occurrences.
\w \W             In lowercase, it refers to any character considered part of a word. It is an alias of [a-zA-Z_0-9]. In uppercase, it is the negation, meaning the remaining characters not included in \w.
[ ]               Brackets specify sets of letters or symbols, regardless of order. Ranges are allowed with '-', such as from a to z with [a-z]. A set can be negated if its first character is ^. For example, [^0-9] means any character that is not a digit.
\d \D             In lowercase, it indicates a digit, synonymous with [0-9]. In uppercase, it indicates a non-digit or, in other words, [^0-9].
\s \S             In lowercase, it refers to whitespace characters: spaces, tabs (\t), line breaks (\n), and carriage returns (\r). In uppercase, it matches any character that is none of the above.
.                 The dot is the wildcard that matches any single character (by default, any except line terminators). It can be combined with * to accept any input: «.*»
|                 The vertical bar indicates alternation, meaning that input matching either the left or the right side of | is accepted. For example, "hello|good" (spaces around the bar would be matched literally).
^ $               Respectively indicate the beginning and end of the input sequence. An expression that includes both forces the whole input to match, not just a segment. Used separately, they express "starts with" or "ends with". For example, "^start.*end$".

These are the most commonly used symbols and constructions when defining regular expressions. See the following usage examples, applied in combination with some of them.
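
Before the combined examples, a quick sketch can check several of the symbols above one at a time. It relies on Pattern.matches(regex, input), which returns true only when the entire input matches; the class SymbolExamples and the sample strings are assumptions made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical example class; each line exercises one row of the table.
final class SymbolExamples {
    // True only if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("colou?r", "color"));        // true: ? makes the u optional
        System.out.println(matchesWhole("(ab)+", "ababab"));         // true: + applies to the whole group
        System.out.println(matchesWhole("\\d{2,4}", "12345"));       // false: more than 4 digits
        System.out.println(matchesWhole("[^0-9]+", "a1c"));          // false: contains a digit
        System.out.println(matchesWhole("\\s+", " \t\n"));           // true: space, tab, line break
        System.out.println(matchesWhole(".*", "a\nb"));              // false: dot skips line breaks by default
        System.out.println(matchesWhole("hello|good", "good"));      // true: right-hand alternative
        System.out.println(matchesWhole("hello | good", "hello"));   // false: the spaces are literal
        // With find(), the anchors ^ and $ tie the match to both ends of the input.
        System.out.println(Pattern.compile("^start.*end$").matcher("start middle end").find()); // true
    }
}
```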

The following expression extracts the fields of a date in the format dd/MM/yy or dd/MM/yyyy. Note that a group is defined within parentheses for each field (day, month, and year), so each value can be obtained using mat.group(i), where i = 1, 2, or 3.

// date parsing
String regex = "(\\d{2})/(\\d{2})/(\\d{2,4})";
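
A runnable sketch of this date expression might look like the following. The class DateFields and the method firstDate are hypothetical names, and returning null when nothing is found is just one possible convention.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class built around the date regex from the text.
final class DateFields {
    private static final Pattern DATE = Pattern.compile("(\\d{2})/(\\d{2})/(\\d{2,4})");

    // Returns {day, month, year} for the first date found, or null when there is none.
    static int[] firstDate(String input) {
        Matcher mat = DATE.matcher(input);
        if (!mat.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(mat.group(1)), // day
            Integer.parseInt(mat.group(2)), // month
            Integer.parseInt(mat.group(3))  // year, as written (2 to 4 digits)
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(firstDate("Delivered on 24/12/2023."))); // [24, 12, 2023]
        System.out.println(Arrays.toString(firstDate("05/01/99")));                 // [5, 1, 99]
    }
}
```

Note that the expression only checks the shape of the text: it also accepts a three-digit year (because of {2,4}) and values such as 99/99/999, so range validation would have to be done separately.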

In contrast, if you want to locate the text enclosed between the opening and closing tags of the nombre ("name") element in an XML file, it would be as follows:

String regex = "<nombre>(.*)</nombre>";
Pattern pat = Pattern.compile(regex);
String input = "... <nombre>Pepito</nombre> ...";
Matcher mat = pat.matcher(input);
if (mat.find()) {
  System.out.println("Regexp found");
  System.out.println("Subject: " + mat.group(1));
} else {
  System.err.println("Regexp NOT found");
}
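
One caveat with (.*) is worth a sketch: the * quantifier is greedy, so when the input holds more than one nombre element the group stretches to the last closing tag. The reluctant form (.*?), which the text above does not cover, stops at the first one. The class TagExtractor and the two-element input are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class comparing greedy and reluctant quantifiers.
final class TagExtractor {
    // Collects group 1 of every match of the given regex.
    static List<String> extract(String regex, String input) {
        List<String> values = new ArrayList<>();
        Matcher mat = Pattern.compile(regex).matcher(input);
        while (mat.find()) {
            values.add(mat.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "<nombre>Pepito</nombre> y <nombre>Juanita</nombre>";
        // Greedy: a single match that runs up to the last closing tag.
        System.out.println(extract("<nombre>(.*)</nombre>", input));
        // prints [Pepito</nombre> y <nombre>Juanita]
        // Reluctant: one match per element.
        System.out.println(extract("<nombre>(.*?)</nombre>", input));
        // prints [Pepito, Juanita]
    }
}
```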

As seen in the examples, characters such as '<', '>', and '/' do not need to be escaped. Only characters that have a specific meaning within regular expressions need to be preceded by a backslash when they should be matched literally.
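
As a sketch of what escaping changes, the example below searches for the text 1+1 three ways: unescaped (where + is a quantifier), escaped by hand, and escaped with Pattern.quote, a standard method that turns any string into a literal pattern. The class LiteralSearch and its method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical example class showing the effect of escaping.
final class LiteralSearch {
    // Interprets the second argument as a regular expression.
    static boolean containsRegex(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Treats the second argument as plain text, whatever symbols it contains.
    static boolean containsLiteral(String input, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsRegex("1+1=2", "1+1"));    // false: means "one or more 1, then 1"
        System.out.println(containsRegex("11", "1+1"));       // true
        System.out.println(containsRegex("1+1=2", "1\\+1"));  // true: + escaped by hand
        System.out.println(containsLiteral("1+1=2", "1+1"));  // true: escaped by Pattern.quote
    }
}
```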

Finally, consider the following exercise: writing a complete algorithm that processes a text file and replaces a directive for dynamically including content with the content of other files. The directive could look like this:

El siguiente fichero tiene contenido fijo,
y contenido importado con:

#include: ./include.txt

pero una frase final.

Thus, the program should be able to locate any text that matches the #include: <file_path> directive, evaluate it, and substitute that fragment with the content of the referenced file. To simplify the problem, only simple file paths in Unix format will be used, where directories are separated by '/'. Paths starting with a slash are considered absolute, and paths starting with a dot are considered relative to the Java process's working directory.

Given these requirements, the regular expression that can process the directive should have the following form:

regex = "(#include:\\s+([\\w/.-]+\\w+\\.txt))";

The regular expression defines two groups, one nested within the other. The first is used to locate the entire directive so it can be replaced with the file content. The second captures the path and name of the referenced file. Right after the literal #include:, \s+ requires one or more whitespace characters. The second group then opens with a bracketed set, repeated one or more times, that accepts any word character as well as '/', '.', and '-'. Finally comes another word followed by the extension .txt.
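
The behavior of this expression can be checked in isolation with a short sketch. The class IncludeDirective and the method pathOf are hypothetical names, and the sample lines are invented for the test.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class built around the directive regex from the text.
final class IncludeDirective {
    static final Pattern INCLUDE =
            Pattern.compile("(#include:\\s+([\\w/.-]+\\w+\\.txt))");

    // Returns the file path of the first directive in the line, or null if there is none.
    static String pathOf(String line) {
        Matcher mat = INCLUDE.matcher(line);
        return mat.find() ? mat.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(pathOf("#include: ./include.txt"));     // ./include.txt
        System.out.println(pathOf("#include:   /tmp/a-b/c.txt"));  // /tmp/a-b/c.txt
        System.out.println(pathOf("#include:./include.txt"));      // null: \s+ needs whitespace
        System.out.println(pathOf("#include: notes.md"));          // null: only .txt is accepted
    }
}
```

One consequence of the bracketed set being followed by \w+ is that the path needs at least two characters before the extension, so a bare a.txt is not matched while ./a.txt is.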

Therefore, the main loop of the program that accomplishes this is as follows. The same regular expression is used to perform repeated searches in the text, as many as there are #include directives, including those that appear inside other included files:

// main process
String regex = "(#include:\\s+([\\w/.-]+\\w+\\.txt))";
Pattern pat = Pattern.compile(regex);
String input = getFileContent("input.txt");
Matcher mat = pat.matcher(input);
while (mat.find()) {
    String directive = mat.group(1);
    String filepath = mat.group(2);
    String includeContent = getFileContent(filepath);
    input = input.replace(directive, includeContent);
    // restart the search on the updated text
    mat = pat.matcher(input);
}
System.out.println(input);
// end

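import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// helper that reads the content of a file
String getFileContent(String filepath) throws IOException {
    // Files.readAllBytes reads the whole file and closes it; a single
    // FileInputStream.read(...) call is not guaranteed to fill the buffer
    byte[] bytes = Files.readAllBytes(Paths.get(filepath));
    // assumption: the text files are encoded in UTF-8
    return new String(bytes, StandardCharsets.UTF_8);
}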
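
To experiment with the loop without touching the disk, the following sketch replaces getFileContent with a Map from path to content and adds a limit on the number of expansions, since a file that includes itself would otherwise loop forever. The class IncludeExpander, the files map, and the maxIncludes parameter are all assumptions introduced for this illustration, not part of the original program.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical in-memory variant of the main loop, for experimentation only.
final class IncludeExpander {
    private static final Pattern INCLUDE =
            Pattern.compile("(#include:\\s+([\\w/.-]+\\w+\\.txt))");

    // 'files' stands in for the file system: path -> content.
    static String expand(String input, Map<String, String> files, int maxIncludes) {
        Matcher mat = INCLUDE.matcher(input);
        int processed = 0;
        while (mat.find()) {
            if (++processed > maxIncludes) {
                throw new IllegalStateException("Too many includes, possible cycle");
            }
            String directive = mat.group(1);
            String content = files.get(mat.group(2));
            if (content == null) {
                throw new IllegalArgumentException("Unknown file: " + mat.group(2));
            }
            // Literal replacement of every occurrence of this directive.
            input = input.replace(directive, content);
            // Search again from the start, so nested directives are found too.
            mat = INCLUDE.matcher(input);
        }
        return input;
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("./include.txt", "B[#include: ./leaf.txt]");
        files.put("./leaf.txt", "C");
        System.out.println(expand("A #include: ./include.txt Z", files, 10)); // prints A B[C] Z
    }
}
```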

To perform some tests, the initial content of the first text file should have at least the following format in the input.txt file:

# INICIO: fichero input.txt
Esto es una prueba de contenido no variable,
con contenido importado:

#include: ./include.txt

Y finalizado con otro contenido no variable.

## FIN: fichero input.txt

And with the following lines in the include.txt file to validate the correct functioning of the directive:

# INICIO: fichero include.txt
Este contenido es el de include.txt
FIN: fichero include.txt

The output of the process will generate the following lines in the console:

# INICIO: fichero input.txt
Esto es una prueba de contenido no variable,
con contenido importado:

# INICIO: fichero include.txt
Este contenido es el de include.txt
FIN: fichero include.txt

Y finalizado con otro contenido no variable.

## FIN: fichero input.txt

This is just one example of what can be achieved with regular expressions. Finally, note that the Pattern.compile(…) method accepts a second argument with flags that make the regular expression ignore case (Pattern.CASE_INSENSITIVE) or change how line breaks are treated (for example, Pattern.DOTALL lets the dot match them), among other options. You can find more information in the Javadoc page of the Pattern class.

Regular expressions are essential for efficiently searching and replacing text. Many developers underestimate them because they look complex to define, but they are more accessible than they seem. Knowing them simplifies tedious text-manipulation tasks and saves time in many different contexts.

If you want to improve your skills in handling regular expressions and enhance your capabilities in development and programming, explore our website to find specialized resources and training!
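
As a closing sketch of the second argument of Pattern.compile mentioned above, the example below tries the same expressions with and without flags. The class FlagExamples and the helper found are hypothetical names; Pattern.CASE_INSENSITIVE and Pattern.DOTALL are standard constants of the Pattern class, and several flags can be combined with the | operator.

```java
import java.util.regex.Pattern;

// Hypothetical example class showing the effect of compile flags.
final class FlagExamples {
    // Compiles the regex with the given flags and searches the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // 0 means "no flags", the same as the one-argument compile(...).
        System.out.println(found("hola", 0, "HOLA paloma"));                        // false
        System.out.println(found("hola", Pattern.CASE_INSENSITIVE, "HOLA paloma")); // true

        String multiline = "<nombre>Pepito\nGrillo</nombre>";
        System.out.println(found("<nombre>(.*)</nombre>", 0, multiline));              // false
        System.out.println(found("<nombre>(.*)</nombre>", Pattern.DOTALL, multiline)); // true

        int both = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
        System.out.println(found("<nombre>(.*)</nombre>", both, "<NOMBRE>a\nb</NOMBRE>")); // true
    }
}
```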
