This tutorial walks through several examples to show how rule expressions are composed.

Scenario

Imagine we need to migrate a company portfolio website which contains a collection of company profile pages.

One of the source pages looks like this:

Screenshot of a source page with a logo, a block of copy text, and contact information.

The following is the source HTML of this page. (Note: markup that is not relevant to the migration has been omitted.)

Three screenshots of the source HTML.

We have developed a new content type called “Company” in our new Drupal site, which will be used for storing company profiles.

The “Company” content type has the following fields:

  • Name: A text field that stores the company name.
  • Summary: A rich HTML field that stores the summary paragraphs of the company.
  • Logo: A file field that stores the logo image of the company.
  • Established: A date field that stores the date the company was established.
  • Website: A link field that stores the website URL of the company.
  • Email: A link field that stores the email address of the company.
  • Address: A text field that stores the company address.
  • Phone: A text field that stores the phone number of the company.
  • Fax: A text field that stores the fax number of the company.

Company Name

To get the company name, we need to get the text inside the <h1> of the page. We can see from the source HTML that the whole page content is inside a <div> with the id of main. The <h1> element is wrapped by a <div> with class col s10. And the parent of this <div> is another <div> with class row.

Therefore, we can identify the <h1> element by following this chain:

<div id="main">
<div class="row">
<div class="col s10">
<h1>

Now we can start to create a rule called “Company name rule” and compose the rule expression to extract the company name.

We’ll first type the block placeholder. In this example, we’ll use a JQUERY block.

{{JQUERY}}

Because all we need is the text inside the <h1> rather than the HTML code, we’ll need to specify the value type to be text.

{{JQUERY[value=text]:}}

Then we need to use the CSS selector to navigate to the <h1> we want. First we will select the <div> containing the page content.

{{JQUERY[value=text]:#main}}

Then we need to select the <div> with class row in its descendants.

{{JQUERY[value=text]:#main .row}}

Now we need to select the <div> with class s10 in its descendants.

{{JQUERY[value=text]:#main .row .s10}}

Finally, the <h1> is a child of the currently selected <div>, so we simply select the <h1>.

{{JQUERY[value=text]:#main .row .s10 h1}}

The rule expression is now complete. You can assign this rule as the default rule of the Name field in the content type.

By using XPATH blocks, you can achieve the same result. Here’s an equivalent rule expression in XPATH:

{{XPATH[value=text]://div[@id="main"]/div[@class="row"]/div[@class="col s10"]/h1}}
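
To make the selection chain concrete, here is a minimal Python sketch that walks the same path outside the Workbench. It is an illustration only, not how PerformX Content Workbench evaluates rules. The SAMPLE_PAGE markup and the extract_company_name helper are hypothetical stand-ins (the real source HTML is only shown as screenshots), and the standard-library ElementTree parser requires well-formed markup, which real pages may not be.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the source page.
SAMPLE_PAGE = """
<html><body>
  <div id="main">
    <div class="row">
      <div class="col s2"><img src="/images/logo.png" alt="Logo"/></div>
      <div class="col s10"><h1>Example Company</h1></div>
    </div>
  </div>
</body></html>
"""


def extract_company_name(page: str) -> str:
    """Follow div#main > div.row > div.col.s10 > h1 and return its text."""
    root = ET.fromstring(page.strip())
    h1 = root.find(
        ".//div[@id='main']/div[@class='row']/div[@class='col s10']/h1"
    )
    if h1 is None:
        raise ValueError("no <h1> found along the expected chain")
    # Joining itertext() plays the role of value=text: text only, no tags.
    return "".join(h1.itertext()).strip()


company_name = extract_company_name(SAMPLE_PAGE)
print(company_name)
```

Like the XPATH rule above, the path matches the full class attribute value ("col s10"), whereas the JQUERY selector .s10 matches any element that has s10 among its classes.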

Company Summary

We’ve decided to use the paragraphs between the company name and the contact section of the source page as the company summary in our new content type.

Also, as the first paragraph is just a legacy company ID in the source site, we will not need this information in our new site.

Firstly, we can see that the summary content is inside a <div> with class col s12, which is inside another <div> with class row. The problem here is we cannot simply use the class value in the hierarchy to target our desired element. As you can see, the <div> with id main contains two <div> elements with class row. And our target <div> with class row also contains two <div> elements with class col s12.

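Therefore, we need to use the :nth-of-type selector to select the correct element. Note that :nth-of-type counts sibling elements of the same tag name (here <div>), not of the same class, so the position refers to the element’s place among its <div> siblings.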

Let’s start from the main div element.

{{JQUERY:#main}}

Now we need the second <div> with class row inside the main div.

{{JQUERY:#main .row:nth-of-type(2)}}

Then we need the first <div> with class col s12.

{{JQUERY:#main .row:nth-of-type(2) .col:nth-of-type(1)}}

Now we have reached our desired element, and the rule expression above will return the inner HTML of this element.

There is still another problem we need to deal with. We need to remove the legacy company ID from the company summary we extracted above. First we need to identify the <p> containing the company ID.

{{JQUERY:#main .row:nth-of-type(2) .col:nth-of-type(1) .company-id}}

Basically, the above rule expression will select the <p> with class company-id from the summary content we’ve already extracted.

Now we need to combine these two blocks with the subtraction operation and complete our final rule expression.

{{JQUERY:#main .row:nth-of-type(2) .col:nth-of-type(1)}} - {{JQUERY:#main .row:nth-of-type(2) .col:nth-of-type(1) .company-id}}

The following is an equivalent rule expression in XPATH.

{{XPATH://div[@id="main"]/div[@class="row"][2]/div[@class="col s12"][1]}} - {{XPATH://div[@id="main"]/div[@class="row"][2]/div[@class="col s12"][1]/p[@class="company-id"]}}
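
The subtraction can be pictured as "select the container, then drop the nodes the second block selects". The following Python sketch illustrates that idea under stated assumptions: SAMPLE_MAIN is a hypothetical fragment shaped like the description above, extract_summary is an illustrative helper, and the positional steps are done with list indexing rather than by the Workbench itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two div.row children, the second holding two
# div.col.s12 children, the first of which starts with the legacy ID.
SAMPLE_MAIN = (
    '<div id="main">'
    '<div class="row"><div class="col s10"><h1>Example Company</h1></div></div>'
    '<div class="row">'
    '<div class="col s12">'
    '<p class="company-id">ID: 0000</p>'
    '<p>First summary paragraph.</p>'
    '<p>Second summary paragraph.</p>'
    '</div>'
    '<div class="col s12"><h2>Contact</h2></div>'
    '</div>'
    '</div>'
)


def extract_summary(main_html: str) -> str:
    main = ET.fromstring(main_html)
    # div[@class="row"][2]: the second child <div> whose class is "row".
    second_row = main.findall("div[@class='row']")[1]
    # div[@class="col s12"][1]: the first matching column in that row.
    summary = second_row.findall("div[@class='col s12']")[0]
    # Subtraction: remove every node the second block would select.
    for legacy_id in summary.findall("p[@class='company-id']"):
        summary.remove(legacy_id)
    # Inner HTML: serialise the remaining children, not the wrapper itself.
    return "".join(ET.tostring(child, encoding="unicode") for child in summary)


summary_html = extract_summary(SAMPLE_MAIN)
print(summary_html)
```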

Logo

The Logo field in our new content type is a file field. We only need to pass the source URL of the image to the field. During the migration, PerformX Content Workbench will resolve this URL to the right file value.

To get the logo URL, firstly we need to locate the <img> element.

{{JQUERY:#main .row:nth-of-type(1) .s2 img}}

Because we’re only interested in the source URL of the image, we’ll set the value type to be the src attribute.

{{JQUERY[value=src]:#main .row:nth-of-type(1) .s2 img}}

The equivalent rule expression in XPATH is:

{{XPATH[value=src]://div[@id="main"]/div[@class="row"][1]/div[@class="col s2"]/img}}
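
As a rough illustration of what value=src returns, the sketch below reads the src attribute instead of the element content. The SAMPLE_MAIN fragment, the image path, and the extract_logo_src helper are all hypothetical; how the Workbench later resolves the URL into a file value is not modelled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the first row of the source page.
SAMPLE_MAIN = (
    '<div id="main">'
    '<div class="row">'
    '<div class="col s2"><img src="/images/logo.png" alt="Logo"/></div>'
    '<div class="col s10"><h1>Example Company</h1></div>'
    '</div>'
    '</div>'
)


def extract_logo_src(main_html):
    main = ET.fromstring(main_html)
    # First div.row under #main, then the <img> inside its div.col.s2.
    first_row = main.findall("div[@class='row']")[0]
    img = first_row.find("div[@class='col s2']/img")
    # value=src: return the attribute value rather than text or HTML.
    return img.get("src") if img is not None else None


logo_src = extract_logo_src(SAMPLE_MAIN)
print(logo_src)
```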

Established Date

In our example, the established date is in the middle of a paragraph along with other text. The established date we’re interested in is:

While technologies and platforms have evolved considerably since we were established in 1994

We can’t use DOM blocks here because the information doesn’t belong to any individual DOM element. This is where REGEX blocks come in. To locate the date, we’ll use the following regular expression:

/established in \d+/i

Because we only need the digits, we wrap them in a capture group and use the match setting to return that group. Here’s the final rule expression:

{{REGEX[match=1]:/established in (\d+)/i}}

Note: because we only have the year here, PerformX Content Workbench will parse this year into a valid date format during the migration.
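
The same pattern can be tried in Python to see what match group 1 contains. This is a sketch using Python’s re module, whose regex dialect may differ in details from the one the Workbench uses; extract_established_year is an illustrative helper name.

```python
import re

# The sentence quoted above from the source page.
PARAGRAPH = (
    "While technologies and platforms have evolved considerably "
    "since we were established in 1994"
)

# Same pattern as the rule expression; re.IGNORECASE mirrors the /i flag.
ESTABLISHED = re.compile(r"established in (\d+)", re.IGNORECASE)


def extract_established_year(text):
    match = ESTABLISHED.search(text)
    # group(1) corresponds to match=1: only the digits inside the parentheses.
    return match.group(1) if match else None


year = extract_established_year(PARAGRAPH)
print(year)
```

Group 0 would be the whole match ("established in 1994"), which is why the rule expression asks for group 1.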

Website and Email

We can see that the website URL is inside a table under the “Contact” section. Here it is easier to select elements by the text they contain than by their position or class, so XPATH is the better fit.

Firstly, let’s find the heading element of section “Contact”.

{{XPATH://h2[text()="Contact"]}}

Because the <table> directly follows the <h2>, we’ll select the <table> from the following siblings of the <h2>.

{{XPATH://h2[text()="Contact"]/following-sibling::table}}

Now inside the <table>, we need to locate the header cell of the “Website” information.

{{XPATH://h2[text()="Contact"]/following-sibling::table//th[text()="Website"]}}

Then the next <td> sibling is the element we need.

{{XPATH://h2[text()="Contact"]/following-sibling::table//th[text()="Website"]/following-sibling::td}}

Finally, we need the <a> inside the <td>. Also we only need the href attribute value of the <a>.

{{XPATH[value=href]://h2[text()="Contact"]/following-sibling::table//th[text()="Website"]/following-sibling::td/a}}

In the same manner, we can get the email URL.

{{XPATH[value=href]://h2[text()="Contact"]/following-sibling::table//th[text()="Email"]/following-sibling::td/a}}
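
The step-by-step navigation above (heading, following table, header cell, following data cell, link) can be sketched in Python as follows. Python’s standard-library ElementTree does not support the following-sibling axis, so this sketch swaps in two small hand-written helpers; they, the SAMPLE_MAIN fragment, and the example URLs are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the "Contact" section.
SAMPLE_MAIN = (
    '<div id="main">'
    '<h2>Contact</h2>'
    '<table>'
    '<tr><th>Website</th>'
    '<td><a href="https://www.example.com">www.example.com</a></td></tr>'
    '<tr><th>Email</th>'
    '<td><a href="mailto:info@example.com">info@example.com</a></td></tr>'
    '</table>'
    '</div>'
)


def first_with_text(scope, tag, text):
    """First <tag> in scope whose text equals `text`, like //tag[text()="..."]."""
    return next(e for e in scope.iter(tag) if e.text == text)


def first_following_sibling(root, elem, tag):
    """First later sibling of elem with the given tag.

    Simplification: following-sibling::tag selects all later siblings;
    this helper returns only the first one.
    """
    parent = next(p for p in root.iter() if any(c is elem for c in p))
    siblings = list(parent)
    position = next(i for i, c in enumerate(siblings) if c is elem)
    return next(c for c in siblings[position + 1:] if c.tag == tag)


def contact_link(main_html, label):
    root = ET.fromstring(main_html)
    heading = first_with_text(root, "h2", "Contact")
    table = first_following_sibling(root, heading, "table")
    header_cell = first_with_text(table, "th", label)
    data_cell = first_following_sibling(root, header_cell, "td")
    # value=href: return the link target, not the link text.
    return data_cell.find("a").get("href")


website = contact_link(SAMPLE_MAIN, "Website")
email = contact_link(SAMPLE_MAIN, "Email")
print(website, email)
```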

Address

Because our new content type doesn’t have a dedicated field for country, we need to merge the country name with the address.

We’ll start with the address text:

{{XPATH[value=text]://h2[text()="Contact"]/following-sibling::table//th[text()="Address"]/following-sibling::td}}

Then we need to get the country name:

{{XPATH://h2[text()="Contact"]/following-sibling::table//th[text()="Country"]/following-sibling::td}}

Finally, we combine them by using the addition operation.

{{XPATH[value=text]://h2[text()="Contact"]/following-sibling::table//th[text()="Address"]/following-sibling::td}}+{{XPATH://h2[text()="Contact"]/following-sibling::table//th[text()="Country"]/following-sibling::td}}

Please note that the second block doesn’t have a value setting. This is because when more than one block is used in a DOM operation, only the first block can have the value setting.

Phone and Fax

As with the other pieces of contact information, we’ll get the phone number by using the following rule expression.

{{XPATH[value=text]://h2[text()="Contact"]/following-sibling::table//th[text()="Phone"]/following-sibling::td}}

Assume we now have a new requirement that phone numbers must have a country code. All we need to do is create a LITERAL block and concatenate it with the phone number we extracted.

{{LITERAL:+61}} & {{XPATH[value=text]://h2[text()="Contact"]/following-sibling::table//th[text()="Phone"]/following-sibling::td}}

Likewise, we can get the fax number with the following rule expression.

{{LITERAL:+61}} & {{XPATH[value=text]://h2[text()="Contact"]/following-sibling::table//th[text()="Fax"]/following-sibling::td}}
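
To show what the concatenation produces, here is a small Python sketch. It assumes that & joins the two values directly with no separator, which is an assumption about the operator rather than documented behaviour. The SAMPLE_TABLE fragment, its placeholder numbers, and the cell_text helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical contact table with placeholder numbers.
SAMPLE_TABLE = (
    '<table>'
    '<tr><th>Phone</th><td>0000 0000</td></tr>'
    '<tr><th>Fax</th><td>0000 0001</td></tr>'
    '</table>'
)

COUNTRY_CODE = "+61"  # stands in for the LITERAL block


def cell_text(table, label):
    """Text of the <td> that sits in the same row as <th>label</th>."""
    for row in table.iter("tr"):
        header, data = row.find("th"), row.find("td")
        if header is not None and data is not None and header.text == label:
            return "".join(data.itertext())
    raise KeyError(label)


table = ET.fromstring(SAMPLE_TABLE)
# Assumed meaning of "&": plain string concatenation, literal first.
phone = COUNTRY_CODE + cell_text(table, "Phone")
fax = COUNTRY_CODE + cell_text(table, "Fax")
print(phone, fax)
```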