I needed to remove many lines from my sitemap.xml file that Google Search Console didn’t crawl correctly (reported as soft 404). The lines to remove looked like this:
<url>
<loc>https://www.mysite.com/download/file/fid/537</loc>
<lastmod>2028-12-02T11:43:33+00:00</lastmod>
</url>
<url>
<loc>https://www.mysite.com/download/file/fid/278</loc>
<lastmod>2028-12-02T11:43:33+00:00</lastmod>
</url>
<url>
<loc>https://www.mysite.com/download/file/fid/1771</loc>
<lastmod>2028-12-02T11:43:33+00:00</lastmod>
</url>
These entries were scattered throughout sitemap.xml, mixed in with other records.
The solution:
Using regex to find all matching patterns and remove them (replace with nothing).
The pattern used to find the lines:
^<url>\R<loc>https://www.mysite.com/download/file/fid/.*?</url>$
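*Yes, I know the unescaped “.” characters in the URL above are not strictly correct (each one matches any character, not only a literal dot; “\.” would be exact), but it works perfectly here. Also keep in mind that “.*?” has to span several lines to reach “</url>”, so the search tool must allow “.” to match line breaks.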
Note the use of two interesting regex matches:
- “?” – non-greedy (lazy) match – makes “.*” stop at the closest occurrence of the ending pattern. Without it, the regex would match from the first starting pattern all the way to the last occurrence of the ending pattern, swallowing every record in between
- “\R” – matches a line break sequence (such as \n, \r\n or \r) – I had a line break right after “<url>”
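For anyone who would rather script this than use an editor's find and replace, here is a minimal Python sketch of the same idea. It is an illustration under a few assumptions, not the exact setup used above: Python's built-in re module has no “\R”, so “\r?\n” stands in for it; re.DOTALL is switched on so that “.” can cross line breaks; the dots in the URL are escaped; and the sample data and the names SITEMAP, LAZY, GREEDY and remove_download_entries are made up for the example.

```python
import re

# Hypothetical sample: one tag per line, no indentation, as in the post.
SITEMAP = """<url>
<loc>https://www.mysite.com/page/1</loc>
<lastmod>2028-12-02T11:43:33+00:00</lastmod>
</url>
<url>
<loc>https://www.mysite.com/download/file/fid/537</loc>
<lastmod>2028-12-02T11:43:33+00:00</lastmod>
</url>
<url>
<loc>https://www.mysite.com/page/2</loc>
<lastmod>2028-12-02T11:43:33+00:00</lastmod>
</url>
<url>
<loc>https://www.mysite.com/download/file/fid/278</loc>
<lastmod>2028-12-02T11:43:33+00:00</lastmod>
</url>
"""

# re.MULTILINE: ^ and $ anchor at line boundaries.
# re.DOTALL: "." also matches newlines, so ".*?" can reach </url>.
FLAGS = re.MULTILINE | re.DOTALL
PREFIX = r"^<url>\r?\n<loc>https://www\.mysite\.com/download/file/fid/"

LAZY = re.compile(PREFIX + r".*?</url>$", FLAGS)   # stops at the nearest </url>
GREEDY = re.compile(PREFIX + r".*</url>$", FLAGS)  # runs on to the last </url>


def remove_download_entries(text: str, pattern: re.Pattern = LAZY) -> str:
    """Replace every match of the pattern with nothing."""
    return pattern.sub("", text)


cleaned = remove_download_entries(SITEMAP)
over_removed = remove_download_entries(SITEMAP, GREEDY)
```

With the lazy pattern, only the two download entries disappear and both page entries survive (with empty lines left behind). With the greedy variant, the match starts at the first download entry and runs to the very last “</url>”, so the page/2 record in between is removed as well.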
With an empty replace string, this removed every entry I wanted gone, but it left empty lines in my sitemap.xml file where the entries had been. To fix this, I ran a second regex find and replace, this time with the following pattern:
^(?:[\t ]*(?:\r?\n|\r))+
and nothing as the replace string. All empty lines gone.
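The empty-line cleanup can be sketched in Python as well. This pattern uses nothing beyond what Python's re module supports, so it carries over unchanged; the only addition is re.MULTILINE so that “^” anchors at the start of every line. The function name remove_blank_lines and the sample string are made up for the example.

```python
import re

# One or more consecutive lines that contain only tabs/spaces (or nothing).
BLANK_LINES = re.compile(r"^(?:[\t ]*(?:\r?\n|\r))+", re.MULTILINE)


def remove_blank_lines(text: str) -> str:
    """Delete runs of empty or whitespace-only lines."""
    return BLANK_LINES.sub("", text)


sample = "<url>\n</url>\n\n\n<url>\n  \n</url>\n"
compacted = remove_blank_lines(sample)
```

Because the group is repeated with “+”, a whole run of consecutive empty lines is removed in a single match rather than one line at a time.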
Good luck with your replacement adventures!