I am creating friendly URLs by doing a simple replacement of spaces on a subject line with a hyphen so the URL ends up looking like:
www.example.com/topic/I-am-asking-for-advice-on-friendly-urls
Then I simply convert back by replacing the hyphens with spaces and search my database for the subject line.
However, what if the subject line already contains a hyphen? For instance, if the subject was Dogs - What do you think about them, this would become:
www.example.com/topic/Dogs---What-do-you-think-about-them
Converting back would then turn all three hyphens after Dogs into three spaces, so the subject would not be found in the database.
Is my approach all wrong? I have already thought about using a placeholder for existing hyphens before converting spaces to hyphens, but this kind of messes up the URL. Or is that the right approach?
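To make the failure concrete, here is a minimal PHP sketch of the round trip described above. The helper names toSlug and fromSlug are made up for illustration; they are not from any library.

```php
<?php
// Hypothetical helpers for the naive approach from the question.
function toSlug(string $subject): string {
    // Spaces become hyphens; existing hyphens are left untouched.
    return str_replace(' ', '-', $subject);
}

function fromSlug(string $slug): string {
    // Every hyphen becomes a space, including ones that were original.
    return str_replace('-', ' ', $slug);
}

$subject  = 'Dogs - What do you think about them';
$slug     = toSlug($subject);   // Dogs---What-do-you-think-about-them
$restored = fromSlug($slug);    // "Dogs" followed by three spaces, then "What ..."

// The restored text no longer matches the stored subject.
var_dump($restored === $subject); // bool(false)
```

The conversion loses information (space and hyphen map to the same character), so no reverse replacement can recover the original reliably.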
So why not do a first replacement to handle “multi-hyphens”? You could replace “---” with “-” before replacing “-” with " ".
Because what if the original subject contains “---”, i.e. that is what is stored in the database?
Why not store the modified URL in your database and search against that instead of trying to convert back? The advantage of that approach is that you can have multiple URLs in a related table pointing to the same article, which is handy if you decide to modify the subject later.
Yes, that sounds like the way to go. That way, even if the URL has been indexed by Google or others and the subject gets changed, as you say, the old URL still points back to the original post, i.e. the link is not broken.
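A rough PHP sketch of that idea, using arrays as in-memory stand-ins for two hypothetical tables, topics(id, subject) and topic_slugs(slug, topic_id). The table, function, and variable names are assumptions for illustration; a real version would run the equivalent queries against the database.

```php
<?php
// In-memory stand-ins for the two hypothetical tables.
$topics     = [1 => 'Dogs - What do you think about them'];
$topicSlugs = ['dogs-what-do-you-think-about-them' => 1];

// Renaming a topic adds a new slug row; the old row is kept,
// so URLs that were already indexed keep working.
function renameTopic(array &$topics, array &$topicSlugs, int $id, string $subject, string $slug): void {
    $topics[$id]       = $subject;
    $topicSlugs[$slug] = $id;
}

// Look a topic up by whatever slug appeared in the URL; null if unknown.
function findTopicBySlug(array $topics, array $topicSlugs, string $slug): ?string {
    $id = $topicSlugs[$slug] ?? null;
    return $id === null ? null : $topics[$id];
}

renameTopic($topics, $topicSlugs, 1, 'Dogs: your opinions', 'dogs-your-opinions');

// Both the old and the new slug resolve to the same, renamed topic.
echo findTopicBySlug($topics, $topicSlugs, 'dogs-what-do-you-think-about-them'), "\n";
echo findTopicBySlug($topics, $topicSlugs, 'dogs-your-opinions'), "\n";
```

Because the lookup goes from slug to topic id, nothing ever has to be converted back into a subject line.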
My suggestion is to convert the subject to ASCII encoding, then use EncodeURLComponent on it. Next, replace the URL codes with a dash. Finally, strip any leading or trailing dashes, and it will look pretty clean.
Off the top of my head:
dim modifiedSubject as string = subject.ConvertEncoding( Encodings.ASCII )
modifiedSubject = EncodeURLComponent( modifiedSubject )
// Convert one or more adjacent codes to a single dash
dim rx as new RegEx
rx.SearchPattern = "(?mi-Us)(%[[:xdigit:]]{2}|-)+"
rx.ReplacementPattern = "-"
dim rxOptions as RegExOptions = rx.Options
rxOptions.ReplaceAllMatches = true
modifiedSubject = rx.Replace( modifiedSubject )
// Strip leading and trailing dashes
rx.SearchPattern = "(?mi-Us)^-*(.*[^\-\r\n])-*$"
rx.ReplacementPattern = "$1"
modifiedSubject = rx.Replace( modifiedSubject )
modifiedSubject = modifiedSubject.Lowercase
This would take a subject like “-The café is not here?-” and convert it to “the-cafe-is-not-here”.
(Edit to correct the first pattern.)
I would also store the URL tag in the database.
But when processing, I would also cut it to maybe 50 characters and make sure it’s unique.
Maybe even add a number on the end to make it unique.
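Something along these lines, as a sketch only: uniqueSlug is a made-up helper, and the $existing array stands in for a "does this slug already exist?" check against the database. It assumes the slug is already plain ASCII, since substr counts bytes.

```php
<?php
// Hypothetical helper: cap the slug at $maxLength characters and append
// -2, -3, ... until the result is not in $existing.
function uniqueSlug(string $slug, array $existing, int $maxLength = 50): string {
    // Cut to length, then drop any dash left dangling at the cut point.
    $base      = rtrim(substr($slug, 0, $maxLength), '-');
    $candidate = $base;

    for ($n = 2; in_array($candidate, $existing, true); $n++) {
        $suffix = '-' . $n;
        // Shorten the base if needed so base plus suffix stays within the limit.
        $candidate = rtrim(substr($base, 0, $maxLength - strlen($suffix)), '-') . $suffix;
    }

    return $candidate;
}

$existing = ['the-cafe-is-not-here', 'the-cafe-is-not-here-2'];

echo uniqueSlug('the-cafe-is-not-here', $existing), "\n";                  // the-cafe-is-not-here-3
echo uniqueSlug('dogs-what-do-you-think-about-them', $existing, 10), "\n"; // dogs-what
```

A unique index on the slug column would still be worth having, since two requests could pick the same candidate at the same time.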
Thanks Kem (Cheekily this is for PHP hence off topic channel although I have a Xojo app to access the same db).
Amazingly, look how your Xojo code converts to 4 lines in PHP. The Xojo regex operations are so long-winded.
$modifiedSubject = mb_convert_encoding($subject,'ASCII');
$modifiedSubject = htmlspecialchars($modifiedSubject);
$modifiedSubject = preg_replace('(%[[:xdigit:]]{2}|-)+','-',$modifiedSubject );
$modifiedSubject = strtolower(trim($modifiedSubject,'-'));
You didn’t include the pattern to strip leading and trailing dashes, unless that was by design.
Last line:
$modifiedSubject = strtolower(trim($modifiedSubject,'-'));
PHP trim allows you to specify what you are trimming rather than Xojo’s which only trims whitespace.
Ah, I didn’t remember that, if I ever knew.
Kem, I was having problems with your regex when converted to PHP as above. Being a novice with regex, I tried trial and error and came up with the following. Does it look acceptable? It seems to work in all scenarios:
$modifiedSubject = preg_replace("(( +|-+)+)","-",$modifiedSubject );
It has too many repeaters, but that’s not the big issue.
The reason it’s needed is that htmlspecialchars does not replace spaces. In fact, it doesn’t replace much, leaving question marks and “%” alone too. You want urlencode instead, but that will replace the spaces with “+”.
The other problem is that converting the encoding to ASCII does not translate the characters as it does in Xojo, so “é” becomes “?” instead of “e”. I guess you could live with that.
Here’s what I propose:
$modifiedSubject = mb_convert_encoding($subject, 'ASCII');
$modifiedSubject = urlencode($modifiedSubject);
$modifiedSubject = preg_replace('((%[[:xdigit:]]{2}|-|\\+)+)','-',$modifiedSubject );
$modifiedSubject = strtolower(trim($modifiedSubject,'-'));
You inspired me to add Trim, LTrim, and RTrim to my M_String module. ReplaceRegEx was already there.
http://www.mactechnologies.com/index.php?page=downloads#m_string
Thanks Kem that has done the job perfectly. Thank you very much for spending time on this.
[quote=88469:@Kem Tekinay]You inspired me to add Trim, LTrim, and RTrim to my M_String module. ReplaceRegEx was already there.
http://www.mactechnologies.com/index.php?page=downloads#m_string[/quote]
Have you added the option of what is being trimmed, i.e. trim($subject, "x") where x is the character to trim?
Yes, I emulated PHP as closely as I could.