Parsing String Mixed with HTML, Words, Numbers, and Dates
I need to extract unique words and numeric values from a string. At this
point I have a function that strips out everything and returns only
alphanumeric words. I need to also recognize when a word is really a date
or a number and prevent the text from being split apart. How can I do
this?
Here is the splitter function I currently have:
Public Function GetAlphaNumericWords(ByVal InputText As String) As Collection
    ' This function splits the rich text input into unique alpha-numeric only strings
    Dim words() As String
    Dim characters() As Byte
    Dim text As Variant
    Dim i As Long

    Set GetAlphaNumericWords = New Collection
    text = Trim(PlainText(InputText))
    If Len(text) > 0 Then
        ' Replace any non alphanumeric characters with a space
        characters = StrConv(text, vbFromUnicode)
        For i = LBound(characters) To UBound(characters)
            If Not (Chr(characters(i)) Like "[A-Za-z0-9 ]") Then
                characters(i) = 32 ' Space character
            End If
        Next
        ' Merge the byte array back to a string and then split on spaces
        words = VBA.Split(StrConv(characters, vbUnicode))
        ' Add each unique word to the output collection; a duplicate key
        ' raises an error, which is cleared so the duplicate is skipped
        On Error Resume Next
        For Each text In words
            If (text <> vbNullString) Then GetAlphaNumericWords.Add CStr(text), CStr(text)
            If Err Then Err.Clear
        Next
    End If
End Function
An example of the output this function currently returns:
GetAlphaNumericWords("Hello World! Test 1. 123.45 8/22/2013 August 22, 2013")
Hello
World
Test
1
123
45
8
22
2013
August
What I really want is:
Hello
World
Test
1
123.45
8/22/2013
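One possible direction, offered as a rough sketch rather than a tested solution: instead of blanking out punctuation and splitting on spaces, match the tokens directly with a regular expression whose alternatives try a numeric date first, then a decimal number, then a plain alphanumeric word. The function name GetTokens and the pattern are assumptions made for illustration, not part of the original code. The sketch uses the late-bound VBScript.RegExp object and the same PlainText call as the function above, and it only recognizes slash-separated numeric dates. A long-form date such as "August 22, 2013" would still come out as separate words unless another alternative is added to the pattern.

```vba
Public Function GetTokens(ByVal InputText As String) As Collection
    ' Hypothetical helper (not from the original post): returns unique tokens,
    ' keeping slash-separated dates and decimal numbers in one piece.
    Dim re As Object        ' VBScript.RegExp, late bound
    Dim matches As Object   ' MatchCollection
    Dim m As Object         ' Match
    Dim text As String

    Set GetTokens = New Collection
    text = Trim(PlainText(InputText))
    If Len(text) = 0 Then Exit Function

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True        ' find every match, not only the first
    ' Alternatives are tried left to right at each position, so the more
    ' specific shapes (date, decimal) come before the generic word.
    ' This pattern is an assumption; adjust it to the date formats expected.
    re.Pattern = "\d{1,2}/\d{1,2}/\d{2,4}|\d+\.\d+|[A-Za-z0-9]+"

    Set matches = re.Execute(text)

    ' Same uniqueness trick as the original: a duplicate key raises an
    ' error, which is cleared so the duplicate is skipped.
    On Error Resume Next
    For Each m In matches
        GetTokens.Add CStr(m.Value), CStr(m.Value)
        If Err Then Err.Clear
    Next
    On Error GoTo 0
End Function
```

With this pattern, a trailing period as in "1." does not satisfy the decimal alternative (it requires digits on both sides of the dot), so "1" is still picked up as a plain word, while "123.45" and "8/22/2013" are each kept as a single token.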